Biopython - Bioinformatics Library
Industry-standard Python library for computational biology and bioinformatics workflows.
When to Use
- Parsing and manipulating biological sequences (DNA, RNA, protein)
- Reading and writing sequence files (FASTA, FASTQ, GenBank, EMBL, SwissProt)
- Performing sequence alignments (pairwise and multiple)
- Running and parsing BLAST searches
- Analyzing protein structures from PDB files
- Calculating sequence statistics and molecular weights
- Translating DNA to protein sequences
- Finding restriction enzyme sites
- Building and analyzing phylogenetic trees
- Accessing NCBI databases (Entrez, PubMed)
- Computing sequence motifs and patterns
- Analyzing next-generation sequencing data
Reference Documentation
Official docs: https://biopython.org/
Tutorial: https://biopython.org/DIST/docs/tutorial/Tutorial.html
Search patterns: SeqIO.parse, Seq, AlignIO, NCBIWWW.qblast, PDBParser
Core Principles
Use Biopython For
| Task | Module | Example |
|---|---|---|
| Create sequences | Seq | Seq("ATCG") |
| Read sequence files | SeqIO | SeqIO.parse("file.fasta", "fasta") |
| Pairwise alignment | pairwise2 | pairwise2.align.globalxx(s1, s2) |
| Multiple alignment | AlignIO | AlignIO.read("align.fasta", "fasta") |
| BLAST searches | NCBIWWW | NCBIWWW.qblast("blastn", "nr", seq) |
| PDB structures | PDB.PDBParser | PDBParser().get_structure() |
| Phylogenetic trees | Phylo | Phylo.read("tree.xml", "phyloxml") |
| NCBI databases | Entrez | Entrez.esearch(db="nucleotide") |
Do NOT Use For
- High-performance genome assembly (use SPAdes, Canu)
- Variant calling from BAM files (use GATK, BCFtools)
- RNA-seq differential expression (use DESeq2, edgeR)
- Protein structure prediction (use AlphaFold, RoseTTAFold)
- Large-scale metagenomics (use specialized pipelines)
Quick Reference
Installation
# pip (recommended)
pip install biopython
# With optional dependencies
pip install biopython[extra]
# conda
conda install -c conda-forge biopython
# Development version
pip install git+https://github.com/biopython/biopython.git
Standard Imports
# Core sequence handling
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO, AlignIO
# Sequence alignment
from Bio import pairwise2
from Bio.Align import MultipleSeqAlignment
from Bio.Align.Applications import ClustalwCommandline
# BLAST
from Bio.Blast import NCBIWWW, NCBIXML
# Structure analysis
from Bio.PDB import PDBParser, PDBIO, Select
from Bio.PDB.DSSP import DSSP
from Bio.PDB.Polypeptide import PPBuilder
# Phylogenetics
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
# NCBI Entrez
from Bio import Entrez
# Additional tools
from Bio.SeqUtils import GC, molecular_weight
from Bio.Restriction import *
Basic Pattern - Sequence Creation
from Bio.Seq import Seq
# Create DNA sequence
dna = Seq("ATGGCCATTGTAATGGGCCGC")
# Transcribe to RNA
rna = dna.transcribe()
# Translate to protein
protein = dna.translate()
# Reverse complement
rev_comp = dna.reverse_complement()
print(f"DNA: {dna}")
print(f"RNA: {rna}")
print(f"Protein: {protein}")
print(f"RevComp: {rev_comp}")
Basic Pattern - File Reading
from Bio import SeqIO
# Read FASTA file (iterator - memory efficient)
for record in SeqIO.parse("sequences.fasta", "fasta"):
print(f"ID: {record.id}")
print(f"Length: {len(record.seq)}")
print(f"Sequence: {record.seq[:50]}...")
# Read single sequence
record = SeqIO.read("single.fasta", "fasta")
Basic Pattern - Sequence Alignment
from Bio import pairwise2
from Bio.Seq import Seq
seq1 = Seq("ACCGT")
seq2 = Seq("ACGT")
# Global alignment
alignments = pairwise2.align.globalxx(seq1, seq2)
# Print best alignment
best = alignments[0]
print(pairwise2.format_alignment(*best))
Critical Rules
✅ DO
- Use iterators for large files -
SeqIO.parse()notSeqIO.to_dict() - Validate sequences - Check alphabet and length before operations
- Handle file formats correctly - Match parser to actual file format
- Check alignment quality - Verify gaps and identity percentages
- Use appropriate genetic code - Specify table for translation
- Close file handles - Use context managers or explicit close
- Filter FASTQ by quality - Don't trust all reads equally
- Verify PDB structure - Check for missing atoms/residues
- Set Entrez email - Required for NCBI API usage
- Handle translation frames - Consider all three reading frames
❌ DON'T
- Load entire FASTQ into memory - Use streaming
- Ignore sequence type - DNA, RNA, and protein need different handling
- Skip quality filtering - FASTQ quality scores matter
- Use wrong genetic code - Different organisms use different tables
- Forget stop codons - Handle them explicitly in translation
- Mix alphabets - Don't compare DNA with protein sequences
- Trust all BLAST hits - Filter by e-value and identity
- Ignore chain breaks - PDB structures may have gaps
- Hammer NCBI servers - Use rate limiting (3 requests/sec without API key)
- Compare raw sequences - Align first, then compare
Anti-Patterns (NEVER)
# ❌ BAD: Loading entire file into memory
records = list(SeqIO.parse("huge.fastq", "fastq"))
for record in records:
process(record) # OOM for large files!
# ✅ GOOD: Stream processing
for record in SeqIO.parse("huge.fastq", "fastq"):
if min(record.letter_annotations["phred_quality"]) >= 20:
process(record)
# ❌ BAD: Translating without checking frame
protein = dna_seq.translate()
# May include stop codons or wrong frame!
# ✅ GOOD: Specify frame and stop codon handling
protein = dna_seq.translate(to_stop=True, table=1)
# ❌ BAD: No email for Entrez
Entrez.esearch(db="nucleotide", term="human")
# NCBI will block you!
# ✅ GOOD: Always set email
Entrez.email = "your.email@example.com"
handle = Entrez.esearch(db="nucleotide", term="human")
# ❌ BAD: Comparing sequences directly
if str(seq1) == str(seq2):
print("Same") # Ignores gaps, misalignments!
# ✅ GOOD: Align then compare
from Bio import pairwise2
alignments = pairwise2.align.globalxx(seq1, seq2)
score = alignments[0][2] # Alignment score
# ❌ BAD: Wrong file format
records = SeqIO.parse("file.gb", "fasta") # Wrong!
# ✅ GOOD: Match format to file
records = SeqIO.parse("file.gb", "genbank")
Sequence Objects
Creating and Manipulating Sequences
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
# DNA sequence
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
# Basic operations
length = len(dna)
gc_count = dna.count("G") + dna.count("C")
gc_content = (gc_count / length) * 100
# Find subsequence
position = dna.find("ATG") # Returns index or -1
# Slicing
first_codon = dna[:3]
last_10 = dna[-10:]
print(f"Length: {length}")
print(f"GC content: {gc_content:.2f}%")
print(f"ATG at position: {position}")
Transcription and Translation
from Bio.Seq import Seq
# DNA to RNA transcription
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
rna = dna.transcribe()
# RNA to protein translation
protein = rna.translate()
# DNA to protein (direct)
protein_direct = dna.translate()
# Back-transcription
dna_back = rna.back_transcribe()
# Reverse complement
rev_comp = dna.reverse_complement()
print(f"DNA: {dna}")
print(f"RNA: {rna}")
print(f"Protein: {protein}")
print(f"Rev Comp: {rev_comp}")
Translation with Different Genetic Codes
from Bio.Seq import Seq
dna = Seq("ATGGGCTAG")
# Standard genetic code (table 1)
protein_standard = dna.translate(table=1)
# Mitochondrial genetic code (table 2)
protein_mito = dna.translate(table=2)
# Stop at first stop codon
protein_to_stop = dna.translate(to_stop=True)
# All three reading frames
for i in range(3):
frame = dna[i:]
protein = frame.translate(to_stop=True)
print(f"Frame {i}: {protein}")
Sequence Utilities
from Bio.Seq import Seq
from Bio.SeqUtils import GC, molecular_weight
seq = Seq("ATGGCCATTGTAATGGGCCGC")
# GC content
gc = GC(seq)
# Molecular weight
mw = molecular_weight(seq, seq_type='DNA')
# GC skew (for origin of replication analysis)
from Bio.SeqUtils import GC_skew
skew = GC_skew(seq, window=100)
print(f"GC content: {gc:.2f}%")
print(f"Molecular weight: {mw:.2f} Da")
SeqRecord Objects
Creating SeqRecord Objects
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# Create with minimal info
seq = Seq("ATGGCCATTG")
record = SeqRecord(seq, id="seq1", description="My sequence")
# With full annotation
record = SeqRecord(
Seq("ATGGCCATTG"),
id="gene1",
name="GeneName",
description="Gene encoding protein X",
annotations={
"organism": "Homo sapiens",
"molecule_type": "DNA"
}
)
# Add features
from Bio.SeqFeature import SeqFeature, FeatureLocation
feature = SeqFeature(
FeatureLocation(0, 10),
type="CDS",
qualifiers={"product": "hypothetical protein"}
)
record.features.append(feature)
print(record)
SeqRecord Attributes
from Bio import SeqIO
record = SeqIO.read("sequence.gb", "genbank")
# Access components
sequence = record.seq
identifier = record.id
name = record.name
description = record.description
# Annotations (dictionary)
organism = record.annotations.get("organism", "Unknown")
# Features (list)
for feature in record.features:
if feature.type == "CDS":
product = feature.qualifiers.get("product", ["Unknown"])[0]
location = feature.location
print(f"{product} at {location}")
# Letter annotations (quality scores for FASTQ)
if "phred_quality" in record.letter_annotations:
avg_quality = sum(record.letter_annotations["phred_quality"]) / len(record)
print(f"Average quality: {avg_quality:.2f}")
File Input/Output
Reading FASTA Files
from Bio import SeqIO
# Single sequence
record = SeqIO.read("single.fasta", "fasta")
print(f"{record.id}: {record.seq}")
# Multiple sequences (iterator)
for record in SeqIO.parse("multiple.fasta", "fasta"):
print(f"{record.id}: {len(record.seq)} bp")
# Load into dictionary (use sparingly!)
sequences = SeqIO.to_dict(SeqIO.parse("sequences.fasta", "fasta"))
seq1 = sequences["seq1"]
Writing FASTA Files
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# Create records
records = []
for i in range(10):
seq = Seq("ATCG" * (i + 1))
record = SeqRecord(seq, id=f"seq_{i}", description=f"Sequence {i}")
records.append(record)
# Write to file
SeqIO.write(records, "output.fasta", "fasta")
# Append to existing file
with open("output.fasta", "a") as f:
SeqIO.write(record, f, "fasta")
Reading FASTQ Files
from Bio import SeqIO
# Read FASTQ with quality scores
for record in SeqIO.parse("reads.fastq", "fastq"):
seq_id = record.id
sequence = record.seq
quality = record.letter_annotations["phred_quality"]
avg_q = sum(quality) / len(quality)
min_q = min(quality)
print(f"{seq_id}: avg Q={avg_q:.1f}, min Q={min_q}")
Quality Filtering FASTQ
from Bio import SeqIO
def filter_fastq(input_file, output_file, min_quality=20, min_length=50):
"""Filter FASTQ by quality and length."""
good_reads = 0
with open(output_file, "w") as out_handle:
for record in SeqIO.parse(input_file, "fastq"):
# Check length
if len(record.seq) < min_length:
continue
# Check quality
qualities = record.letter_annotations["phred_quality"]
if min(qualities) >= min_quality:
SeqIO.write(record, out_handle, "fastq")
good_reads += 1
return good_reads
# Filter reads
n_passed = filter_fastq("raw.fastq", "filtered.fastq")
print(f"Passed: {n_passed} reads")
Reading GenBank Files
from Bio import SeqIO
# GenBank files have rich annotations
for record in SeqIO.parse("sequence.gb", "genbank"):
print(f"ID: {record.id}")
print(f"Organism: {record.annotations['organism']}")
print(f"Sequence length: {len(record.seq)}")
# Extract features
for feature in record.features:
if feature.type == "CDS":
product = feature.qualifiers.get("product", ["Unknown"])[0]
start = feature.location.start
end = feature.location.end
print(f" CDS: {product} ({start}-{end})")
Converting File Formats
from Bio import SeqIO
# Convert GenBank to FASTA
records = SeqIO.parse("input.gb", "genbank")
SeqIO.write(records, "output.fasta", "fasta")
# Convert FASTQ to FASTA (lose quality scores)
records = SeqIO.parse("reads.fastq", "fastq")
SeqIO.write(records, "sequences.fasta", "fasta")
# Multiple format conversions in one go
def convert_format(input_file, input_format, output_file, output_format):
"""Generic format converter."""
count = SeqIO.convert(input_file, input_format, output_file, output_format)
print(f"Converted {count} records")
convert_format("seq.gb", "genbank", "seq.fasta", "fasta")
Sequence Alignment
Pairwise Alignment
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.pairwise2 import format_alignment
seq1 = Seq("ACCGGT")
seq2 = Seq("ACGT")
# Global alignment (Needleman-Wunsch)
alignments = pairwise2.align.globalxx(seq1, seq2)
# Print best alignment
best = alignments[0]
print(format_alignment(*best))
# With scoring matrix
alignments = pairwise2.align.globalms(
seq1, seq2,
match=2, # Match score
mismatch=-1, # Mismatch penalty
open=-0.5, # Gap open penalty
extend=-0.1 # Gap extension penalty
)
Custom Scoring
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.Align import substitution_matrices
seq1 = Seq("KEVLA")
seq2 = Seq("KELVA")
# Use BLOSUM62 matrix
matrix = substitution_matrices.load("BLOSUM62")
alignments = pairwise2.align.globaldx(
seq1, seq2,
matrix,
open=-10,
extend=-0.5
)
print(pairwise2.format_alignment(*alignments[0]))
Local Alignment
from Bio import pairwise2
from Bio.Seq import Seq
# Local alignment (Smith-Waterman)
seq1 = Seq("GCATGCTAGATGCTA")
seq2 = Seq("ATGCTA")
alignments = pairwise2.align.localxx(seq1, seq2)
# Best local alignment
best = alignments[0]
print(pairwise2.format_alignment(*best))
Multiple Sequence Alignment
from Bio import AlignIO
from Bio.Align.Applications import ClustalwCommandline
# Run ClustalW (requires installation)
cline = ClustalwCommandline("clustalw2", infile="sequences.fasta")
stdout, stderr = cline()
# Read alignment
alignment = AlignIO.read("sequences.aln", "clustal")
print(f"Alignment length: {alignment.get_alignment_length()}")
print(f"Number of sequences: {len(alignment)}")
# Print alignment
print(alignment)
# Access sequences
for record in alignment:
print(f"{record.id}: {record.seq}")
Alignment Statistics
from Bio import AlignIO
alignment = AlignIO.read("alignment.fasta", "fasta")
# Calculate identity
def calculate_identity(alignment):
"""Calculate pairwise identity matrix."""
n = len(alignment)
identities = []
for i in range(n):
for j in range(i+1, n):
seq1 = alignment[i].seq
seq2 = alignment[j].seq
matches = sum(a == b for a, b in zip(seq1, seq2) if a != '-' and b != '-')
aligned_length = sum(1 for a, b in zip(seq1, seq2) if a != '-' and b != '-')
identity = matches / aligned_length * 100 if aligned_length > 0 else 0
identities.append(identity)
return identities
identities = calculate_identity(alignment)
print(f"Average identity: {sum(identities)/len(identities):.2f}%")
BLAST Searches
Running BLAST Online
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO
# Read query sequence
record = SeqIO.read("query.fasta", "fasta")
# Run BLAST
result_handle = NCBIWWW.qblast(
program="blastn", # blastn, blastp, blastx, tblastn, tblastx
database="nt", # nt, nr, refseq_rna, etc.
sequence=str(record.seq),
hitlist_size=10
)
# Save results
with open("blast_results.xml", "w") as out_handle:
out_handle.write(result_handle.read())
result_handle.close()
Parsing BLAST Results
from Bio.Blast import NCBIXML
# Parse BLAST XML output
with open("blast_results.xml") as result_handle:
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
print(f"Query: {blast_record.query}")
print(f"Number of hits: {len(blast_record.alignments)}")
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 0.001: # E-value threshold
print(f"\n Hit: {alignment.title}")
print(f" Length: {alignment.length}")
print(f" E-value: {hsp.expect}")
print(f" Identity: {hsp.identities}/{hsp.align_length} "
f"({hsp.identities/hsp.align_length*100:.1f}%)")
print(f" Query: {hsp.query}")
print(f" Subject: {hsp.sbjct}")
BLAST with Custom Parameters
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast(
program="blastp",
database="nr",
sequence=protein_seq,
expect=0.001, # E-value threshold
word_size=3, # Word size
matrix_name="BLOSUM62", # Substitution matrix
gapcosts="11 1", # Gap costs (open, extend)
hitlist_size=50, # Number of hits
filter="L" # Low complexity filter
)
Batch BLAST
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO
import time
def batch_blast(fasta_file, output_file):
"""Run BLAST for multiple sequences."""
results = []
for record in SeqIO.parse(fasta_file, "fasta"):
print(f"BLASTing {record.id}...")
result_handle = NCBIWWW.qblast(
"blastn", "nt",
str(record.seq),
hitlist_size=5
)
blast_record = NCBIXML.read(result_handle)
results.append((record.id, blast_record))
# Be nice to NCBI servers
time.sleep(1)
return results
# Run batch BLAST
results = batch_blast("queries.fasta", "batch_results.xml")
Protein Structure Analysis
Loading PDB Files
from Bio.PDB import PDBParser
# Create parser
parser = PDBParser(QUIET=True)
# Load structure
structure = parser.get_structure("protein", "protein.pdb")
# Access hierarchy: Structure → Model → Chain → Residue → Atom
model = structure[0]
chain = model['A']
print(f"Structure ID: {structure.id}")
print(f"Number of models: {len(structure)}")
print(f"Number of chains: {len(model)}")
print(f"Number of residues in chain A: {len(chain)}")
Extracting Information
from Bio.PDB import PDBParser
parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")
# Iterate through structure
for model in structure:
for chain in model:
print(f"Chain {chain.id}")
for residue in chain:
# Skip hetero atoms (water, ligands)
if residue.id[0] == ' ':
resname = residue.resname
resid = residue.id[1]
# Get atoms
for atom in residue:
atom_name = atom.name
coord = atom.coord
print(f" {resname}{resid} {atom_name}: {coord}")
Calculating Distances
from Bio.PDB import PDBParser
import numpy as np
parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")
# Get two atoms
chain = structure[0]['A']
atom1 = chain[10]['CA'] # CA atom of residue 10
atom2 = chain[20]['CA'] # CA atom of residue 20
# Calculate distance
distance = atom1 - atom2 # Overloaded operator
print(f"Distance: {distance:.2f} Å")
# Manual calculation
coord1 = atom1.coord
coord2 = atom2.coord
dist = np.linalg.norm(coord1 - coord2)
print(f"Distance (manual): {dist:.2f} Å")
Center of Mass
from Bio.PDB import PDBParser
import numpy as np
parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")
def calculate_center_of_mass(entity):
"""Calculate center of mass for Structure/Model/Chain."""
coords = []
masses = []
for atom in entity.get_atoms():
coords.append(atom.coord)
masses.append(atom.mass)
coords = np.array(coords)
masses = np.array(masses)
com = np.average(coords, axis=0, weights=masses)
return com
# Calculate for whole structure
com = calculate_center_of_mass(structure)
print(f"Center of mass: {com}")
Selecting Atoms
from Bio.PDB import PDBParser, Selection
parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")
# Select all atoms
all_atoms = Selection.unfold_entities(structure, 'A')
print(f"Total atoms: {len(all_atoms)}")
# Select specific atoms
model = structure[0]
# All CA atoms
ca_atoms = [atom for atom in model.get_atoms() if atom.name == 'CA']
print(f"CA atoms: {len(ca_atoms)}")
# Specific residue range
chain_a = model['A']
residues_10_to_20 = [chain_a[i] for i in range(10, 21) if i in chain_a]
Secondary Structure with DSSP
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP
parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")
model = structure[0]
# Run DSSP (requires dssp executable)
dssp = DSSP(model, "protein.pdb")
# Extract secondary structure
for key in dssp:
residue = dssp[key]
chain_id = key[0]
res_id = key[1][1]
ss = residue[2] # Secondary structure: H=helix, E=sheet, C=coil
acc = residue[3] # Accessible surface area
print(f"{chain_id} {res_id}: {ss} (ASA={acc:.1f})")
Extracting Sequence from PDB
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder
parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")
# Build polypeptides
ppb = PPBuilder()
for chain in structure[0]:
for pp in ppb.build_peptides(chain):
sequence = pp.get_sequence()
print(f"Chain {chain.id}: {sequence}")
Writing PDB Files
from Bio.PDB import PDBParser, PDBIO, Select
# Load structure
parser = PDBParser()
structure = parser.get_structure("protein", "input.pdb")
# Select what to save
class ChainSelect(Select):
def accept_chain(self, chain):
return chain.id == 'A' # Only save chain A
# Write structure
io = PDBIO()
io.set_structure(structure)
io.save("chain_A.pdb", ChainSelect())
Phylogenetic Analysis
Building Trees from Alignment
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
# Read alignment
alignment = AlignIO.read("alignment.fasta", "fasta")
# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
# Construct tree (Neighbor-Joining)
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)
# Save tree
Phylo.write(tree, "tree.xml", "phyloxml")
print(tree)
Tree Visualization
from Bio import Phylo
# Load tree
tree = Phylo.read("tree.xml", "phyloxml")
# Draw tree (ASCII)
Phylo.draw_ascii(tree)
# Draw tree (graphical - requires matplotlib)
import matplotlib.pyplot as plt
Phylo.draw(tree)
plt.savefig("tree.png")
# Interactive view
# tree.ladderize() # Flip tree for better visualization
# Phylo.draw(tree, do_show=True)
Tree Analysis
from Bio import Phylo
tree = Phylo.read("tree.xml", "phyloxml")
# Get all terminals (leaves)
terminals = tree.get_terminals()
print(f"Number of leaves: {len(terminals)}")
# Get all non-terminals (internal nodes)
nonterminals = tree.get_nonterminals()
print(f"Number of internal nodes: {len(nonterminals)}")
# Find common ancestor
clade1 = tree.find_any(name="Species1")
clade2 = tree.find_any(name="Species2")
common = tree.common_ancestor(clade1, clade2)
# Calculate distances
dist = tree.distance(clade1, clade2)
print(f"Distance between Species1 and Species2: {dist:.4f}")
# Get path between nodes
path = tree.get_path(clade1, clade2)
Creating Trees Programmatically
from Bio.Phylo.BaseTree import Tree, Clade
# Create tree structure
tree = Tree()
# Create clades (nodes)
clade_a = Clade(branch_length=0.5, name="A")
clade_b = Clade(branch_length=0.3, name="B")
clade_c = Clade(branch_length=0.4, name="C")
# Create internal node
internal = Clade(branch_length=0.2)
internal.clades = [clade_a, clade_b]
# Create root
root = Clade()
root.clades = [internal, clade_c]
tree.root = root
# Visualize
Phylo.draw_ascii(tree)
NCBI Entrez
Searching Databases
from Bio import Entrez
# ALWAYS set your email
Entrez.email = "your.email@example.com"
# Search nucleotide database
handle = Entrez.esearch(
db="nucleotide",
term="Homo sapiens[Organism] AND COX1",
retmax=10
)
record = Entrez.read(handle)
handle.close()
print(f"Found {record['Count']} records")
print(f"IDs: {record['IdList']}")
Fetching Records
from Bio import Entrez, SeqIO
Entrez.email = "your.email@example.com"
# Fetch by ID
handle = Entrez.efetch(
db="nucleotide",
id="NC_000001",
rettype="gb",
retmode="text"
)
record = SeqIO.read(handle, "genbank")
handle.close()
print(f"ID: {record.id}")
print(f"Description: {record.description}")
print(f"Length: {len(record.seq)}")
Batch Downloads
from Bio import Entrez, SeqIO
Entrez.email = "your.email@example.com"
def download_sequences(id_list, output_file):
"""Download multiple sequences by ID."""
# Fetch records
handle = Entrez.efetch(
db="nucleotide",
id=id_list,
rettype="fasta",
retmode="text"
)
records = SeqIO.parse(handle, "fasta")
SeqIO.write(records, output_file, "fasta")
handle.close()
# Download
ids = ["NM_001301717", "NM_001301718", "NM_001301719"]
download_sequences(ids, "downloaded.fasta")
Entrez with History
from Bio import Entrez
Entrez.email = "your.email@example.com"
# Search with history
search_handle = Entrez.esearch(
db="nucleotide",
term="human[organism] AND COX1",
usehistory="y",
retmax=1000
)
search_results = Entrez.read(search_handle)
search_handle.close()
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
count = int(search_results["Count"])
print(f"Found {count} records")
# Fetch in batches
batch_size = 100
for start in range(0, count, batch_size):
fetch_handle = Entrez.efetch(
db="nucleotide",
rettype="fasta",
retmode="text",
retstart=start,
retmax=batch_size,
webenv=webenv,
query_key=query_key
)
# Process batch
data = fetch_handle.read()
fetch_handle.close()
Advanced Workflows
Complete Gene Analysis Pipeline
from Bio import Entrez, SeqIO
from Bio.Seq import Seq
from Bio.SeqUtils import GC
Entrez.email = "your.email@example.com"
def analyze_gene(gene_id):
"""Complete analysis of a gene."""
# 1. Fetch sequence
handle = Entrez.efetch(db="nucleotide", id=gene_id, rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
# 2. Basic stats
length = len(record.seq)
gc_content = GC(record.seq)
# 3. Extract CDS
cds_list = []
for feature in record.features:
if feature.type == "CDS":
cds_seq = feature.extract(record.seq)
product = feature.qualifiers.get("product", ["Unknown"])[0]
cds_list.append((product, cds_seq))
# 4. Translate CDS
proteins = []
for product, cds in cds_list:
try:
protein = cds.translate(to_stop=True)
proteins.append((product, protein))
except:
proteins.append((product, None))
return {
'id': record.id,
'length': length,
'gc': gc_content,
'n_cds': len(cds_list),
'proteins': proteins
}
# Analyze gene
result = analyze_gene("NM_000518")
print(f"Gene: {result['id']}")
print(f"Length: {result['length']} bp")
print(f"GC: {result['gc']:.2f}%")
print(f"CDS: {result['n_cds']}")
RNA-Seq Read Processing
from Bio import SeqIO
import numpy as np
def process_rnaseq_reads(fastq_file, min_quality=30, min_length=50):
"""Process RNA-seq reads with quality filtering."""
stats = {
'total': 0,
'passed_quality': 0,
'passed_length': 0,
'final': 0
}
passed_reads = []
for record in SeqIO.parse(fastq_file, "fastq"):
stats['total'] += 1
# Quality filter
qualities = record.letter_annotations["phred_quality"]
if np.mean(qualities) < min_quality:
continue
stats['passed_quality'] += 1
# Length filter
if len(record.seq) < min_length:
continue
stats['passed_length'] += 1
# Trim adapters (simple example)
# In reality, use specialized tools like Cutadapt
passed_reads.append(record)
stats['final'] += 1
return passed_reads, stats
# Process reads
reads, stats = process_rnaseq_reads("rnaseq.fastq")
print(f"Total: {stats['total']}")
print(f"Passed quality: {stats['passed_quality']}")
print(f"Passed length: {stats['passed_length']}")
print(f"Final: {stats['final']}")
Restriction Enzyme Analysis
from Bio import SeqIO
from Bio.Restriction import *
# Load sequence
record = SeqIO.read("plasmid.gb", "genbank")
seq = record.seq
# Create restriction batch
rb = RestrictionBatch([EcoRI, BamHI, PstI, HindIII])
# Find restriction sites
analysis = rb.search(seq)
# Print results
for enzyme, sites in analysis.items():
if sites:
print(f"{enzyme}: {len(sites)} site(s) at {sites}")
else:
print(f"{enzyme}: No sites found")
# Find enzymes that cut once (good for cloning)
single_cutters = [enz for enz, sites in analysis.items() if len(sites) == 1]
print(f"\nSingle cutters: {[str(e) for e in single_cutters]}")
Codon Usage Analysis
from Bio import SeqIO
from collections import Counter
def analyze_codon_usage(genbank_file):
"""Analyze codon usage in CDS features."""
codon_counts = Counter()
total_codons = 0
for record in SeqIO.parse(genbank_file, "genbank"):
for feature in record.features:
if feature.type == "CDS":
# Extract CDS sequence
cds = feature.extract(record.seq)
# Count codons
for i in range(0, len(cds) - 2, 3):
codon = str(cds[i:i+3])
if len(codon) == 3:
codon_counts[codon] += 1
total_codons += 1
# Calculate frequencies
codon_freq = {codon: count/total_codons
for codon, count in codon_counts.items()}
return codon_counts, codon_freq
# Analyze
counts, freqs = analyze_codon_usage("genome.gb")
# Print most common codons
for codon, freq in sorted(freqs.items(), key=lambda x: x[1], reverse=True)[:10]:
print(f"{codon}: {freq:.4f}")
Performance Optimization
Memory-Efficient File Processing
from Bio import SeqIO
def process_large_fasta(filename, process_func):
"""Process large FASTA without loading into memory."""
count = 0
for record in SeqIO.parse(filename, "fasta"):
process_func(record)
count += 1
# Progress report
if count % 10000 == 0:
print(f"Processed {count} sequences")
return count
# Example processor
def calculate_gc(record):
from Bio.SeqUtils import GC
gc = GC(record.seq)
if gc > 60:
print(f"{record.id}: High GC ({gc:.1f}%)")
# Process
n = process_large_fasta("large_genome.fasta", calculate_gc)
Parallel Processing
from Bio import SeqIO
from multiprocessing import Pool
from functools import partial
def process_sequence(record, min_length=100):
"""Process single sequence."""
if len(record.seq) >= min_length:
from Bio.SeqUtils import GC
return (record.id, len(record.seq), GC(record.seq))
return None
def parallel_process_fasta(filename, n_processes=4):
"""Process FASTA in parallel."""
# Load sequences
records = list(SeqIO.parse(filename, "fasta"))
# Process in parallel
with Pool(n_processes) as pool:
results = pool.map(process_sequence, records)
# Filter None results
return [r for r in results if r is not None]
# Process
results = parallel_process_fasta("sequences.fasta", n_processes=4)
for seq_id, length, gc in results[:10]:
print(f"{seq_id}: {length} bp, GC={gc:.2f}%")
Index Files for Random Access
from Bio import SeqIO
# Create index (fast random access)
record_dict = SeqIO.index("large.fasta", "fasta")
# Access specific sequence instantly
record = record_dict["seq_12345"]
print(record.seq)
# Iterate (still memory efficient)
for key in record_dict:
record = record_dict[key]
process(record)
# Close when done
record_dict.close()
# SQLite-backed index (for very large files)
record_dict = SeqIO.index_db("large.idx", "huge.fasta", "fasta")
Common Pitfalls and Solutions
Translation Frame Errors
from Bio.Seq import Seq
dna = Seq("ATGGCCATTGTAATGGGCCGC")
# Problem: Wrong reading frame
protein = dna.translate() # May have stop codons
# Solution: Check all frames
for i in range(3):
frame = dna[i:]
protein = frame.translate(to_stop=True, table=1)
if len(protein) > 10: # Reasonable length
print(f"Frame {i}: {protein}")
File Format Mismatches
from Bio import SeqIO
# Problem: Wrong format specified
try:
records = SeqIO.parse("file.fasta", "genbank") # Wrong!
except:
print("Format mismatch")
# Solution: Verify format
import os
def guess_format(filename):
ext = os.path.splitext(filename)[1].lower()
format_map = {
'.fasta': 'fasta',
'.fa': 'fasta',
'.fna': 'fasta',
'.fastq': 'fastq',
'.fq': 'fastq',
'.gb': 'genbank',
'.gbk': 'genbank'
}
return format_map.get(ext, 'fasta')
format = guess_format("file.fasta")
records = SeqIO.parse("file.fasta", format)
NCBI Rate Limiting
from Bio import Entrez
import time
Entrez.email = "your.email@example.com"
# Problem: Too many requests
def bad_batch_download(id_list):
for gene_id in id_list:
handle = Entrez.efetch(db="nucleotide", id=gene_id)
# This will get you blocked!
# Solution: Add delays
def good_batch_download(id_list):
results = []
for gene_id in id_list:
handle = Entrez.efetch(db="nucleotide", id=gene_id,
rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
results.append(record)
time.sleep(0.34) # ~3 requests/sec
return results
# Better: Use Entrez history for large batches
Memory Issues with Large Alignments
from Bio import AlignIO
# Problem: Loading huge alignment
alignment = AlignIO.read("huge_alignment.fasta", "fasta") # OOM!
# Solution: Process incrementally
def process_alignment_chunks(filename, chunk_size=1000):
records = SeqIO.parse(filename, "fasta")
chunk = []
for record in records:
chunk.append(record)
if len(chunk) >= chunk_size:
# Process chunk
process_chunk(chunk)
chunk = []
# Process remaining
if chunk:
process_chunk(chunk)
This comprehensive Biopython guide covers 50+ examples across all major bioinformatics workflows!