Pairwise Sequence Alignment
Align two sequences using dynamic programming algorithms (Needleman-Wunsch for global, Smith-Waterman for local).
Required Import
from Bio.Align import PairwiseAligner from Bio.Seq import Seq from Bio import SeqIO
Core Concepts
Mode Algorithm Use Case
global
Needleman-Wunsch Full-length alignment, similar-length sequences
local
Smith-Waterman Find best matching regions, different-length sequences
Creating an Aligner
Basic aligner with defaults
aligner = PairwiseAligner()
Configure mode and scoring
aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)
For protein alignment with substitution matrix
from Bio.Align import substitution_matrices aligner = PairwiseAligner(mode='global', substitution_matrix=substitution_matrices.load('BLOSUM62'))
Performing Alignments
seq1 = Seq('ACCGGTAACGTAG') seq2 = Seq('ACCGTTAACGAAG')
Get all optimal alignments
alignments = aligner.align(seq1, seq2) print(f'Found {len(alignments)} optimal alignments') print(alignments[0]) # Print first alignment
Get score only (faster for large sequences)
score = aligner.score(seq1, seq2)
Alignment Output Format
target 0 ACCGGTAACGTAG 13 0 |||||.||||.|| 13 query 0 ACCGTTAACGAAG 13
Accessing Alignment Data
alignment = alignments[0]
Basic properties
print(alignment.score) # Alignment score print(alignment.shape) # (num_seqs, alignment_length) print(len(alignment)) # Alignment length
Get aligned sequences with gaps
target_aligned = alignment[0, :] # First sequence (target) with gaps query_aligned = alignment[1, :] # Second sequence (query) with gaps
Get coordinate mapping
print(alignment.aligned) # Array of aligned segment coordinates print(alignment.coordinates) # Full coordinate array
Alignment Counts (Identities, Mismatches, Gaps)
alignment = alignments[0] counts = alignment.counts()
print(f'Identities: {counts.identities}') print(f'Mismatches: {counts.mismatches}') print(f'Gaps: {counts.gaps}')
Calculate percent identity
total_aligned = counts.identities + counts.mismatches percent_identity = counts.identities / total_aligned * 100 print(f'Percent identity: {percent_identity:.1f}%')
Common Scoring Configurations
DNA/RNA Alignment
aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)
Protein Alignment
from Bio.Align import substitution_matrices blosum62 = substitution_matrices.load('BLOSUM62') aligner = PairwiseAligner(mode='global', substitution_matrix=blosum62, open_gap_score=-11, extend_gap_score=-1)
Local Alignment (Find Best Region)
aligner = PairwiseAligner(mode='local', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)
Semiglobal (Overlap/Extension)
Allow free end gaps on query (useful for primer alignment)
aligner = PairwiseAligner(mode='global') aligner.query_left_open_gap_score = 0 aligner.query_left_extend_gap_score = 0 aligner.query_right_open_gap_score = 0 aligner.query_right_extend_gap_score = 0
Available Substitution Matrices
from Bio.Align import substitution_matrices print(substitution_matrices.load()) # List all available matrices
Common matrices
blosum62 = substitution_matrices.load('BLOSUM62') # General protein blosum80 = substitution_matrices.load('BLOSUM80') # Closely related proteins pam250 = substitution_matrices.load('PAM250') # Distantly related proteins
Working with SeqRecord Objects
from Bio import SeqIO
records = list(SeqIO.parse('sequences.fasta', 'fasta')) seq1, seq2 = records[0].seq, records[1].seq
aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1) alignments = aligner.align(seq1, seq2)
Iterating Over Multiple Alignments
Limit number of alignments returned (memory efficient)
aligner.max_alignments = 100
for i, alignment in enumerate(alignments): print(f'Alignment {i+1}: score={alignment.score}') if i >= 4: break
Substitution Matrix from Alignment
alignment = alignments[0] substitutions = alignment.substitutions
View as array (rows=target, cols=query)
print(substitutions)
Access specific substitution counts
substitutions['A', 'T'] gives count of A aligned to T
Export Alignment to Different Formats
alignment = alignments[0]
Various output formats
print(format(alignment, 'fasta')) # FASTA format print(format(alignment, 'clustal')) # Clustal format print(format(alignment, 'psl')) # PSL format (BLAT) print(format(alignment, 'sam')) # SAM format
Quick Reference: Scoring Parameters
Parameter Description Typical DNA Typical Protein
match_score
Score for identical bases 1-2 Use matrix
mismatch_score
Penalty for mismatches -1 to -3 Use matrix
open_gap_score
Cost to start a gap -5 to -15 -10 to -12
extend_gap_score
Cost per gap extension -0.5 to -2 -0.5 to -1
substitution_matrix
Scoring matrix N/A BLOSUM62
Common Errors
Error Cause Solution
OverflowError
Too many optimal alignments Set aligner.max_alignments
Low scores Wrong scoring scheme Use substitution matrix for proteins
No alignments in local mode Scores all negative Ensure match_score
0
Decision Tree: Choosing Alignment Mode
Need full-length comparison? ├── Yes → Use mode='global' │ └── Sequences similar length? │ ├── Yes → Standard global │ └── No → Consider semiglobal (free end gaps) └── No → Use mode='local' └── Find best matching regions only
Related Skills
-
alignment-io - Save alignments to files in various formats
-
msa-parsing - Work with multiple sequence alignments
-
msa-statistics - Calculate identity, similarity metrics
-
sequence-manipulation/motif-search - Pattern matching in sequences