Pairwise Sequence Alignment

Align two sequences using dynamic programming algorithms (Needleman-Wunsch for global, Smith-Waterman for local).

Required Import

from Bio.Align import PairwiseAligner from Bio.Seq import Seq from Bio import SeqIO

Core Concepts

Mode Algorithm Use Case

global

Needleman-Wunsch Full-length alignment, similar-length sequences

local

Smith-Waterman Find best matching regions, different-length sequences

Creating an Aligner

Basic aligner with defaults

aligner = PairwiseAligner()

Configure mode and scoring

aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

For protein alignment with substitution matrix

from Bio.Align import substitution_matrices aligner = PairwiseAligner(mode='global', substitution_matrix=substitution_matrices.load('BLOSUM62'))

Performing Alignments

seq1 = Seq('ACCGGTAACGTAG') seq2 = Seq('ACCGTTAACGAAG')

Get all optimal alignments

alignments = aligner.align(seq1, seq2) print(f'Found {len(alignments)} optimal alignments') print(alignments[0]) # Print first alignment

Get score only (faster for large sequences)

score = aligner.score(seq1, seq2)

Alignment Output Format

target 0 ACCGGTAACGTAG 13 0 |||||.||||.|| 13 query 0 ACCGTTAACGAAG 13

Accessing Alignment Data

alignment = alignments[0]

Basic properties

print(alignment.score) # Alignment score print(alignment.shape) # (num_seqs, alignment_length) print(len(alignment)) # Alignment length

Get aligned sequences with gaps

target_aligned = alignment[0, :] # First sequence (target) with gaps query_aligned = alignment[1, :] # Second sequence (query) with gaps

Get coordinate mapping

print(alignment.aligned) # Array of aligned segment coordinates print(alignment.coordinates) # Full coordinate array

Alignment Counts (Identities, Mismatches, Gaps)

alignment = alignments[0] counts = alignment.counts()

print(f'Identities: {counts.identities}') print(f'Mismatches: {counts.mismatches}') print(f'Gaps: {counts.gaps}')

Calculate percent identity

total_aligned = counts.identities + counts.mismatches percent_identity = counts.identities / total_aligned * 100 print(f'Percent identity: {percent_identity:.1f}%')

Common Scoring Configurations

DNA/RNA Alignment

aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

Protein Alignment

from Bio.Align import substitution_matrices blosum62 = substitution_matrices.load('BLOSUM62') aligner = PairwiseAligner(mode='global', substitution_matrix=blosum62, open_gap_score=-11, extend_gap_score=-1)

Local Alignment (Find Best Region)

aligner = PairwiseAligner(mode='local', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

Semiglobal (Overlap/Extension)

Allow free end gaps on query (useful for primer alignment)

aligner = PairwiseAligner(mode='global') aligner.query_left_open_gap_score = 0 aligner.query_left_extend_gap_score = 0 aligner.query_right_open_gap_score = 0 aligner.query_right_extend_gap_score = 0

Available Substitution Matrices

from Bio.Align import substitution_matrices print(substitution_matrices.load()) # List all available matrices

Common matrices

blosum62 = substitution_matrices.load('BLOSUM62') # General protein blosum80 = substitution_matrices.load('BLOSUM80') # Closely related proteins pam250 = substitution_matrices.load('PAM250') # Distantly related proteins

Working with SeqRecord Objects

from Bio import SeqIO

records = list(SeqIO.parse('sequences.fasta', 'fasta')) seq1, seq2 = records[0].seq, records[1].seq

aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1) alignments = aligner.align(seq1, seq2)

Iterating Over Multiple Alignments

Limit number of alignments returned (memory efficient)

aligner.max_alignments = 100

for i, alignment in enumerate(alignments): print(f'Alignment {i+1}: score={alignment.score}') if i >= 4: break

Substitution Matrix from Alignment

alignment = alignments[0] substitutions = alignment.substitutions

View as array (rows=target, cols=query)

print(substitutions)

Access specific substitution counts

substitutions['A', 'T'] gives count of A aligned to T

Export Alignment to Different Formats

alignment = alignments[0]

Various output formats

print(format(alignment, 'fasta')) # FASTA format print(format(alignment, 'clustal')) # Clustal format print(format(alignment, 'psl')) # PSL format (BLAT) print(format(alignment, 'sam')) # SAM format

Quick Reference: Scoring Parameters

Parameter Description Typical DNA Typical Protein

match_score

Score for identical bases 1-2 Use matrix

mismatch_score

Penalty for mismatches -1 to -3 Use matrix

open_gap_score

Cost to start a gap -5 to -15 -10 to -12

extend_gap_score

Cost per gap extension -0.5 to -2 -0.5 to -1

substitution_matrix

Scoring matrix N/A BLOSUM62

Common Errors

Error Cause Solution

OverflowError

Too many optimal alignments Set aligner.max_alignments

Low scores Wrong scoring scheme Use substitution matrix for proteins

No alignments in local mode Scores all negative Ensure match_score

0

Decision Tree: Choosing Alignment Mode

Need full-length comparison? ├── Yes → Use mode='global' │ └── Sequences similar length? │ ├── Yes → Standard global │ └── No → Consider semiglobal (free end gaps) └── No → Use mode='local' └── Find best matching regions only

Related Skills

alignment-io - Save alignments to files in various formats
msa-parsing - Work with multiple sequence alignments
msa-statistics - Calculate identity, similarity metrics
sequence-manipulation/motif-search - Pattern matching in sequences

bio-alignment-pairwise

Safety Notice

Copy this and send it to your AI assistant to learn