Sequence I/O
Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).
When to Use This Skill
This skill should be used when:
-
Reading or writing sequence files (FASTA, GenBank, FASTQ)
-
Converting between sequence file formats
-
Manipulating sequences (complement, reverse complement, translate)
-
Extracting sequences from large indexed FASTA files (faidx)
-
Calculating sequence statistics (GC content, molecular weight, Tm)
When NOT to Use This Skill
-
NGS alignment files (SAM/BAM/VCF) → Use pysam
-
BLAST searches → Use gget (quick) or blat-integration (large-scale)
-
Multiple sequence alignment → Use msa-advanced
-
Phylogenetic analysis → Use etetoolkit
-
NCBI database queries → Use pubmed-database or gene-database
Tool Selection Guide
Task Tool Reference
Parse FASTA/GenBank/FASTQ Bio.SeqIO
biopython_seqio.md
Convert file formats Bio.SeqIO.convert()
biopython_seqio.md
Sequence operations Bio.Seq
biopython_seqio.md
Large FASTA random access pysam.FastaFile
- faidx faidx.md
GC%, Tm, molecular weight Bio.SeqUtils
utilities.md
Quick Start
Installation
uv pip install biopython pysam
Read FASTA
from Bio import SeqIO
for record in SeqIO.parse("sequences.fasta", "fasta"): print(f"{record.id}: {len(record.seq)} bp")
Convert GenBank to FASTA
from Bio import SeqIO
SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
Random Access with faidx
import pysam
Create index (once)
pysam.faidx("reference.fasta")
Random access
fasta = pysam.FastaFile("reference.fasta") seq = fasta.fetch("chr1", 1000, 2000) # 0-based coordinates fasta.close()
Sequence Operations
from Bio.Seq import Seq
seq = Seq("ATGCGATCGATCG") print(seq.complement()) print(seq.reverse_complement()) print(seq.translate())
Reference Documentation
Consult the appropriate reference file for detailed documentation:
references/biopython_seqio.md
-
Bio.Seq object and sequence operations
-
Bio.SeqIO for file parsing and writing
-
SeqRecord object and annotations
-
Supported file formats
-
Format conversion patterns
references/faidx.md
-
Creating FASTA index with pysam.faidx()
-
pysam.FastaFile for random access
-
Coordinate systems (0-based vs 1-based)
-
Performance considerations for large files
-
Common patterns (variant context, gene extraction)
references/utilities.md
-
GC content calculation (gc_fraction )
-
Molecular weight (molecular_weight )
-
Melting temperature (MeltingTemp )
-
Codon usage analysis
-
Restriction enzyme sites
references/formats.md
-
FASTA format specification
-
GenBank format specification
-
FASTQ format and quality scores
-
Format detection and validation
Coordinate Systems
Biopython: Uses Python-style 0-based, half-open intervals for slicing.
pysam.FastaFile.fetch():
-
Numeric arguments: 0-based (fetch("chr1", 999, 2000) = positions 999-1999)
-
Region strings: 1-based (fetch("chr1:1000-2000") = positions 1000-2000)
Common Pitfalls
-
Coordinate confusion: Remember which tool uses 0-based vs 1-based
-
Missing faidx index: Random access requires .fai file
-
Format mismatch: Verify file format matches the format string in SeqIO.parse()
-
Iterator exhaustion: SeqIO.parse() returns an iterator; convert to list if multiple passes needed
-
Large files: Use iterators, not list() , for memory efficiency