BAM Toolkit
Toolkit for BAM/SAM/CRAM alignment file analysis: extract reads, identify indels, and calculate coverage. Designed for WGS/WES sequencing result inspection and quality control.
Quick Start
Install
uv pip install pysam typer
Basic Usage
1. 特定領域のリードを抽出
python scripts/extract_reads.py --bam alignment.bam --region chr1:1000-2000 --output reads.json --format json
2. Indel を抽出
python scripts/extract_indels.py --bam alignment.bam --region chr1:1000-2000 --output indels.json
3. カバレッジ統計を計算
python scripts/calculate_coverage.py --bam alignment.bam --region chr1:1000-2000 --output coverage.json
Scripts
extract_reads.py - Read Extraction
Extract reads from BAM/SAM/CRAM files for specific genomic regions and export as BAM or JSON format.
Required Arguments
-
--bam PATH
-
Input BAM/SAM/CRAM file path
-
--region TEXT
-
Genomic region (e.g., chr1:1000-2000 )
Optional Arguments
Output:
-
--output PATH
-
Output file path (BAM or JSON based on extension)
-
--format TEXT
-
Output format: bam or json (default: bam )
Filters:
-
--min-mapq INT
-
Minimum mapping quality (default: 0)
-
--proper-pairs
-
Only properly paired reads
-
--no-duplicates
-
Exclude duplicate reads
Output Format (JSON)
{ "region": "chr1:1000-2000", "total_reads": 150, "filtered_from": 200, "filters": { "min_mapq": 30, "proper_pairs_only": true, "no_duplicates": true }, "reads": [ { "query_name": "read1", "reference_start": 1050, "reference_end": 1150, "sequence": "ATCG...", "quality": "IIII...", "mapping_quality": 60, "is_reverse": false, "cigar": "100M", "is_proper_pair": true, "is_duplicate": false, "is_paired": true, "mate_is_reverse": true, "mate_reference_name": "chr1", "mate_reference_start": 1200, "template_length": 250 } ] }
Output Format (BAM)
When --format bam is specified, outputs a filtered BAM file containing only reads that pass the specified filters.
Usage Examples
Extract reads as JSON
python scripts/extract_reads.py
--bam alignment.bam
--region chr1:1000-2000
--output reads.json
--format json
Extract reads as BAM
python scripts/extract_reads.py
--bam alignment.bam
--region chr1:1000-2000
--output reads.bam
Extract high-quality properly-paired reads
python scripts/extract_reads.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
--proper-pairs
--no-duplicates
--output filtered.bam
extract_indels.py - Insertion/Deletion Extraction
Extract insertions and deletions from reads in specified genomic regions by parsing CIGAR strings.
Required Arguments
-
--bam PATH
-
Input BAM/SAM/CRAM file path
-
--region TEXT
-
Genomic region (e.g., chr1:1000-2000 )
Optional Arguments
Output:
- --output PATH
- JSON output file path (default: stdout)
Filters:
-
--min-mapq INT
-
Minimum mapping quality (default: 20)
-
--min-indel-size INT
-
Minimum indel size in bp (default: 1)
Output Format (JSON)
{ "region": "chr1:1000-2000", "filters": { "min_mapq": 20, "min_indel_size": 1 }, "reads": { "total": 500, "processed": 450 }, "summary": { "total_insertions": 15, "total_deletions": 8, "unique_insertions": 5, "unique_deletions": 3 }, "insertions": [ { "position": 1050, "size": 3, "sequence": "ATG", "read_name": "read1", "mapping_quality": 60, "count": 3, "supporting_reads": ["read1", "read2", "read3"] } ], "deletions": [ { "position": 1100, "size": 2, "read_name": "read2", "mapping_quality": 55, "count": 2, "supporting_reads": ["read2", "read5"] } ] }
Implementation Details
-
Parses CIGAR strings to identify I (insertion) and D (deletion) operations
-
Extracts sequences for insertions from read sequences
-
Groups identical indels by position and sequence/size
-
Counts supporting reads for each unique indel
-
Filters by mapping quality and minimum indel size
Usage Examples
Extract all indels
python scripts/extract_indels.py
--bam alignment.bam
--region chr1:1000-2000
--output indels.json
Extract large indels only (>= 5bp)
python scripts/extract_indels.py
--bam alignment.bam
--region chr1:1000-2000
--min-indel-size 5
--output large_indels.json
High-quality indels only
python scripts/extract_indels.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
--output hq_indels.json
calculate_coverage.py - Coverage Calculation
Calculate coverage statistics for specified genomic regions using pileup operations.
Required Arguments
-
--bam PATH
-
Input BAM/SAM/CRAM file path
-
--region TEXT
-
Genomic region (e.g., chr1:1000-2000 )
Optional Arguments
Output:
- --output PATH
- JSON output file path (default: stdout)
Options:
-
--min-mapq INT
-
Minimum mapping quality (default: 0)
-
--min-baseq INT
-
Minimum base quality (default: 0)
Output Format (JSON)
{ "region": "chr1:1000-2000", "filters": { "min_mapq": 0, "min_baseq": 0 }, "statistics": { "total_bases": 1000, "mean_coverage": 45.3, "median_coverage": 48, "min_coverage": 0, "max_coverage": 120, "bases_with_coverage": 995, "bases_without_coverage": 5, "percent_covered": 99.5 } }
Usage Examples
Calculate coverage statistics
python scripts/calculate_coverage.py
--bam alignment.bam
--region chr1:1000-2000
--output coverage.json
High-quality bases only
python scripts/calculate_coverage.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
--min-baseq 20
--output hq_coverage.json
Workflow Examples
Example 1: Comprehensive Read Analysis Workflow
Combine all three scripts for complete BAM analysis:
Step 1: Calculate coverage statistics
python scripts/calculate_coverage.py
--bam alignment.bam
--region chr1:1000-2000
--output coverage.json
Step 2: Extract indels
python scripts/extract_indels.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
--output indels.json
Step 3: Extract reads as JSON for detailed inspection
python scripts/extract_reads.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
--format json
--output reads.json
Example 2: High-Quality Indel Discovery
Identify large insertions and deletions with high-quality reads:
Extract large indels (>= 10bp) with high mapping quality
python scripts/extract_indels.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
--min-indel-size 10
--output large_indels.json
Extract supporting reads for manual validation
python scripts/extract_reads.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
--output supporting_reads.bam
Example 3: Coverage Quality Control
Assess sequencing coverage quality across target region:
Calculate coverage statistics with quality filters
python scripts/calculate_coverage.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
--min-baseq 20
--output coverage_stats.json
Extract reads from low-coverage regions for inspection
python scripts/extract_reads.py
--bam alignment.bam
--region chr1:1500-1600
--format json
--output low_coverage_reads.json
Error Handling
Missing BAM Index
$ python scripts/extract_reads.py --bam alignment.bam --region chr1:1000-2000
Error fetching reads: random access requires an index Make sure the region 'chr1:1000-2000' exists and BAM file is indexed.
Solution: Create BAM index with samtools:
samtools index alignment.bam
Invalid Region Format
$ python scripts/extract_reads.py --bam alignment.bam --region 1000-2000
Error: Invalid region format: 1000-2000. Expected format: chr1:1000-2000
Solution: Use correct region format chr:start-end :
python scripts/extract_reads.py --bam alignment.bam --region chr1:1000-2000
Best Practices
- Always Index BAM Files
Create BAM index before using any scripts to enable efficient random access:
✅ Good: Index BAM file first
samtools index alignment.bam
Then use bam-toolkit scripts
python scripts/extract_reads.py --bam alignment.bam --region chr1:1000-2000
- Apply Quality Filters for Reliable Results
Use mapping quality and base quality filters to ensure reliable analysis:
✅ Good: Apply quality filters
python scripts/extract_indels.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
✅ Good: Use proper pairs for structural variant analysis
python scripts/extract_reads.py
--bam alignment.bam
--region chr1:1000-2000
--proper-pairs
--no-duplicates
- Extract Reads as BAM for Further Processing
Use BAM output when planning to use extracted reads with other tools:
✅ Good: Extract as BAM for use with samtools/IGV
python scripts/extract_reads.py
--bam alignment.bam
--region chr1:1000-2000
--min-mapq 30
--output filtered.bam
Then index and use with other tools
samtools index filtered.bam samtools view filtered.bam
When to Use bam-toolkit vs samtools
Task bam-toolkit samtools
Read extraction as JSON ✅ extract_reads.py
Indel extraction with grouping ✅ extract_indels.py
Coverage statistics as JSON ✅ calculate_coverage.py ✅ samtools depth
BAM-to-BAM filtering ✅ extract_reads.py ✅ samtools view
Read alignment statistics
✅ samtools flagstat
BAM indexing
✅ samtools index
Recommended Workflow:
-
Index BAM files with samtools
-
Use bam-toolkit scripts for structured JSON output
-
Use samtools for standard BAM operations (sorting, indexing, format conversion)
Related Skills
-
vcf-toolkit - VCF/BCF variant file operations
-
sequence-io - FASTA/FASTQ sequence file operations
-
blast-search - BLAST homology search
-
blat-api-searching - BLAT genome mapping
Troubleshooting
BAM File Not Readable
Ensure BAM file is properly formatted and readable:
Check BAM file integrity
samtools quickcheck alignment.bam
View BAM header
samtools view -H alignment.bam
Chromosome Name Mismatch
Chromosome names must match between BAM file and region specification:
Check chromosome names in BAM
samtools idxstats alignment.bam | cut -f1
Use correct chromosome name
If BAM uses "1" instead of "chr1":
python scripts/extract_reads.py --bam alignment.bam --region 1:1000-2000
Empty Output
If output is empty, check:
-
Region contains aligned reads: samtools view alignment.bam chr1:1000-2000 | head
-
Filters are not too restrictive (try reducing --min-mapq )
-
Chromosome name matches BAM file