bioinformatics-fundamentals

Bioinformatics Fundamentals

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "bioinformatics-fundamentals" with this command: npx skills add delphine-l/claude_global/delphine-l-claude-global-bioinformatics-fundamentals

Bioinformatics Fundamentals

Foundation knowledge for genomics and bioinformatics workflows. Provides essential understanding of file formats, sequencing technologies, and common data processing patterns.

When to Use This Skill

  • Working with sequencing data (PacBio HiFi, Hi-C, Illumina)

  • Debugging SAM/BAM alignment or filtering issues

  • Processing AGP files for genome assembly curation

  • Validating AGP coordinate systems and unloc assignments

  • Understanding paired-end vs single-end data

  • Interpreting quality metrics (MAPQ, PHRED scores)

  • Troubleshooting empty outputs or broken read pairs

  • Accessing GenomeArk QC data (GenomeScope, BUSCO, Merqury)

  • Curating karyotype data or chromosome count analysis

  • General bioinformatics data analysis

SAM/BAM Format Essentials

SAM Flags (Bitwise)

Flags are additive - a read can have multiple flags set simultaneously.

Common Flags:

  • 0x0001 (1): Read is paired in sequencing

  • 0x0002 (2): Each segment properly aligned (proper pair)

  • 0x0004 (4): Read unmapped

  • 0x0008 (8): Mate unmapped

  • 0x0010 (16): Read mapped to reverse strand

  • 0x0020 (32): Mate mapped to reverse strand

  • 0x0040 (64): First in pair (R1/forward)

  • 0x0080 (128): Second in pair (R2/reverse)

  • 0x0100 (256): Secondary alignment

  • 0x0400 (1024): PCR or optical duplicate

  • 0x0800 (2048): Supplementary alignment

Flag Combinations:

  • Properly paired R1: 99 (0x63 = 1 + 2 + 32 + 64)

  • Properly paired R2: 147 (0x93 = 1 + 2 + 16 + 128)

  • Unmapped read: 4

  • Mate unmapped: 8

See reference.md for complete flag tables, CIGAR operations, optional tags, and SAM mandatory fields.

Proper Pair Flag (0x0002)

What "proper pair" means:

  • Both R1 and R2 are mapped

  • Mapping orientations are correct (typically R1 forward, R2 reverse)

  • Insert size is reasonable for the library

  • Pair conforms to aligner's expectations

Important: Different aligners have different criteria for proper pairs!

MAPQ (Mapping Quality)

Formula: MAPQ = -10 * log10(P(mapping is wrong))

Common Thresholds:

  • MAPQ >= 60 : High confidence (error probability < 0.0001%)

  • MAPQ >= 30 : Good quality (error probability < 0.1%)

  • MAPQ >= 20 : Acceptable (error probability < 1%)

  • MAPQ >= 10 : Low confidence (error probability < 10%)

  • MAPQ = 0 : Multi-mapper or unmapped

Note: MAPQ=0 can mean either unmapped OR equally good multiple mappings.

CIGAR String

Represents alignment between read and reference:

  • M : Match or mismatch (alignment match)

  • I : Insertion in read vs reference

  • D : Deletion in read vs reference

  • S : Soft clipping (bases in read not aligned)

  • H : Hard clipping (bases not in read sequence)

  • N : Skipped region (for RNA-seq splicing)

Example: 100M = perfect 100bp match Example: 50M5I45M = 50bp match, 5bp insertion, 45bp match

Sequencing Technologies

PacBio HiFi (High Fidelity)

Characteristics:

  • Long reads: 10-25 kb typical

  • High accuracy: >99.9% (Q20+)

  • Circular Consensus Sequencing (CCS)

  • Single-end data (though from circular molecules)

  • Excellent for de novo assembly

Best Mappers:

  • minimap2 presets: map-pb , map-hifi

  • BWA-MEM2 can work but optimized for short reads

Typical Use Cases:

  • De novo genome assembly

  • Structural variant detection

  • Isoform sequencing (Iso-Seq)

  • Haplotype phasing

Hi-C (Chromatin Conformation Capture)

Characteristics:

  • Paired-end short reads (typically 100-150 bp)

  • Read pairs capture chromatin interactions

  • R1 and R2 often map to different scaffolds/chromosomes

  • Requires careful proper pair handling

  • Used for scaffolding and 3D genome structure

Best Mappers:

  • BWA-MEM2 (paired-end mode)

  • BWA-MEM (paired-end mode)

Critical Concept: Hi-C read pairs intentionally map to distant loci. Region filtering can easily break pairs!

Typical Use Cases:

  • Genome scaffolding (connecting contigs)

  • 3D chromatin structure analysis

  • Haplotype phasing

  • Assembly quality assessment

Illumina Short Reads

Characteristics:

  • Short reads: 50-300 bp

  • Paired-end or single-end

  • High throughput

  • Well-established quality scores

Best Mappers:

  • BWA-MEM2, BWA-MEM (general purpose)

  • Bowtie2 (fast, local alignment)

  • STAR (RNA-seq spliced alignment)

See reference.md for detailed technical specs, error profiles, and quality metrics per technology.

Common Tools and Their Behaviors

samtools view

Purpose: Filter, convert, and view SAM/BAM files

Key Flags:

  • -b : Output BAM format

  • -h : Include header

  • -f INT : Require flags (keep reads WITH these flags)

  • -F INT : Filter flags (remove reads WITH these flags)

  • -q INT : Minimum MAPQ threshold

  • -L FILE : Keep reads overlapping regions in BED file

Important Behavior:

  • -L (region filtering) checks each read individually, not pairs

  • Can break read pairs if mates map to different regions

  • Flag filters (-f , -F ) are applied before region filters (-L )

Example - Proper pairs in regions (correct order):

samtools view -b -f 2 -L regions.bed input.bam > proper_pairs_in_regions.bam

bamtools filter

Purpose: Advanced filtering with complex criteria

Common Filters:

  • isPaired: true

  • Read is from paired-end sequencing

  • isProperPair: true

  • Read is part of proper pair

  • isMapped: true

  • Read is mapped

  • mapQuality: >=30

  • Mapping quality threshold

Important Difference from samtools:

  • isProperPair is more strict than samtools -f 2

  • Checks pair validity more thoroughly

samtools fastx

Purpose: Convert SAM/BAM to FASTQ/FASTA

Critical: Use appropriate filters to ensure R1/R2 files match!

See reference.md for complete tool command reference with all options and examples.

Common Patterns and Best Practices

Pattern 1: Filtering Paired-End Data by Regions

WRONG WAY (breaks pairs):

Region filter first -> breaks pairs when mates are in different regions

samtools view -b -L regions.bed input.bam | bamtools filter -isPaired -isProperPair

Result: Empty output (all pairs broken)

RIGHT WAY (preserves pairs):

Proper pair filter FIRST, then region filter

samtools view -b -f 2 -L regions.bed input.bam > output.bam

Pattern 2: Extracting FASTQ from Filtered BAM

For Paired-End:

samtools fastx -1 R1.fq.gz -2 R2.fq.gz
--i1-flags 2 \ # Require proper pair input.bam

For Single-End:

samtools fastx -0 output.fq.gz input.bam

Pattern 3: Quality Filtering

Conservative (high quality):

samtools view -b -q 30 -f 2 -F 256 -F 2048 input.bam

MAPQ >= 30, proper pairs, no secondary/supplementary

Permissive (for low-coverage data):

samtools view -b -q 10 -F 4 input.bam

MAPQ >= 10, mapped reads

Common Issues Summary

Issue 1: Empty Output After Region Filtering (Hi-C Data)

Region filter (samtools view -L ) breaks read pairs. One mate in region, other outside. Proper pair flag lost. Apply proper pair filter BEFORE region filtering:

samtools view -b -f 2 -L regions.bed input.bam > output.bam

Issue 2: R1 and R2 Files Have Different Read Counts

Improper filtering broke some pairs. Require proper pairs during extraction:

samtools fastx -1 R1.fq -2 R2.fq --i1-flags 2 input.bam

Issue 3: Low Mapping Rate for Hi-C Data

This is normal for Hi-C due to chimeric reads. Use Hi-C-specific pipelines (HiC-Pro, Juicer). Don't filter too aggressively on MAPQ.

Issue 4: Proper Pairs Lost After Mapping

Check insert size distribution, reference mismatch, or incorrect orientation flags.

samtools stats input.bam | grep "insert size" samtools flagstat input.bam

See common-issues.md for comprehensive troubleshooting with detailed solutions, including AGP processing issues, HiFi-specific problems, and diagnostic commands.

Quality Metrics

N50 and Related Metrics

N50: Length of the shortest contig at which 50% of total assembly is contained in contigs of that length or longer

Related Metrics:

  • L50: Number of contigs needed to reach N50

  • N90: More stringent than N50 (90% coverage)

  • NG50: N50 relative to genome size (better for comparisons)

Coverage and Depth

Coverage: Percentage of reference bases covered by at least one read Depth: Average number of reads covering each base

Recommended Depths:

  • Genome assembly (HiFi): 30-50x

  • Variant calling: 30x minimum

  • RNA-seq: 20-40 million reads

  • Hi-C scaffolding: 50-100x genomic coverage

See reference.md for complete coverage calculations, BUSCO interpretation, QV scores, and assembly quality metrics.

File Format Quick Reference

FASTA

sequence_id description ATCGATCGATCG

  • Header line starts with >

  • No quality scores

FASTQ

@read_id ATCGATCGATCG + IIIIIIIIIIII

  • Four lines per read

  • Quality scores (Phred+33 encoding typical)

BED

chr1 1000 2000 feature_name score +

  • 0-based coordinates

  • Half-open interval [start, end)

AGP

chr1 1 5000 1 W contig_1 1 5000 + chr1 5001 5100 2 U 100 scaffold yes proximity_ligation

  • Tab-delimited genome assembly format

  • 1-based closed coordinates [start, end]

  • Object and component lengths must match: obj_end - obj_beg + 1 == comp_end - comp_beg + 1

See reference.md for complete AGP specification, coordinate systems, validation rules, and processing patterns. See common-issues.md for AGP coordinate debugging and unloc processing issues.

Coordinate Systems

1-based (SAM, VCF, GFF, AGP): First base is position 1. Interval [2,5] includes positions 2,3,4,5. 0-based (BED, BAM binary): First base is position 0. Interval [2,5) includes positions 2,3,4 (excludes 5).

Conversion: BED_start = SAM_start - 1; BED_end = SAM_end.

Best Practices

General

  • Always check data type: Paired-end vs single-end determines filtering strategy

  • Understand your sequencing technology: Hi-C behaves differently than HiFi

  • Filter in the right order: Proper pairs BEFORE region filtering

  • Validate outputs: Check file sizes, read counts, flagstat

  • Use appropriate MAPQ thresholds: Too stringent = lost data, too permissive = noise

For Hi-C Data

  • Expect distant read pairs: Don't be surprised by different scaffolds

  • Preserve proper pairs: Critical for downstream scaffolding

  • Use paired-aware tools: Standard filters may break pairs

  • Don't over-filter on MAPQ: Hi-C often has lower MAPQ than DNA-seq

For HiFi Data

  • Single-end processing: No pair concerns

  • High quality expected: Can use strict filters

  • Use appropriate presets: minimap2 map-hifi or map-pb

  • Consider read length distribution: HiFi reads vary in length

For Tool Testing

  • Create self-contained datasets: Both mates in selected region

  • Maintain proper pairs: Essential for realistic testing

  • Use representative data: Subsample proportionally, not randomly

  • Verify file sizes: Too small = overly filtered

Related Skills

  • vgp-pipeline - VGP workflows process Hi-C and HiFi data

  • galaxy-tool-wrapping - Galaxy tools work with SAM/BAM and sequencing data formats

  • galaxy-workflow-development - Workflows process sequencing data

Supporting Documentation

  • reference.md: Detailed format specifications (SAM/BAM complete reference, CIGAR operations, AGP format, FASTQ encoding, tool command reference, sequencing technology specs, assembly quality metrics, coverage calculations, coordinate systems)

  • common-issues.md: Comprehensive troubleshooting guide (empty outputs, paired-end issues, quality/mapping problems, format conversion, Hi-C and HiFi specific issues, AGP processing errors, diagnostic commands)

  • genomeark-data-access.md: GenomeArk AWS S3 data access patterns (directory structure evolution, QC data locations for GenomeScope/BUSCO/Merqury, fetching strategies, path normalization, assembly date extraction)

  • genomic-analysis-patterns.md: Domain-specific analysis patterns (karyotype data curation, haploid vs diploid chromosome counts, phylogenetic tree species mapping, BED/telomere analysis, NCBI data integration strategies)

Version History

  • v1.2.0: Split into SKILL.md + supporting files for maintainability; moved GenomeArk access, karyotype/chromosome analysis, phylogenetic mapping, AGP details, and telomere/NCBI patterns to supporting files

  • v1.1.1: Added BED file processing patterns for telomere analysis and NCBI data integration strategies

  • v1.1.0: Added comprehensive AGP format documentation including coordinate validation, unloc processing, and common error patterns

  • v1.0.0: Initial release with SAM/BAM, Hi-C, HiFi, common filtering patterns

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

token-efficiency

No summary provided by upstream source.

Repository SourceNeeds Review
General

folder-organization

No summary provided by upstream source.

Repository SourceNeeds Review
General

obsidian

No summary provided by upstream source.

Repository SourceNeeds Review
General

claude-skill-management

No summary provided by upstream source.

Repository SourceNeeds Review