Contamination Screening
Screen FASTQ files against multiple genomes to identify contamination sources using FastQ Screen.
FastQ Screen Overview
FastQ Screen aligns a subset of reads against multiple reference genomes to identify:
- Cross-species contamination
- Bacterial/viral contamination
- Adapter sequences
- PhiX spike-in
- Sample swaps
Basic Usage
Screen against configured genomes
fastq_screen sample.fastq.gz
Multiple files
fastq_screen *.fastq.gz
Specify output directory
fastq_screen --outdir qc_results/ sample.fastq.gz
Custom config file
fastq_screen --conf my_screen.conf sample.fastq.gz
Configuration File
Create fastq_screen.conf:
Database locations
DATABASE Human /path/to/human/genome
DATABASE Mouse /path/to/mouse/genome
DATABASE Ecoli /path/to/ecoli/genome
DATABASE PhiX /path/to/phix/genome
DATABASE Adapters /path/to/adapters
DATABASE rRNA /path/to/rrna
Aligner (bowtie2 recommended)
BOWTIE2 /path/to/bowtie2
Or use BWA
BWA /path/to/bwa
Threads
THREADS 8
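A mistyped index path is an easy mistake to make in this file. The sketch below is one way to check every DATABASE entry before a run. It assumes Bowtie2 indices (a file ending in .1.bt2 or .1.bt2l next to the given basename) and paths without spaces. check_indices is a hypothetical helper, not part of FastQ Screen, and the demo uses throwaway paths.

```bash
# Report whether each DATABASE entry in a config points at a Bowtie2 index.
# $1: path to a fastq_screen config file (hypothetical helper, not a FastQ Screen command)
check_indices() {
    # Lines whose first field is not DATABASE (comments, BOWTIE2, THREADS) are skipped.
    awk '$1 == "DATABASE" { print $2, $3 }' "$1" |
    while read -r name base; do
        # Small indices end in .1.bt2, large ones in .1.bt2l
        if [ -e "$base.1.bt2" ] || [ -e "$base.1.bt2l" ]; then
            echo "OK      $name"
        else
            echo "MISSING $name ($base)"
        fi
    done
}

# Demo in a temporary directory: only the PhiX "index" exists.
tmp=$(mktemp -d)
touch "$tmp/phix.1.bt2"
cat > "$tmp/fastq_screen.conf" <<EOF
# Database locations
DATABASE Human $tmp/human
DATABASE PhiX  $tmp/phix
THREADS 8
EOF
check_indices "$tmp/fastq_screen.conf"
rm -rf "$tmp"
```

In the demo, Human is reported as MISSING and PhiX as OK. For BWA indices the file suffixes differ, so the test inside the loop would need adjusting.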
Pre-built Databases
Download common screening databases
fastq_screen --get_genomes
Downloads to a FastQ_Screen_Genomes/ folder in the current directory (or under --outdir), together with a ready-made fastq_screen.conf
Includes: Human, Mouse, Rat, E.coli, PhiX, Adapters, etc.
Screening Options
Number of reads to sample (default 100000)
fastq_screen --subset 200000 sample.fastq.gz
Use all reads (slow)
fastq_screen --subset 0 sample.fastq.gz
Set threads
fastq_screen --threads 8 sample.fastq.gz
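Paired-end data (each file is screened as single-end; R1 alone is usually enough)
fastq_screen sample_R1.fastq.gz
Screen both mates (processed independently; recent releases no longer support --paired, so check fastq_screen --help for your version)
fastq_screen sample_R1.fastq.gz sample_R2.fastq.gz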
Output Options
Generate PNG plot (default)
fastq_screen sample.fastq.gz
No plot (text only)
fastq_screen --nograph sample.fastq.gz
Tag each read header with the genomes it maps to (writes a tagged FASTQ file; the whole file is processed unless --subset is given)
fastq_screen --tag sample.fastq.gz
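Filter reads by mapping: keep reads that map to none of the genomes (one character per DATABASE entry, in config order; six for the config above)
fastq_screen --filter 000000 sample.fastq.gz
Keep reads mapping uniquely to the first genome (e.g., Human), ignoring all other genomes
fastq_screen --filter 1----- sample.fastq.gz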
Filter Codes
Use --filter to select reads based on mapping status. The string has one code per genome, in the order the genomes appear in the config file:
Code    Meaning
0       Did not map to genome
1       Mapped uniquely
2       Mapped more than once
3       Mapped (unique or multi)
-       Ignore this genome
Example: Keep reads mapping only to Human (first of six genomes)
Human:1, all others:0
fastq_screen --filter 100000 sample.fastq.gz
Keep reads NOT mapping to anything (clean reads)
fastq_screen --filter 000000 sample.fastq.gz
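Because the filter string is positional, it is easy to get its length or order wrong. As a rough sketch, the hypothetical helper make_filter below reads the DATABASE lines of a config and builds the string for one target genome. It is not part of FastQ Screen, and the demo config simply mirrors the six placeholder entries from the Configuration File section.

```bash
# Build a --filter string with one character per DATABASE line, in config order.
# $1: config file, $2: genome name to select, $3: code for that genome,
# $4: code for every other genome (defaults to "-", i.e. ignore).
# Hypothetical helper, not a FastQ Screen command.
make_filter() {
    awk -v target="$2" -v hit="$3" -v other="${4:--}" '
        $1 == "DATABASE" {
            s = s ($2 == target ? hit : other)
            found += ($2 == target)
        }
        END {
            if (!found) exit 1      # unknown genome name: print nothing, fail
            print s
        }
    ' "$1"
}

# Demo config with placeholder paths.
conf=$(mktemp)
cat > "$conf" <<'EOF'
DATABASE Human /path/to/human/genome
DATABASE Mouse /path/to/mouse/genome
DATABASE Ecoli /path/to/ecoli/genome
DATABASE PhiX /path/to/phix/genome
DATABASE Adapters /path/to/adapters
DATABASE rRNA /path/to/rrna
EOF
make_filter "$conf" Human 1 0    # prints 100000
make_filter "$conf" PhiX 3       # prints ---3--
rm -f "$conf"
```

The result can be passed straight to the tool, for example fastq_screen --filter "$(make_filter fastq_screen.conf Human 1 0)" sample.fastq.gz.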
Output Files
File             Description
*_screen.txt     Tab-delimited results
*_screen.png     Visualization
*_screen.html    HTML report
Results Format
#Fastq_screen version: 0.15.3
Genome #Reads_processed #Unmapped %Unmapped #One_hit_one_genome %One_hit_one_genome #Multiple_hits_one_genome %Multiple_hits_one_genome #One_hit_multiple_genomes %One_hit_multiple_genomes Multiple_hits_multiple_genomes %Multiple_hits_multiple_genomes
Human 100000 2000 2.00 95000 95.00 1000 1.00 1500 1.50 500 0.50
Mouse 100000 98000 98.00 100 0.10 50 0.05 1500 1.50 350 0.35
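The twelve columns are easier to read once collapsed to a couple of numbers per genome. The sketch below is one possible summary, assuming the tab-delimited layout shown above: total mapped (100 minus %Unmapped) and reads that hit only that genome (%One_hit_one_genome plus %Multiple_hits_one_genome). summarise_screen is a hypothetical helper, and the demo file is rebuilt from the example rows.

```bash
# Print: genome, % mapped, % mapping to this genome only (tab-separated).
# $1: a *_screen.txt file (hypothetical helper, not a FastQ Screen command)
summarise_screen() {
    awk -F'\t' '
        # Skip the version line, the column header, and short footer/blank lines
        /^#/ || $1 == "Genome" || NF < 12 { next }
        {
            mapped    = 100 - $4     # 100 - %Unmapped
            only_here = $6 + $8      # %One_hit_one_genome + %Multiple_hits_one_genome
            printf "%s\t%.2f\t%.2f\n", $1, mapped, only_here
        }
    ' "$1"
}

# Demo: recreate the example table, converting spaces to tabs.
f=$(mktemp)
tr ' ' '\t' > "$f" <<'EOF'
#Fastq_screen version: 0.15.3
Genome #Reads_processed #Unmapped %Unmapped #One_hit_one_genome %One_hit_one_genome #Multiple_hits_one_genome %Multiple_hits_one_genome #One_hit_multiple_genomes %One_hit_multiple_genomes Multiple_hits_multiple_genomes %Multiple_hits_multiple_genomes
Human 100000 2000 2.00 95000 95.00 1000 1.00 1500 1.50 500 0.50
Mouse 100000 98000 98.00 100 0.10 50 0.05 1500 1.50 350 0.35
EOF
summarise_screen "$f"
rm -f "$f"
```

For the example rows this gives Human 98.00 / 96.00 and Mouse 2.00 / 0.15. Most of the Mouse signal comes from reads that also map elsewhere, which is typical of conserved sequence rather than contamination.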
Interpreting Results
Expected Results by Sample Type
Sample Type      Expected Pattern
Human sample     >90% Human, <1% others
Mouse sample     >90% Mouse, <1% others
Human + PhiX     ~80% Human, ~10% PhiX
Contaminated     Significant % mapping to an unexpected genome
Common Issues
Pattern             Likely Cause
High adapter %      Library prep issue
High PhiX %         Spike-in not removed
High E.coli %       Bacterial contamination
High rRNA %         rRNA depletion failed
Multiple species    Sample swap or contamination
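These patterns can be checked automatically across many samples. The sketch below flags any genome other than the expected one whose genome-specific hits (%One_hit_one_genome plus %Multiple_hits_one_genome) reach a threshold. The helper name flag_unexpected, the 1% default, and the demo numbers are all illustrative assumptions, not FastQ Screen recommendations.

```bash
# Print genomes (other than the expected one) with genome-specific hits >= threshold.
# $1: *_screen.txt, $2: expected genome name, $3: threshold in percent (default 1)
# Hypothetical helper, not a FastQ Screen command.
flag_unexpected() {
    awk -F'\t' -v expected="$2" -v limit="${3:-1}" '
        # Skip the version line, the column header, and short footer/blank lines
        /^#/ || $1 == "Genome" || NF < 12 { next }
        $1 != expected && ($6 + $8) >= limit {
            printf "%s\t%.2f\n", $1, $6 + $8
        }
    ' "$1"
}

# Demo with made-up numbers: a Human sample carrying some E. coli reads.
f=$(mktemp)
tr ' ' '\t' > "$f" <<'EOF'
Genome #Reads_processed #Unmapped %Unmapped #One_hit_one_genome %One_hit_one_genome #Multiple_hits_one_genome %Multiple_hits_one_genome #One_hit_multiple_genomes %One_hit_multiple_genomes Multiple_hits_multiple_genomes %Multiple_hits_multiple_genomes
Human 100000 6000 6.00 90000 90.00 2500 2.50 1000 1.00 500 0.50
Mouse 100000 98500 98.50 100 0.10 50 0.05 1000 1.00 350 0.35
Ecoli 100000 96000 96.00 3800 3.80 200 0.20 0 0.00 0 0.00
EOF
flag_unexpected "$f" Human 1
rm -f "$f"
```

Here only Ecoli is reported (4.00%). Mouse stays below the threshold because nearly all of its hits are shared with other genomes. A sensible threshold depends on the library type and should be tuned per project.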
MultiQC Integration
FastQ Screen results are automatically detected by MultiQC:
Screen all samples
for f in *.fastq.gz; do
    fastq_screen --outdir screen_results/ "$f"
done
Aggregate with MultiQC
multiqc screen_results/
Custom Database Setup
Create Bowtie2 Index
Index a FASTA file
bowtie2-build reference.fa reference
Add to config
DATABASE MyGenome /path/to/reference
Common Databases to Include
Genome            Purpose
Human (GRCh38)    Human samples
Mouse (GRCm39)    Mouse samples
E. coli           Bacterial contamination
PhiX              Illumina spike-in
Adapters          Library prep
rRNA              Ribosomal RNA
Vectors           Cloning vectors
Mycoplasma        Cell culture contamination
Example Workflows
Standard Screening
Download databases
fastq_screen --get_genomes
Screen samples
fastq_screen --outdir screen_results/ --threads 8 *.fastq.gz
Check results
multiqc screen_results/
Remove Contamination
Screen and tag reads
fastq_screen --tag sample.fastq.gz
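Filter to keep reads that map to Human (assuming Human is the first of six databases); hits to other genomes are ignored, so use 300000 instead to require Human-only reads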
fastq_screen --filter 3----- --tag sample.fastq.gz
Or use BBDuk for removal
bbduk.sh in=sample.fastq.gz out=clean.fastq.gz \
    ref=contaminants.fa k=31 hdist=1
Related Skills
- quality-reports - FastQC shows overrepresented sequences
- adapter-trimming - Remove adapter contamination
- metagenomics - Deeper taxonomic analysis