bio-genome-assembly-contamination-detection

Contamination Detection

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "bio-genome-assembly-contamination-detection" with this command: npx skills add gptomics/bioskills/gptomics-bioskills-bio-genome-assembly-contamination-detection

Contamination Detection

CheckM2 (Recommended)

Run CheckM2 on single genome

checkm2 predict --input assembly.fa --output-directory checkm2_output --threads 16

Run on multiple genomes (directory of FASTAs)

checkm2 predict --input genomes/ --output-directory checkm2_output
--threads 16 --extension fa

Output: quality_report.tsv with Completeness, Contamination, Coding_Density

Interpret CheckM2 Results

quality_report.tsv columns:

Name, Completeness, Contamination, Completeness_Model_Used,

Translation_Table_Used, Coding_Density, Contig_N50, Average_Gene_Length,

Genome_Size, GC_Content, Total_Coding_Sequences

Filter high-quality genomes (MIMAG standards)

awk -F'\t' 'NR==1 || ($2 > 90 && $3 < 5)' quality_report.tsv > high_quality_mags.tsv

Medium quality

awk -F'\t' 'NR==1 || ($2 >= 50 && $3 < 10)' quality_report.tsv > medium_quality_mags.tsv

CheckM (Original)

Run CheckM lineage workflow

checkm lineage_wf -t 16 -x fa genomes/ checkm_output/

Generate summary

checkm qa checkm_output/lineage.ms checkm_output/ -o 2 -f checkm_summary.tsv --tab_table

Extended report with marker genes

checkm qa checkm_output/lineage.ms checkm_output/ -o 2 --tab_table
-f checkm_extended.tsv

CheckM Plots

Completeness vs Contamination plot

checkm bin_qa_plot -x fa checkm_output/ genomes/ plots/

GC and coding density

checkm coding_plot -x fa checkm_output/ genomes/ plots/

Marker gene positions

checkm marker_plot -x fa checkm_output/ genomes/ plots/

GTDB-Tk Taxonomic Classification

Classify genomes

gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdbtk_output
--extension fa --cpus 16

With species-level ANI

gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdbtk_output
--extension fa --cpus 16 --skip_ani_screen

Output files:

gtdbtk.bac120.summary.tsv - bacterial classifications

gtdbtk.ar53.summary.tsv - archaeal classifications

GTDB-Tk De Novo Workflow

When genomes may include novel taxa

gtdbtk de_novo_wf --genome_dir genomes/ --out_dir gtdbtk_denovo
--bacteria --extension fa --cpus 16

GUNC Chimerism Detection

Run GUNC

gunc run -d genomes/ -o gunc_output -t 16 -e .fa

Output: GUNC.progenomes_2.1.maxCSS_level.tsv

Key columns: pass.GUNC (true/false), contamination_portion, clade_separation_score

Filter chimeric genomes

awk -F'\t' '$8 == "False"' GUNC.progenomes_2.1.maxCSS_level.tsv > chimeric_genomes.tsv

GUNC Interpretation

GUNC flags genomes as chimeric if:

- clade_separation_score (CSS) > 0.45

- contamination_portion > 0.05

- reference_representation_score > 0.5

Combine with CheckM2 for full QC

join -t$'\t' -1 1 -2 1
<(sort checkm2_output/quality_report.tsv)
<(sort gunc_output/GUNC.progenomes_2.1.maxCSS_level.tsv)
> combined_qc.tsv

Comprehensive QC Pipeline

#!/bin/bash GENOMES_DIR=$1 OUTPUT_DIR=$2 THREADS=${3:-16}

mkdir -p "$OUTPUT_DIR"

Run CheckM2

echo "Running CheckM2..." checkm2 predict --input "$GENOMES_DIR" --output-directory "$OUTPUT_DIR/checkm2"
--threads "$THREADS" --extension fa

Run GUNC

echo "Running GUNC..." gunc run -d "$GENOMES_DIR" -o "$OUTPUT_DIR/gunc" -t "$THREADS" -e .fa

Run GTDB-Tk

echo "Running GTDB-Tk..." gtdbtk classify_wf --genome_dir "$GENOMES_DIR" --out_dir "$OUTPUT_DIR/gtdbtk"
--extension fa --cpus "$THREADS"

echo "QC complete!"

Filter by Quality Standards

import pandas as pd

checkm = pd.read_csv('checkm2_output/quality_report.tsv', sep='\t') gunc = pd.read_csv('gunc_output/GUNC.progenomes_2.1.maxCSS_level.tsv', sep='\t')

merged = checkm.merge(gunc, left_on='Name', right_on='genome', how='left')

MIMAG High Quality: >90% complete, <5% contamination, not chimeric

hq = merged[(merged['Completeness'] > 90) & (merged['Contamination'] < 5) & (merged['pass.GUNC'] == True)]

MIMAG Medium Quality: >50% complete, <10% contamination

mq = merged[(merged['Completeness'] >= 50) & (merged['Contamination'] < 10)]

hq.to_csv('high_quality_genomes.tsv', sep='\t', index=False) mq.to_csv('medium_quality_genomes.tsv', sep='\t', index=False)

Remove Contamination

Use MAGpurify to remove contaminating contigs

magpurify phylo-markers genome.fa magpurify_output magpurify clade-markers genome.fa magpurify_output magpurify conspecific genome.fa magpurify_output magpurify tetra-freq genome.fa magpurify_output magpurify gc-content genome.fa magpurify_output magpurify known-contam genome.fa magpurify_output magpurify clean-bin genome.fa magpurify_output cleaned_genome.fa

Detect Foreign Contigs

Contig-level taxonomy with CAT

CAT contigs -c assembly.fa -d CAT_database -t CAT_taxonomy
-o cat_output -n 16

Parse results

CAT add_names -i cat_output.contig2classification.txt
-o cat_output.contig2classification.named.txt
-t CAT_taxonomy --only_official

Flag contigs with different taxonomy than majority

awk -F'\t' '{print $1, $NF}' cat_output.contig2classification.named.txt |
sort | uniq -c | sort -rn

Decontaminate with BlobTools

Create BlobDB

blobtools create -i assembly.fa -b aligned.bam -t blast_hits.txt
-o blobtools_output

Generate plots

blobtools plot -i blobtools_output.blobDB.json

Filter by taxonomy

blobtools view -i blobtools_output.blobDB.json -r all -o filtered

Related Skills

  • genome-assembly/assembly-qc - BUSCO and other QC

  • genome-assembly/long-read-assembly - Assembly methods

  • metagenomics/taxonomic-profiling - Metagenome analysis

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

bioskills

No summary provided by upstream source.

Repository SourceNeeds Review
General

bio-data-visualization-genome-tracks

No summary provided by upstream source.

Repository SourceNeeds Review
General

bio-epitranscriptomics-merip-preprocessing

No summary provided by upstream source.

Repository SourceNeeds Review
General

bio-data-visualization-multipanel-figures

No summary provided by upstream source.

Repository SourceNeeds Review