bio-rnaseq-qc

RNA-seq Quality Control

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "bio-rnaseq-qc" with this command: npx skills add gptomics/bioskills/gptomics-bioskills-bio-rnaseq-qc

RNA-seq Quality Control

RNA-seq specific QC metrics beyond general read quality.

rRNA Contamination Detection

High rRNA content indicates failed rRNA depletion or polyA selection.

SortMeRNA

sortmerna
--ref rRNA_databases/smr_v4.3_default_db.fasta
--reads sample.fastq.gz
--aligned rRNA_reads
--other non_rRNA_reads
--fastx
--threads 8

rrna_count=$(grep -c "^@" rRNA_reads.fastq 2>/dev/null || echo 0) total_count=$(zcat sample.fastq.gz | grep -c "^@") rrna_pct=$(echo "scale=2; $rrna_count / $total_count * 100" | bc) echo "rRNA: ${rrna_pct}%"

BLAST Against rRNA

seqkit sample -n 10000 sample.fastq.gz | seqkit fq2fa > sample_10k.fasta blastn -query sample_10k.fasta -db rrna_db -outfmt 6 -evalue 1e-10 -max_target_seqs 1 | wc -l

Expected rRNA Levels

Library Type Expected rRNA

PolyA selected < 5%

rRNA depleted < 10%

Total RNA 50-80%

Strandedness Verification

RSeQC infer_experiment

infer_experiment.py -i aligned.bam -r genes.bed

Output Interpretation

Fraction of reads explained by "1++,1--,2+-,2-+": 0.9856 # Forward stranded Fraction of reads explained by "1+-,1-+,2++,2--": 0.0144 # Reverse (should be low)

Strand Inference

Tool Setting 1++,1--,2+-,2-+ 1+-,1-+,2++,2--

Forward (dUTP) ~0 ~1

Reverse (Illumina) ~1 ~0

Unstranded ~0.5 ~0.5

Salmon Strandedness

salmon quant -i index -l A -r sample.fastq.gz -o quant/ grep "library_types" quant/lib_format_counts.json

Gene Body Coverage

Check for 3' or 5' bias indicating RNA degradation.

RSeQC geneBody_coverage

geneBody_coverage.py
-i aligned.bam
-r housekeeping_genes.bed
-o coverage

Interpretation

Pattern Indicates

Even coverage Good quality

3' bias Degradation or polyA artifacts

5' bias Incomplete reverse transcription

Steep drop Severe degradation

Read Distribution

RSeQC read_distribution

read_distribution.py -i aligned.bam -r genes.bed > distribution.txt

Expected Distribution

Region Good Library

CDS_Exons 60-80%

UTRs 10-20%

Introns 5-20%

Intergenic < 10%

Transcript Integrity Number (TIN)

Measure of RNA degradation per transcript.

RSeQC tin

tin.py -i aligned.bam -r genes.bed > tin_scores.txt

TIN Interpretation

TIN Score Quality

70 Good

50-70 Moderate

< 50 Poor

Duplication Rate

Picard MarkDuplicates

java -jar picard.jar MarkDuplicates
I=aligned.bam
O=marked.bam
M=dup_metrics.txt
REMOVE_DUPLICATES=false

grep -A 1 "LIBRARY" dup_metrics.txt | tail -1 | cut -f9

RNA-seq Expected Duplication

Library Expected

High complexity < 20%

Low input 20-50%

Concerning

50%

Insert Size (Paired-End)

Picard CollectInsertSizeMetrics

java -jar picard.jar CollectInsertSizeMetrics
I=aligned.bam
O=insert_metrics.txt
H=insert_histogram.pdf

Saturation Analysis

Subsampling Analysis

for frac in 0.1 0.25 0.5 0.75 1.0; do samtools view -bs $frac aligned.bam > sub_${frac}.bam featureCounts -a genes.gtf -o counts_${frac}.txt sub_${frac}.bam detected=$(awk '$7 > 0' counts_${frac}.txt | wc -l) echo "$frac: $detected genes" done

Picard CollectRnaSeqMetrics

Comprehensive RNA-seq metrics from Picard.

java -jar picard.jar CollectRnaSeqMetrics
I=aligned.bam
O=rnaseq_metrics.txt
REF_FLAT=refFlat.txt
STRAND=SECOND_READ_TRANSCRIPTION_STRAND
RIBOSOMAL_INTERVALS=rRNA.interval_list

Key Metrics

Metric Description

PCT_CODING_BASES % in coding regions

PCT_UTR_BASES % in UTRs

PCT_INTRONIC_BASES % in introns

PCT_INTERGENIC_BASES % intergenic

PCT_RIBOSOMAL_BASES % rRNA

MEDIAN_5PRIME_TO_3PRIME_BIAS 3' bias

MultiQC Report

Aggregate all QC metrics.

multiqc fastqc/ star_output/ featurecounts/ -o multiqc_report/

Complete RNA-seq QC Pipeline

#!/bin/bash SAMPLE=$1 BAM=$2 GENES_BED=$3 REF_FLAT=$4

echo "=== RNA-seq QC: $SAMPLE ===" > qc_report.txt

echo -e "\n--- Strandedness ---" >> qc_report.txt infer_experiment.py -i $BAM -r $GENES_BED >> qc_report.txt

echo -e "\n--- Read Distribution ---" >> qc_report.txt read_distribution.py -i $BAM -r $GENES_BED >> qc_report.txt

echo -e "\n--- Gene Body Coverage ---" >> qc_report.txt geneBody_coverage.py -i $BAM -r $GENES_BED -o coverage

echo -e "\n--- TIN Scores ---" >> qc_report.txt tin.py -i $BAM -r $GENES_BED > tin.txt awk '{sum+=$3; count++} END {print "Mean TIN:", sum/count}' tin.txt >> qc_report.txt

echo -e "\n--- Duplication ---" >> qc_report.txt java -jar picard.jar MarkDuplicates I=$BAM O=/dev/null M=dup.txt 2>/dev/null grep -A 1 "LIBRARY" dup.txt | tail -1 | awk '{print "Duplication rate:", $9}' >> qc_report.txt

echo -e "\n--- RNA-seq Metrics ---" >> qc_report.txt java -jar picard.jar CollectRnaSeqMetrics I=$BAM O=rnaseq.txt REF_FLAT=$REF_FLAT STRAND=SECOND_READ_TRANSCRIPTION_STRAND 2>/dev/null grep -A 2 "## METRICS CLASS" rnaseq.txt >> qc_report.txt

cat qc_report.txt

Python QC Summary

import pysam import numpy as np from collections import Counter

def rnaseq_qc(bam_file, sample_size=100000): bam = pysam.AlignmentFile(bam_file, 'rb')

strand_counts = Counter()
insert_sizes = []

for i, read in enumerate(bam.fetch()):
    if i >= sample_size:
        break
    if not read.is_unmapped:
        if read.is_read1:
            if read.is_reverse:
                strand_counts['1-'] += 1
            else:
                strand_counts['1+'] += 1
        if read.is_proper_pair and read.template_length > 0:
            insert_sizes.append(read.template_length)

bam.close()

total = sum(strand_counts.values())
print(f'Read 1 forward: {strand_counts["1+"]/total:.2%}')
print(f'Read 1 reverse: {strand_counts["1-"]/total:.2%}')

if insert_sizes:
    print(f'Median insert: {np.median(insert_sizes):.0f}')

rnaseq_qc('aligned.bam')

QC Thresholds Summary

Metric Good Warning Fail

Mapping rate

85% 70-85% < 70%

rRNA % < 10% 10-20%

20%

Exonic %

60% 40-60% < 40%

Duplication < 20% 20-40%

40%

Mean TIN

70 50-70 < 50

3' bias < 1.5 1.5-2

2

Related Skills

  • quality-reports - General FastQC

  • fastp-workflow - Read trimming

  • alignment-files/alignment-validation - General BAM QC

  • rna-quantification/featurecounts-counting - Quantification after QC

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

bioskills

No summary provided by upstream source.

Repository SourceNeeds Review
General

bio-data-visualization-genome-tracks

No summary provided by upstream source.

Repository SourceNeeds Review
General

bio-epitranscriptomics-merip-preprocessing

No summary provided by upstream source.

Repository SourceNeeds Review
General

bio-data-visualization-multipanel-figures

No summary provided by upstream source.

Repository SourceNeeds Review