scikit-bio

Python bioinformatics library for sequence manipulation, alignments, phylogenetics, diversity metrics (Shannon, UniFrac), ordination (PCoA, CCA), statistical tests (PERMANOVA, Mantel), and biological file format I/O.

Install skill "scikit-bio" with this command: npx skills add aminoanalytica/amina-skills/aminoanalytica-amina-skills-scikit-bio

A Python library for biological data analysis spanning sequence handling, phylogenetics, microbial ecology, and multivariate statistics.

When to Apply

Use this skill when users need to:

| Task category | Examples |
| --- | --- |
| Sequence work | DNA/RNA/protein manipulation, motif finding, translation |
| File handling | FASTA, FASTQ, GenBank, Newick, BIOM I/O |
| Alignments | Pairwise or multiple sequence alignment |
| Phylogenetics | Tree construction, manipulation, distance calculations |
| Diversity metrics | Alpha diversity (Shannon, Faith's PD), beta diversity (Bray-Curtis, UniFrac) |
| Ordination | PCoA, CCA, RDA for dimensionality reduction |
| Statistical tests | PERMANOVA, ANOSIM, Mantel tests |
| Microbiome analysis | Feature tables, rarefaction, community comparisons |

Installation

uv pip install scikit-bio

Sequences

Work with biological sequences through specialized DNA, RNA, and Protein classes.

import skbio

# Load from file
seq = skbio.DNA.read('gene.fasta')

# Common operations
complement = seq.reverse_complement()
messenger = seq.transcribe()
peptide = messenger.translate()

# Pattern search
hits = seq.find_with_regex('ATG[ACGT]{6}TAA')

# Properties
contains_ambiguous = seq.has_degenerates()
clean_seq = seq.degap()

Metadata types:

  • Sequence-level: ID, description, source organism
  • Positional: Per-base quality scores (from FASTQ)
  • Interval: Feature annotations, gene boundaries
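
To make concrete what the sequence operations above compute, here is a minimal standard-library sketch. It is not scikit-bio's implementation; the function names are hypothetical, it handles only unambiguous DNA characters, and the example sequence is made up for illustration.

```python
import re

# Illustrative only: plain-Python versions of three DNA operations.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.translate(_COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Replace thymine with uracil to get the RNA version of the strand."""
    return dna.replace("T", "U").replace("t", "u")


def motif_spans(dna: str, pattern: str = r"ATG[ACGT]{6}TAA"):
    """Return (start, end) spans of every non-overlapping regex match."""
    return [match.span() for match in re.finditer(pattern, dna)]


example = "ATGAAACCCTAA"  # made-up sequence
print(reverse_complement(example))
print(transcribe(example))
print(motif_spans(example))
```

The library classes add validation, metadata, and degenerate-character handling on top of these basic transformations.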

Sequence Alignment

Pairwise and multiple alignment using dynamic programming.

import skbio
from skbio.alignment import local_pairwise_align_ssw, TabularMSA

# Local alignment (Smith-Waterman); returns the alignment, its score,
# and the start/end positions of each input sequence
aligned, score, positions = local_pairwise_align_ssw(query_seq, target_seq)

# Load existing alignment
alignment = TabularMSA.read('msa.fasta', constructor=skbio.DNA)

# Derive consensus
consensus_seq = alignment.consensus()

Notes:

  • local_pairwise_align_ssw provides fast SSW-based local alignment
  • StripedSmithWaterman handles protein sequences with substitution matrices
  • Affine gap penalties suit biological sequences best
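
The dynamic-programming idea behind local alignment can be sketched in a few lines. This is a simplified illustration, not the SSW algorithm the library uses: it assumes a linear gap penalty rather than affine gaps, returns only the best score, and its function name and scoring values are hypothetical.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Keeping only two rows of the matrix is enough when the score alone is needed; recovering the aligned residues requires the full matrix and a traceback.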

Phylogenetic Trees

Construct and analyze evolutionary trees.

from skbio import TreeNode
from skbio.tree import nj, upgma

# Build from distances
phylogeny = nj(distance_matrix)

# Load existing tree
phylogeny = TreeNode.read('species.nwk')

# Extract subset
clade = phylogeny.shear(['mouse', 'rat', 'human'])

# Enumerate leaf nodes
leaves = list(phylogeny.tips())

# Common ancestor
ancestor = phylogeny.lowest_common_ancestor(['mouse', 'rat'])

# Branch length between taxa
branch_dist = phylogeny.find('mouse').distance(phylogeny.find('rat'))

# Pairwise distances for all tips (returns a DistanceMatrix)
pairwise_dm = phylogeny.tip_tip_distances()

# Topology comparison (Robinson-Foulds distance)
rf_diff = phylogeny.compare_rfd(other_tree)

Tree construction methods:

| Method | Use case |
| --- | --- |
| nj() | Standard neighbor-joining |
| upgma() | Assumes molecular clock |
| bme() | Scalable for large datasets |
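
The topology comparison mentioned above can be illustrated without the library. The sketch below computes a clade-based (rooted) Robinson-Foulds-style distance on trees written as nested tuples of tip names. The tuple representation and function names are assumptions made for this example, and the numbers are not guaranteed to match scikit-bio's output for the same trees.

```python
def clades(tree):
    """Collect the tip set under every internal node of a nested-tuple tree."""
    found = set()

    def collect(node):
        if isinstance(node, str):  # a tip
            return frozenset([node])
        tips = frozenset().union(*(collect(child) for child in node))
        found.add(tips)
        return tips

    collect(tree)
    return found


def rooted_rf(tree_a, tree_b) -> int:
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


tree_1 = (("mouse", "rat"), ("human", "chimp"))
tree_2 = (("mouse", "human"), ("rat", "chimp"))
print(rooted_rf(tree_1, tree_2))
```

Two trees with identical groupings score 0 regardless of the order in which children are written.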

Diversity Analysis

Calculate ecological diversity metrics.

Alpha Diversity (within-sample)

import numpy as np
from skbio.diversity import alpha_diversity

# Sample abundance matrix
abundances = np.array([
    [45, 12, 0, 8],
    [5, 0, 33, 17],
    [20, 20, 15, 10]
])
samples = ['gut_1', 'gut_2', 'gut_3']

# Richness and evenness metrics
shannon_vals = alpha_diversity('shannon', abundances, ids=samples)
simpson_vals = alpha_diversity('simpson', abundances, ids=samples)

# Phylogenetic diversity (requires a tree; feature_names holds one ID per
# abundance column, each matching a tip name in the tree)
faith_vals = alpha_diversity('faith_pd', abundances, ids=samples,
                             tree=phylogeny, otu_ids=feature_names)
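
For reference, the two non-phylogenetic indices used above reduce to short formulas over the relative abundances p_i: Shannon is the negative sum of p_i log p_i, and the Simpson index is one minus the sum of p_i squared. The sketch below is a standard-library illustration with hypothetical function names; the logarithm base is passed explicitly because the library's default may differ between versions.

```python
import math


def shannon(counts, base: float = 2.0) -> float:
    """Shannon entropy of a count vector, ignoring zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base)
                for c in counts if c > 0)


def simpson(counts) -> float:
    """Simpson's index: 1 minus the sum of squared relative abundances."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


abundances = [
    [45, 12, 0, 8],
    [5, 0, 33, 17],
    [20, 20, 15, 10],
]
for sample_id, row in zip(["gut_1", "gut_2", "gut_3"], abundances):
    print(sample_id, round(shannon(row), 3), round(simpson(row), 3))
```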

Beta Diversity (between-sample)

from skbio.diversity import beta_diversity

# Distance matrices
bray_dm = beta_diversity('braycurtis', abundances, ids=samples)
unifrac_dm = beta_diversity('weighted_unifrac', abundances, ids=samples,
                            tree=phylogeny, otu_ids=feature_names)

Key points:

  • Input must be integer counts, not proportions
  • Phylogenetic metrics require a tree matching feature IDs
  • partial_beta_diversity() computes specific sample pairs efficiently
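
As an illustration of what a between-sample metric computes, Bray-Curtis dissimilarity is the sum of absolute count differences divided by the sum of all counts in both samples. The sketch below uses only the standard library and hypothetical function names; it is not the library's implementation.

```python
def bray_curtis(u, v) -> float:
    """Bray-Curtis dissimilarity between two count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def pairwise(rows, metric):
    """Full square matrix of metric values for every pair of rows."""
    return [[metric(r1, r2) for r2 in rows] for r1 in rows]


abundances = [
    [45, 12, 0, 8],
    [5, 0, 33, 17],
    [20, 20, 15, 10],
]
for row in pairwise(abundances, bray_curtis):
    print([round(value, 3) for value in row])
```

The result is symmetric with a zero diagonal, which is the shape a distance-matrix container expects.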

Ordination

Project high-dimensional data to visualizable spaces.

from skbio.stats.ordination import pcoa, cca

# PCoA from distance matrix
coords = pcoa(bray_dm)
axis1 = coords.samples['PC1']
axis2 = coords.samples['PC2']
variance_explained = coords.proportion_explained

# CCA with environmental predictors
constrained = cca(species_abundances, environmental_vars)

Methods:

| Function | Input | Purpose |
| --- | --- | --- |
| pcoa() | Distance matrix | Unconstrained ordination |
| cca() | Abundance + environment | Constrained ordination (unimodal) |
| rda() | Abundance + environment | Constrained ordination (linear) |
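
Classical PCoA starts by double-centering the squared distances (Gower centering) and then eigendecomposes the centered matrix to obtain coordinates. The sketch below shows only that first centering step in plain Python; the function name is hypothetical and the eigendecomposition is left out.

```python
def gower_center(dist):
    """Double-center -0.5 * d^2 so that every row and column sums to zero."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in a]
    col_means = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [[a[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


distances = [
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0],
]
for row in gower_center(distances):
    print([round(value, 4) for value in row])
```

For two points separated by distance 2, the centered matrix is [[1, -1], [-1, 1]], which corresponds to coordinates of +1 and -1 on a single axis.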

Statistical Tests

Hypothesis testing for ecological data.

from skbio.stats.distance import permanova, anosim, mantel

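# Group comparison: one label per sample, in the same order as the IDs of
# the distance matrix (this example assumes a four-sample bray_dm)
treatment_groups = ['control', 'control', 'treated', 'treated']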
perm_result = permanova(bray_dm, treatment_groups, permutations=999)
print(f"F = {perm_result['test statistic']:.3f}, p = {perm_result['p-value']:.4f}")

# Alternative group test
anos_result = anosim(bray_dm, treatment_groups, permutations=999)

# Matrix correlation
r, pval, n = mantel(genetic_dm, geographic_dm, method='spearman', permutations=999)
print(f"r = {r:.3f}, p = {pval:.4f}")

Test overview:

| Test | Purpose | Key output |
| --- | --- | --- |
| PERMANOVA | Group differences | F-statistic, p-value |
| ANOSIM | Group differences (alternative) | R-statistic, p-value |
| PERMDISP | Homogeneity of group dispersions (a PERMANOVA assumption) | F-statistic, p-value |
| Mantel | Matrix correlation | Correlation coefficient, p-value |
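
The logic of a permutation test such as PERMANOVA can be sketched as follows: compute a pseudo-F statistic from the within-group and among-group sums of squared distances, then shuffle the group labels to see how often a permuted statistic is at least as large. This is a minimal standard-library sketch with hypothetical function names and made-up distances, not the library's implementation.

```python
import random


def pseudo_f(dist, groups) -> float:
    """Pseudo-F statistic from a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i, g in enumerate(groups) if g == label]
        ss = sum(dist[i][j] ** 2
                 for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += ss / len(members)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permutation_test(dist, groups, permutations: int = 999, seed: int = 0):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    # Adding 1 counts the observed labelling as one of the permutations.
    return observed, (hits + 1) / (permutations + 1)


# Made-up distances: small within groups (0.1), large between groups (0.9).
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
f_stat, p_value = permutation_test(dist, ["control", "control", "treated", "treated"])
print(f_stat, p_value)
```

With only four samples there are very few distinct labellings, so the p-value cannot become small no matter how strong the separation is; real studies need more samples per group.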

File I/O

Read and write 19+ biological formats.

import skbio

# Automatic format detection
tree = skbio.TreeNode.read('phylogeny.nwk')

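# Memory-efficient iteration (FASTQ needs a quality encoding; 'illumina1.8'
# is an assumption here, so match it to your data)
for record in skbio.io.read('reads.fastq', format='fastq',
                            variant='illumina1.8', constructor=skbio.DNA):
    if record.positional_metadata['quality'].mean() > 30:
        process(record)  # process() stands in for your own handler

# Format conversion
records = skbio.io.read('sequences.fastq', format='fastq',
                        variant='illumina1.8', constructor=skbio.DNA)
skbio.io.write(records, format='fasta', into='sequences.fasta')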

Supported formats:

| Category | Formats |
| --- | --- |
| Sequences | FASTA, FASTQ, GenBank, EMBL, QSeq |
| Alignments | Clustal, PHYLIP, Stockholm |
| Trees | Newick |
| Tables | BIOM (HDF5/JSON) |
| Distances | Delimited matrices |
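
The quality filter in the FASTQ example above relies on Phred scores decoded from the quality line of each record. A minimal sketch of that decoding, assuming the common offset of 33 (Sanger / Illumina 1.8+ encoding) and hypothetical function names:

```python
def phred_scores(quality: str, offset: int = 33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(symbol) - offset for symbol in quality]


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score across all positions of a read."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)


def passes_filter(quality: str, threshold: float = 30, offset: int = 33) -> bool:
    """True when the read's mean quality exceeds the threshold."""
    return mean_quality(quality, offset) > threshold


for quality_line in ["IIIIIIII", "55555555"]:  # made-up quality strings
    print(quality_line, mean_quality(quality_line), passes_filter(quality_line))
```

Older Illumina pipelines used an offset of 64, which is why the encoding has to be stated when reading FASTQ files.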

Distance Matrices

Store and manipulate pairwise distances.

from skbio import DistanceMatrix
import numpy as np

# Create from array
distances = np.array([
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0]
])
dm = DistanceMatrix(distances, ids=['sp_A', 'sp_B', 'sp_C'])

# Access elements
pair_dist = dm['sp_A', 'sp_B']
all_from_a = dm['sp_A']

# Subset
subset_dm = dm.filter(['sp_A', 'sp_C'])
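
A distance matrix is expected to be square, symmetric, and hollow (zeros on the diagonal). The check below is a standard-library sketch of those invariants with a hypothetical function name; the library performs its own validation when a DistanceMatrix is constructed.

```python
def is_valid_distance_matrix(matrix, tol: float = 1e-12) -> bool:
    """Check that a nested list is square, hollow, and symmetric."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    hollow = all(abs(matrix[i][i]) <= tol for i in range(n))
    symmetric = all(abs(matrix[i][j] - matrix[j][i]) <= tol
                    for i in range(n) for j in range(i + 1, n))
    return hollow and symmetric


distances = [
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0],
]
print(is_valid_distance_matrix(distances))
```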

Feature Tables (BIOM)

Handle OTU/ASV abundance tables.

from skbio import Table

# Load table
tbl = Table.read('features.biom')

# Inspect structure
sample_names = tbl.ids(axis='sample')
feature_names = tbl.ids(axis='observation')

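# Filter by abundance (inplace=False returns a new table instead of
# modifying tbl, which is what filter() does by default)
filtered = tbl.filter(lambda row, id_, md: row.sum() > 500,
                      axis='sample', inplace=False)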

# Convert to pandas
df = tbl.to_dataframe()

Protein Embeddings

Bridge language model outputs with scikit-bio analysis.

from skbio.embedding import ProteinEmbedding

# Load embeddings (from ESM, ProtTrans, etc.)
emb = ProteinEmbedding(embedding_matrix, protein_ids)

# Create distance matrix for downstream analysis
emb_dm = emb.to_distances(metric='cosine')

# Ordination visualization
emb_pcoa = emb.to_ordination(metric='euclidean', method='pcoa')

Typical Workflows

Microbiome diversity study:

  1. Load BIOM table and phylogenetic tree
  2. Calculate alpha diversity per sample
  3. Compute beta diversity (UniFrac)
  4. Ordinate with PCoA
  5. Test group differences with PERMANOVA

Phylogenetic inference:

  1. Read sequences from FASTA
  2. Perform multiple alignment
  3. Calculate pairwise distances
  4. Construct tree with neighbor-joining
  5. Analyze clade relationships

Sequence processing:

  1. Read FASTQ with quality scores
  2. Filter low-quality reads
  3. Search for motifs
  4. Translate to protein
  5. Export as FASTA

Performance Tips

  • Use generators for large sequence files
  • Prefer BIOM HDF5 over JSON for big tables
  • Apply partial_beta_diversity() when computing only specific pairs
  • Choose BME for very large phylogenies

Ecosystem Integration

| Library | Integration |
| --- | --- |
| pandas | DataFrames from distance matrices, diversity results |
| numpy | Array conversions throughout |
| matplotlib/seaborn | Plot ordination results, heatmaps |
| scikit-learn | Distance matrices as input |
| QIIME 2 | Native BIOM, tree, distance matrix compatibility |

Reference Files

| File | Contents |
| --- | --- |
| references/api-reference.md | Complete method signatures, parameters, extended examples, and troubleshooting |
