phylogenetics

Phylogenetic analysis reconstructs the evolutionary history of biological sequences (genes, proteins, genomes) by inferring the branching pattern of descent. This skill covers the standard pipeline:

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "phylogenetics" with this command: npx skills add k-dense-ai/claude-scientific-skills/k-dense-ai-claude-scientific-skills-phylogenetics

Phylogenetics

Overview

Phylogenetic analysis reconstructs the evolutionary history of biological sequences (genes, proteins, genomes) by inferring the branching pattern of descent. This skill covers the standard pipeline:

  • MAFFT — Multiple sequence alignment

  • IQ-TREE 2 — Maximum likelihood tree inference with model selection

  • FastTree — Fast approximate maximum likelihood (for large datasets)

  • ETE3 — Python library for tree manipulation and visualization

Installation:

Conda (recommended for CLI tools)

conda install -c bioconda mafft iqtree fasttree pip install ete3

When to Use This Skill

Use phylogenetics when:

  • Evolutionary relationships: Which organism/gene is most closely related to my sequence?

  • Viral phylodynamics: Trace outbreak spread and estimate transmission dates

  • Protein family analysis: Infer evolutionary relationships within a gene family

  • Horizontal gene transfer detection: Identify genes with discordant species/gene trees

  • Ancestral sequence reconstruction: Infer ancestral protein sequences

  • Molecular clock analysis: Estimate divergence dates using temporal sampling

  • GWAS companion: Place variants in evolutionary context (e.g., SARS-CoV-2 variants)

  • Microbiology: Species phylogeny from 16S rRNA or core genome phylogeny

Standard Workflow

  1. Multiple Sequence Alignment with MAFFT

import subprocess import os

def run_mafft(input_fasta: str, output_fasta: str, method: str = "auto", n_threads: int = 4) -> str: """ Align sequences with MAFFT.

Args:
    input_fasta: Path to unaligned FASTA file
    output_fasta: Path for aligned output
    method: 'auto' (auto-select), 'einsi' (accurate), 'linsi' (accurate, slow),
            'fftnsi' (medium), 'fftns' (fast), 'retree2' (fast)
    n_threads: Number of CPU threads

Returns:
    Path to aligned FASTA file
"""
methods = {
    "auto": ["mafft", "--auto"],
    "einsi": ["mafft", "--genafpair", "--maxiterate", "1000"],
    "linsi": ["mafft", "--localpair", "--maxiterate", "1000"],
    "fftnsi": ["mafft", "--fftnsi"],
    "fftns": ["mafft", "--fftns"],
    "retree2": ["mafft", "--retree", "2"],
}

cmd = methods.get(method, methods["auto"])
cmd += ["--thread", str(n_threads), "--inputorder", input_fasta]

with open(output_fasta, 'w') as out:
    result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)

if result.returncode != 0:
    raise RuntimeError(f"MAFFT failed:\n{result.stderr}")

# Count aligned sequences
with open(output_fasta) as f:
    n_seqs = sum(1 for line in f if line.startswith('>'))
print(f"MAFFT: aligned {n_seqs} sequences → {output_fasta}")

return output_fasta

MAFFT method selection guide:

Few sequences (<200), accurate: linsi or einsi

Many sequences (<1000), moderate: fftnsi

Large datasets (>1000): fftns or auto

Ultra-fast (>10000): mafft --retree 1

  1. Trim Alignment (Optional but Recommended)

def trim_alignment_trimal(aligned_fasta: str, output_fasta: str, method: str = "automated1") -> str: """ Trim poorly aligned columns with TrimAl.

Methods:
- 'automated1': Automatic heuristic (recommended)
- 'gappyout': Remove gappy columns
- 'strict': Strict gap threshold
"""
cmd = ["trimal", f"-{method}", "-in", aligned_fasta, "-out", output_fasta, "-fasta"]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    print(f"TrimAl warning: {result.stderr}")
    # Fall back to using the untrimmed alignment
    import shutil
    shutil.copy(aligned_fasta, output_fasta)
return output_fasta

3. IQ-TREE 2 — Maximum Likelihood Tree

def run_iqtree(aligned_fasta: str, output_prefix: str, model: str = "TEST", bootstrap: int = 1000, n_threads: int = 4, extra_args: list = None) -> dict: """ Build a maximum likelihood tree with IQ-TREE 2.

Args:
    aligned_fasta: Aligned FASTA file
    output_prefix: Prefix for output files
    model: 'TEST' for automatic model selection, or specify (e.g., 'GTR+G' for DNA,
           'LG+G4' for proteins, 'JTT+G' for proteins)
    bootstrap: Number of ultrafast bootstrap replicates (1000 recommended)
    n_threads: Number of threads ('AUTO' to auto-detect)
    extra_args: Additional IQ-TREE arguments

Returns:
    Dict with paths to output files
"""
cmd = [
    "iqtree2",
    "-s", aligned_fasta,
    "--prefix", output_prefix,
    "-m", model,
    "-B", str(bootstrap),   # Ultrafast bootstrap
    "-T", str(n_threads),
    "--redo"                # Overwrite existing results
]

if extra_args:
    cmd.extend(extra_args)

result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode != 0:
    raise RuntimeError(f"IQ-TREE failed:\n{result.stderr}")

# Print model selection result
log_file = f"{output_prefix}.log"
if os.path.exists(log_file):
    with open(log_file) as f:
        for line in f:
            if "Best-fit model" in line:
                print(f"IQ-TREE: {line.strip()}")

output_files = {
    "tree": f"{output_prefix}.treefile",
    "log": f"{output_prefix}.log",
    "iqtree": f"{output_prefix}.iqtree",  # Full report
    "model": f"{output_prefix}.model.gz",
}

print(f"IQ-TREE: Tree saved to {output_files['tree']}")
return output_files

IQ-TREE model selection guide:

DNA: TEST → GTR+G, HKY+G, TrN+G

Protein: TEST → LG+G4, WAG+G, JTT+G, Q.pfam+G

Codon: TEST → MG+F3X4

For temporal (molecular clock) analysis, add:

extra_args = ["--date", "dates.txt", "--clock-test", "--date-CI", "95"]

  1. FastTree — Fast Approximate ML

For large datasets (>1000 sequences) where IQ-TREE is too slow:

def run_fasttree(aligned_fasta: str, output_tree: str, sequence_type: str = "nt", model: str = "gtr", n_threads: int = 4) -> str: """ Build a fast approximate ML tree with FastTree.

Args:
    sequence_type: 'nt' for nucleotide or 'aa' for amino acid
    model: For nt: 'gtr' (recommended) or 'jc'; for aa: 'lg', 'wag', 'jtt'
"""
if sequence_type == "nt":
    cmd = ["FastTree", "-nt", "-gtr"]
else:
    cmd = ["FastTree", f"-{model}"]

cmd += [aligned_fasta]

with open(output_tree, 'w') as out:
    result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)

if result.returncode != 0:
    raise RuntimeError(f"FastTree failed:\n{result.stderr}")

print(f"FastTree: Tree saved to {output_tree}")
return output_tree

5. Tree Analysis and Visualization with ETE3

from ete3 import Tree, TreeStyle, NodeStyle, TextFace, PhyloTree import matplotlib.pyplot as plt

def load_tree(tree_file: str) -> Tree: """Load a Newick tree file.""" t = Tree(tree_file) print(f"Tree: {len(t)} leaves, {len(list(t.traverse()))} nodes") return t

def basic_tree_stats(t: Tree) -> dict: """Compute basic tree statistics.""" leaves = t.get_leaves() distances = [t.get_distance(l1, l2) for l1 in leaves[:min(50, len(leaves))] for l2 in leaves[:min(50, len(leaves))] if l1 != l2]

stats = {
    "n_leaves": len(leaves),
    "n_internal_nodes": len(t) - len(leaves),
    "total_branch_length": sum(n.dist for n in t.traverse()),
    "max_leaf_distance": max(distances) if distances else 0,
    "mean_leaf_distance": sum(distances)/len(distances) if distances else 0,
}
return stats

def find_mrca(t: Tree, leaf_names: list) -> Tree: """Find the most recent common ancestor of a set of leaves.""" return t.get_common_ancestor(*leaf_names)

def visualize_tree(t: Tree, output_file: str = "tree.png", show_branch_support: bool = True, color_groups: dict = None, width: int = 800) -> None: """ Render phylogenetic tree to image.

Args:
    t: ETE3 Tree object
    color_groups: Dict mapping leaf_name → color (for coloring taxa)
    show_branch_support: Show bootstrap values
"""
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_support = show_branch_support
ts.mode = "r"  # 'r' = rectangular, 'c' = circular

if color_groups:
    for node in t.traverse():
        if node.is_leaf() and node.name in color_groups:
            nstyle = NodeStyle()
            nstyle["fgcolor"] = color_groups[node.name]
            nstyle["size"] = 8
            node.set_style(nstyle)

t.render(output_file, tree_style=ts, w=width, units="px")
print(f"Tree saved to: {output_file}")

def midpoint_root(t: Tree) -> Tree: """Root tree at midpoint (use when outgroup unknown).""" t.set_outgroup(t.get_midpoint_outgroup()) return t

def prune_tree(t: Tree, keep_leaves: list) -> Tree: """Prune tree to keep only specified leaves.""" t.prune(keep_leaves, preserve_branch_length=True) return t

  1. Complete Analysis Script

import subprocess, os from ete3 import Tree

def full_phylogenetic_analysis( input_fasta: str, output_dir: str = "phylo_results", sequence_type: str = "nt", n_threads: int = 4, bootstrap: int = 1000, use_fasttree: bool = False ) -> dict: """ Complete phylogenetic pipeline: align → trim → tree → visualize.

Args:
    input_fasta: Unaligned FASTA
    sequence_type: 'nt' (nucleotide) or 'aa' (amino acid/protein)
    use_fasttree: Use FastTree instead of IQ-TREE (faster for large datasets)
"""
os.makedirs(output_dir, exist_ok=True)
prefix = os.path.join(output_dir, "phylo")

print("=" * 50)
print("Step 1: Multiple Sequence Alignment (MAFFT)")
aligned = run_mafft(input_fasta, f"{prefix}_aligned.fasta",
                     method="auto", n_threads=n_threads)

print("\nStep 2: Tree Inference")
if use_fasttree:
    tree_file = run_fasttree(
        aligned, f"{prefix}.tree",
        sequence_type=sequence_type,
        model="gtr" if sequence_type == "nt" else "lg"
    )
else:
    model = "TEST" if sequence_type == "nt" else "TEST"
    iqtree_files = run_iqtree(
        aligned, prefix,
        model=model,
        bootstrap=bootstrap,
        n_threads=n_threads
    )
    tree_file = iqtree_files["tree"]

print("\nStep 3: Tree Analysis")
t = Tree(tree_file)
t = midpoint_root(t)

stats = basic_tree_stats(t)
print(f"Tree statistics: {stats}")

print("\nStep 4: Visualization")
visualize_tree(t, f"{prefix}_tree.png", show_branch_support=True)

# Save rooted tree
rooted_tree_file = f"{prefix}_rooted.nwk"
t.write(format=1, outfile=rooted_tree_file)

results = {
    "aligned_fasta": aligned,
    "tree_file": tree_file,
    "rooted_tree": rooted_tree_file,
    "visualization": f"{prefix}_tree.png",
    "stats": stats
}

print("\n" + "=" * 50)
print("Phylogenetic analysis complete!")
print(f"Results in: {output_dir}/")
return results

IQ-TREE Model Guide

DNA Models

Model Description Use case

GTR+G4

General Time Reversible + Gamma Most flexible DNA model

HKY+G4

Hasegawa-Kishino-Yano + Gamma Two-rate model (common)

TrN+G4

Tamura-Nei Unequal transitions

JC

Jukes-Cantor Simplest; all rates equal

Protein Models

Model Description Use case

LG+G4

Le-Gascuel + Gamma Best average protein model

WAG+G4

Whelan-Goldman Widely used

JTT+G4

Jones-Taylor-Thornton Classical model

Q.pfam+G4

pfam-trained For Pfam-like protein families

Q.bird+G4

Bird-specific Vertebrate proteins

Tip: Use -m TEST to let IQ-TREE automatically select the best model.

Best Practices

  • Alignment quality first: Poor alignment → unreliable trees; check alignment manually

  • Use linsi for small (<200 seq), fftns or auto for large alignments

  • Model selection: Always use -m TEST for IQ-TREE unless you have a specific reason

  • Bootstrap: Use ≥1000 ultrafast bootstraps (-B 1000 ) for branch support

  • Root the tree: Unrooted trees can be misleading; use outgroup or midpoint rooting

  • FastTree for >5000 sequences: IQ-TREE becomes slow; FastTree is 10–100× faster

  • Trim long alignments: TrimAl removes unreliable columns; improves tree accuracy

  • Check for recombination in viral/bacterial sequences before building trees (RDP4 , GARD )

Additional Resources

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Research

research-lookup

No summary provided by upstream source.

Repository SourceNeeds Review
Research

statistical-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

market-research-reports

No summary provided by upstream source.

Repository SourceNeeds Review
Research

research-grants

No summary provided by upstream source.

Repository SourceNeeds Review