Phylogenetics

Overview

Phylogenetic analysis reconstructs the evolutionary history of biological sequences (genes, proteins, genomes) by inferring the branching pattern of descent. This skill covers the standard pipeline:

MAFFT — Multiple sequence alignment
IQ-TREE 2 — Maximum likelihood tree inference with model selection
FastTree — Fast approximate maximum likelihood (for large datasets)
ETE3 — Python library for tree manipulation and visualization

Installation:

Conda (recommended for CLI tools)

conda install -c bioconda mafft iqtree fasttree pip install ete3

When to Use This Skill

Use phylogenetics when:

Evolutionary relationships: Which organism/gene is most closely related to my sequence?
Viral phylodynamics: Trace outbreak spread and estimate transmission dates
Protein family analysis: Infer evolutionary relationships within a gene family
Horizontal gene transfer detection: Identify genes with discordant species/gene trees
Ancestral sequence reconstruction: Infer ancestral protein sequences
Molecular clock analysis: Estimate divergence dates using temporal sampling
GWAS companion: Place variants in evolutionary context (e.g., SARS-CoV-2 variants)
Microbiology: Species phylogeny from 16S rRNA or core genome phylogeny

Standard Workflow

Multiple Sequence Alignment with MAFFT

import subprocess import os

def run_mafft(input_fasta: str, output_fasta: str, method: str = "auto", n_threads: int = 4) -> str: """ Align sequences with MAFFT.

Args:
    input_fasta: Path to unaligned FASTA file
    output_fasta: Path for aligned output
    method: 'auto' (auto-select), 'einsi' (accurate), 'linsi' (accurate, slow),
            'fftnsi' (medium), 'fftns' (fast), 'retree2' (fast)
    n_threads: Number of CPU threads

Returns:
    Path to aligned FASTA file
"""
methods = {
    "auto": ["mafft", "--auto"],
    "einsi": ["mafft", "--genafpair", "--maxiterate", "1000"],
    "linsi": ["mafft", "--localpair", "--maxiterate", "1000"],
    "fftnsi": ["mafft", "--fftnsi"],
    "fftns": ["mafft", "--fftns"],
    "retree2": ["mafft", "--retree", "2"],
}

cmd = methods.get(method, methods["auto"])
cmd += ["--thread", str(n_threads), "--inputorder", input_fasta]

with open(output_fasta, 'w') as out:
    result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)

if result.returncode != 0:
    raise RuntimeError(f"MAFFT failed:\n{result.stderr}")

# Count aligned sequences
with open(output_fasta) as f:
    n_seqs = sum(1 for line in f if line.startswith('>'))
print(f"MAFFT: aligned {n_seqs} sequences → {output_fasta}")

return output_fasta

MAFFT method selection guide:

Few sequences (<200), accurate: linsi or einsi

Many sequences (<1000), moderate: fftnsi

Large datasets (>1000): fftns or auto

Ultra-fast (>10000): mafft --retree 1

Trim Alignment (Optional but Recommended)

def trim_alignment_trimal(aligned_fasta: str, output_fasta: str, method: str = "automated1") -> str: """ Trim poorly aligned columns with TrimAl.

Methods:
- 'automated1': Automatic heuristic (recommended)
- 'gappyout': Remove gappy columns
- 'strict': Strict gap threshold
"""
cmd = ["trimal", f"-{method}", "-in", aligned_fasta, "-out", output_fasta, "-fasta"]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    print(f"TrimAl warning: {result.stderr}")
    # Fall back to using the untrimmed alignment
    import shutil
    shutil.copy(aligned_fasta, output_fasta)
return output_fasta

3. IQ-TREE 2 — Maximum Likelihood Tree

def run_iqtree(aligned_fasta: str, output_prefix: str, model: str = "TEST", bootstrap: int = 1000, n_threads: int = 4, extra_args: list = None) -> dict: """ Build a maximum likelihood tree with IQ-TREE 2.

Args:
    aligned_fasta: Aligned FASTA file
    output_prefix: Prefix for output files
    model: 'TEST' for automatic model selection, or specify (e.g., 'GTR+G' for DNA,
           'LG+G4' for proteins, 'JTT+G' for proteins)
    bootstrap: Number of ultrafast bootstrap replicates (1000 recommended)
    n_threads: Number of threads ('AUTO' to auto-detect)
    extra_args: Additional IQ-TREE arguments

Returns:
    Dict with paths to output files
"""
cmd = [
    "iqtree2",
    "-s", aligned_fasta,
    "--prefix", output_prefix,
    "-m", model,
    "-B", str(bootstrap),   # Ultrafast bootstrap
    "-T", str(n_threads),
    "--redo"                # Overwrite existing results
]

if extra_args:
    cmd.extend(extra_args)

result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode != 0:
    raise RuntimeError(f"IQ-TREE failed:\n{result.stderr}")

# Print model selection result
log_file = f"{output_prefix}.log"
if os.path.exists(log_file):
    with open(log_file) as f:
        for line in f:
            if "Best-fit model" in line:
                print(f"IQ-TREE: {line.strip()}")

output_files = {
    "tree": f"{output_prefix}.treefile",
    "log": f"{output_prefix}.log",
    "iqtree": f"{output_prefix}.iqtree",  # Full report
    "model": f"{output_prefix}.model.gz",
}

print(f"IQ-TREE: Tree saved to {output_files['tree']}")
return output_files

IQ-TREE model selection guide:

DNA: TEST → GTR+G, HKY+G, TrN+G

Protein: TEST → LG+G4, WAG+G, JTT+G, Q.pfam+G

Codon: TEST → MG+F3X4

For temporal (molecular clock) analysis, add:

extra_args = ["--date", "dates.txt", "--clock-test", "--date-CI", "95"]

FastTree — Fast Approximate ML

For large datasets (>1000 sequences) where IQ-TREE is too slow:

def run_fasttree(aligned_fasta: str, output_tree: str, sequence_type: str = "nt", model: str = "gtr", n_threads: int = 4) -> str: """ Build a fast approximate ML tree with FastTree.

Args:
    sequence_type: 'nt' for nucleotide or 'aa' for amino acid
    model: For nt: 'gtr' (recommended) or 'jc'; for aa: 'lg', 'wag', 'jtt'
"""
if sequence_type == "nt":
    cmd = ["FastTree", "-nt", "-gtr"]
else:
    cmd = ["FastTree", f"-{model}"]

cmd += [aligned_fasta]

with open(output_tree, 'w') as out:
    result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)

if result.returncode != 0:
    raise RuntimeError(f"FastTree failed:\n{result.stderr}")

print(f"FastTree: Tree saved to {output_tree}")
return output_tree

5. Tree Analysis and Visualization with ETE3

from ete3 import Tree, TreeStyle, NodeStyle, TextFace, PhyloTree import matplotlib.pyplot as plt

def load_tree(tree_file: str) -> Tree: """Load a Newick tree file.""" t = Tree(tree_file) print(f"Tree: {len(t)} leaves, {len(list(t.traverse()))} nodes") return t

def basic_tree_stats(t: Tree) -> dict: """Compute basic tree statistics.""" leaves = t.get_leaves() distances = [t.get_distance(l1, l2) for l1 in leaves[:min(50, len(leaves))] for l2 in leaves[:min(50, len(leaves))] if l1 != l2]

stats = {
    "n_leaves": len(leaves),
    "n_internal_nodes": len(t) - len(leaves),
    "total_branch_length": sum(n.dist for n in t.traverse()),
    "max_leaf_distance": max(distances) if distances else 0,
    "mean_leaf_distance": sum(distances)/len(distances) if distances else 0,
}
return stats

def find_mrca(t: Tree, leaf_names: list) -> Tree: """Find the most recent common ancestor of a set of leaves.""" return t.get_common_ancestor(*leaf_names)

def visualize_tree(t: Tree, output_file: str = "tree.png", show_branch_support: bool = True, color_groups: dict = None, width: int = 800) -> None: """ Render phylogenetic tree to image.

Args:
    t: ETE3 Tree object
    color_groups: Dict mapping leaf_name → color (for coloring taxa)
    show_branch_support: Show bootstrap values
"""
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_support = show_branch_support
ts.mode = "r"  # 'r' = rectangular, 'c' = circular

if color_groups:
    for node in t.traverse():
        if node.is_leaf() and node.name in color_groups:
            nstyle = NodeStyle()
            nstyle["fgcolor"] = color_groups[node.name]
            nstyle["size"] = 8
            node.set_style(nstyle)

t.render(output_file, tree_style=ts, w=width, units="px")
print(f"Tree saved to: {output_file}")

def midpoint_root(t: Tree) -> Tree: """Root tree at midpoint (use when outgroup unknown).""" t.set_outgroup(t.get_midpoint_outgroup()) return t

def prune_tree(t: Tree, keep_leaves: list) -> Tree: """Prune tree to keep only specified leaves.""" t.prune(keep_leaves, preserve_branch_length=True) return t

Complete Analysis Script

import subprocess, os from ete3 import Tree

def full_phylogenetic_analysis( input_fasta: str, output_dir: str = "phylo_results", sequence_type: str = "nt", n_threads: int = 4, bootstrap: int = 1000, use_fasttree: bool = False ) -> dict: """ Complete phylogenetic pipeline: align → trim → tree → visualize.

Args:
    input_fasta: Unaligned FASTA
    sequence_type: 'nt' (nucleotide) or 'aa' (amino acid/protein)
    use_fasttree: Use FastTree instead of IQ-TREE (faster for large datasets)
"""
os.makedirs(output_dir, exist_ok=True)
prefix = os.path.join(output_dir, "phylo")

print("=" * 50)
print("Step 1: Multiple Sequence Alignment (MAFFT)")
aligned = run_mafft(input_fasta, f"{prefix}_aligned.fasta",
                     method="auto", n_threads=n_threads)

print("\nStep 2: Tree Inference")
if use_fasttree:
    tree_file = run_fasttree(
        aligned, f"{prefix}.tree",
        sequence_type=sequence_type,
        model="gtr" if sequence_type == "nt" else "lg"
    )
else:
    model = "TEST" if sequence_type == "nt" else "TEST"
    iqtree_files = run_iqtree(
        aligned, prefix,
        model=model,
        bootstrap=bootstrap,
        n_threads=n_threads
    )
    tree_file = iqtree_files["tree"]

print("\nStep 3: Tree Analysis")
t = Tree(tree_file)
t = midpoint_root(t)

stats = basic_tree_stats(t)
print(f"Tree statistics: {stats}")

print("\nStep 4: Visualization")
visualize_tree(t, f"{prefix}_tree.png", show_branch_support=True)

# Save rooted tree
rooted_tree_file = f"{prefix}_rooted.nwk"
t.write(format=1, outfile=rooted_tree_file)

results = {
    "aligned_fasta": aligned,
    "tree_file": tree_file,
    "rooted_tree": rooted_tree_file,
    "visualization": f"{prefix}_tree.png",
    "stats": stats
}

print("\n" + "=" * 50)
print("Phylogenetic analysis complete!")
print(f"Results in: {output_dir}/")
return results

IQ-TREE Model Guide

DNA Models

Model Description Use case

GTR+G4

General Time Reversible + Gamma Most flexible DNA model

HKY+G4

Hasegawa-Kishino-Yano + Gamma Two-rate model (common)

TrN+G4

Tamura-Nei Unequal transitions

Jukes-Cantor Simplest; all rates equal

Protein Models

Model Description Use case

LG+G4

Le-Gascuel + Gamma Best average protein model

WAG+G4

Whelan-Goldman Widely used

JTT+G4

Jones-Taylor-Thornton Classical model

Q.pfam+G4

pfam-trained For Pfam-like protein families

Q.bird+G4

Bird-specific Vertebrate proteins

Tip: Use -m TEST to let IQ-TREE automatically select the best model.

Best Practices

Alignment quality first: Poor alignment → unreliable trees; check alignment manually
Use linsi for small (<200 seq), fftns or auto for large alignments
Model selection: Always use -m TEST for IQ-TREE unless you have a specific reason
Bootstrap: Use ≥1000 ultrafast bootstraps (-B 1000 ) for branch support
Root the tree: Unrooted trees can be misleading; use outgroup or midpoint rooting
FastTree for >5000 sequences: IQ-TREE becomes slow; FastTree is 10–100× faster
Trim long alignments: TrimAl removes unreliable columns; improves tree accuracy
Check for recombination in viral/bacterial sequences before building trees (RDP4 , GARD )

Additional Resources

MAFFT: https://mafft.cbrc.jp/alignment/software/
IQ-TREE 2: http://www.iqtree.org/ | Tutorial: https://www.iqtree.org/workshop/molevol2022
FastTree: http://www.microbesonline.org/fasttree/
ETE3: http://etetoolkit.org/
FigTree (GUI visualization): https://tree.bio.ed.ac.uk/software/figtree/
iTOL (web visualization): https://itol.embl.de/
MUSCLE (alternative aligner): https://www.drive5.com/muscle/
TrimAl (alignment trimming): https://vicfero.github.io/trimal/

phylogenetics

Safety Notice

Copy this and send it to your AI assistant to learn

Conda (recommended for CLI tools)

MAFFT method selection guide:

Few sequences (<200), accurate: linsi or einsi

Many sequences (<1000), moderate: fftnsi

Large datasets (>1000): fftns or auto

Ultra-fast (>10000): mafft --retree 1

IQ-TREE model selection guide:

DNA: TEST → GTR+G, HKY+G, TrN+G

Protein: TEST → LG+G4, WAG+G, JTT+G, Q.pfam+G

Codon: TEST → MG+F3X4

For temporal (molecular clock) analysis, add:

extra_args = ["--date", "dates.txt", "--clock-test", "--date-CI", "95"]

Source Transparency

Related Skills

research-lookup

statistical-analysis

market-research-reports

research-grants