Linkage Disequilibrium
Calculate LD statistics, prune correlated variants, and identify haplotype blocks.
PLINK LD Calculations
Pairwise r²
All pairs within window
plink2 --bfile data --r2 --ld-window-kb 1000 --ld-window-r2 0.2 --out ld_results
With SNP names in output
plink2 --bfile data --r2 inter-chr --ld-window-r2 0 --out all_pairs
Squared correlation matrix
plink2 --bfile data --r2-phased square --out ld_matrix
Output Format
ld_results.ld contains:
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
PLINK 1.9 Options
r² with D' statistics
plink --bfile data --r2 dprime --ld-window-kb 500 --out ld_with_dprime
Inter-chromosome LD
plink --bfile data --r2 inter-chr --ld-snp-list target_snps.txt --out target_ld
LD Pruning
Standard Pruning
Calculate pruning list
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune
Output files:
prune.prune.in - Variants to keep
prune.prune.out - Variants to remove
Extract pruned set
plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned
Pruning Parameters
Parameter Description Common Values
Window (50) Variants per window 50-200
Step (10) Variants to shift 5-50
r² threshold (0.1) Max LD allowed 0.1-0.5
Use Cases
Strict pruning for PCA/Admixture
plink2 --bfile data --indep-pairwise 50 10 0.1 --out strict_prune
Moderate pruning for polygenic scores
plink2 --bfile data --indep-pairwise 200 50 0.5 --out moderate_prune
Region-based pruning
plink2 --bfile data --indep-pairwise 50 10 0.2 --chr 6 --from-mb 25 --to-mb 35 --out mhc_prune
scikit-allel LD
Pairwise r²
import allel import numpy as np
callset = allel.read_vcf('data.vcf.gz') gt = allel.GenotypeArray(callset['calldata/GT']) pos = callset['variants/POS']
gn = gt.to_n_alt()
r2 = allel.rogers_huff_r(gn[:100]) ** 2
LD Decay
import allel import numpy as np import matplotlib.pyplot as plt
gn = gt.to_n_alt()
r2, dist = [], [] n_variants = min(1000, gn.shape[0])
for i in range(n_variants): for j in range(i + 1, min(i + 100, n_variants)): r = allel.rogers_huff_r(gn[[i, j]])[0, 1] ** 2 d = pos[j] - pos[i] r2.append(r) dist.append(d)
r2 = np.array(r2) dist = np.array(dist)
bins = np.arange(0, 100001, 1000) bin_means = [] for i in range(len(bins) - 1): mask = (dist >= bins[i]) & (dist < bins[i + 1]) if mask.sum() > 0: bin_means.append(np.mean(r2[mask])) else: bin_means.append(np.nan)
plt.figure(figsize=(10, 6)) plt.plot(bins[:-1] / 1000, bin_means) plt.xlabel('Distance (kb)') plt.ylabel('Mean r²') plt.title('LD Decay') plt.savefig('ld_decay.png')
Haplotype Blocks
PLINK
Identify haplotype blocks (Gabriel et al.)
plink --bfile data --blocks no-pheno-req --out blocks
Output: blocks.blocks (block boundaries)
Output: blocks.blocks.det (block details)
Block Statistics
import pandas as pd
blocks = pd.read_csv('blocks.blocks.det', sep='\s+')
print(f'Number of blocks: {len(blocks)}') print(f'Mean block size: {blocks["KB"].mean():.1f} kb') print(f'Mean SNPs per block: {blocks["NSNPS"].mean():.1f}')
LD Matrix Visualization
import allel import numpy as np import matplotlib.pyplot as plt
gn = gt.to_n_alt()[:200]
r = allel.rogers_huff_r(gn) r2_matrix = r ** 2
plt.figure(figsize=(10, 10)) plt.imshow(r2_matrix, cmap='hot', vmin=0, vmax=1) plt.colorbar(label='r²') plt.xlabel('Variant index') plt.ylabel('Variant index') plt.title('LD Matrix') plt.savefig('ld_matrix.png', dpi=150)
LD-based Clumping (GWAS)
Clump GWAS results by LD
plink --bfile data
--clump gwas_results.txt
--clump-p1 5e-8
--clump-p2 1e-5
--clump-r2 0.1
--clump-kb 250
--out clumped
Output: clumped.clumped (independent signals)
Clump Parameters
Parameter Description
--clump-p1 Index SNP p-value threshold
--clump-p2 Clumped SNP p-value threshold
--clump-r2 LD threshold for clumping
--clump-kb Physical distance threshold
vcftools LD
Pairwise LD for region
vcftools --vcf data.vcf --geno-r2 --ld-window-bp 100000 --out ld_results
Output: ld_results.geno.ld
Haplotype-based r²
vcftools --vcf data.vcf --hap-r2 --ld-window-bp 100000 --out hap_ld
Complete Workflow
1. Calculate genome-wide LD
plink2 --bfile data --r2 --ld-window-kb 500 --ld-window-r2 0.2 --out ld_genome
2. Generate pruned set for PCA
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune plink2 --bfile data --extract prune.prune.in --make-bed --out pruned
3. Identify haplotype blocks
plink --bfile data --blocks no-pheno-req --out blocks
4. Visualize LD for specific region
plink --bfile data --r2 dprime --chr 6 --from-mb 28 --to-mb 34 --out mhc_ld
Related Skills
-
plink-basics - File format handling
-
population-structure - Use pruned data for PCA
-
association-testing - LD clumping for GWAS
-
selection-statistics - LD affects EHH statistics