Linkage Disequilibrium

Calculate LD statistics, prune correlated variants, and identify haplotype blocks.

PLINK LD Calculations

Pairwise r²

All pairs within window

plink2 --bfile data --r2 --ld-window-kb 1000 --ld-window-r2 0.2 --out ld_results

With SNP names in output

plink2 --bfile data --r2 inter-chr --ld-window-r2 0 --out all_pairs

Squared correlation matrix

plink2 --bfile data --r2-phased square --out ld_matrix

Output Format

ld_results.ld contains:

CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2

PLINK 1.9 Options

r² with D' statistics

plink --bfile data --r2 dprime --ld-window-kb 500 --out ld_with_dprime

Inter-chromosome LD

plink --bfile data --r2 inter-chr --ld-snp-list target_snps.txt --out target_ld

LD Pruning

Standard Pruning

Calculate pruning list

plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune

Output files:

prune.prune.in - Variants to keep

prune.prune.out - Variants to remove

Extract pruned set

plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned

Pruning Parameters

Parameter Description Common Values

Window (50) Variants per window 50-200

Step (10) Variants to shift 5-50

r² threshold (0.1) Max LD allowed 0.1-0.5

Use Cases

Strict pruning for PCA/Admixture

plink2 --bfile data --indep-pairwise 50 10 0.1 --out strict_prune

Moderate pruning for polygenic scores

plink2 --bfile data --indep-pairwise 200 50 0.5 --out moderate_prune

Region-based pruning

plink2 --bfile data --indep-pairwise 50 10 0.2 --chr 6 --from-mb 25 --to-mb 35 --out mhc_prune

scikit-allel LD

Pairwise r²

import allel import numpy as np

callset = allel.read_vcf('data.vcf.gz') gt = allel.GenotypeArray(callset['calldata/GT']) pos = callset['variants/POS']

gn = gt.to_n_alt()

r2 = allel.rogers_huff_r(gn[:100]) ** 2

LD Decay

import allel import numpy as np import matplotlib.pyplot as plt

gn = gt.to_n_alt()

r2, dist = [], [] n_variants = min(1000, gn.shape[0])

for i in range(n_variants): for j in range(i + 1, min(i + 100, n_variants)): r = allel.rogers_huff_r(gn[[i, j]])[0, 1] ** 2 d = pos[j] - pos[i] r2.append(r) dist.append(d)

r2 = np.array(r2) dist = np.array(dist)

bins = np.arange(0, 100001, 1000) bin_means = [] for i in range(len(bins) - 1): mask = (dist >= bins[i]) & (dist < bins[i + 1]) if mask.sum() > 0: bin_means.append(np.mean(r2[mask])) else: bin_means.append(np.nan)

plt.figure(figsize=(10, 6)) plt.plot(bins[:-1] / 1000, bin_means) plt.xlabel('Distance (kb)') plt.ylabel('Mean r²') plt.title('LD Decay') plt.savefig('ld_decay.png')

Haplotype Blocks

PLINK

Identify haplotype blocks (Gabriel et al.)

plink --bfile data --blocks no-pheno-req --out blocks

Output: blocks.blocks (block boundaries)

Output: blocks.blocks.det (block details)

Block Statistics

import pandas as pd

blocks = pd.read_csv('blocks.blocks.det', sep='\s+')

print(f'Number of blocks: {len(blocks)}') print(f'Mean block size: {blocks["KB"].mean():.1f} kb') print(f'Mean SNPs per block: {blocks["NSNPS"].mean():.1f}')

LD Matrix Visualization

import allel import numpy as np import matplotlib.pyplot as plt

gn = gt.to_n_alt()[:200]

r = allel.rogers_huff_r(gn) r2_matrix = r ** 2

plt.figure(figsize=(10, 10)) plt.imshow(r2_matrix, cmap='hot', vmin=0, vmax=1) plt.colorbar(label='r²') plt.xlabel('Variant index') plt.ylabel('Variant index') plt.title('LD Matrix') plt.savefig('ld_matrix.png', dpi=150)

LD-based Clumping (GWAS)

Clump GWAS results by LD

plink --bfile data
--clump gwas_results.txt
--clump-p1 5e-8
--clump-p2 1e-5
--clump-r2 0.1
--clump-kb 250
--out clumped

Output: clumped.clumped (independent signals)

Clump Parameters

Parameter Description

--clump-p1 Index SNP p-value threshold

--clump-p2 Clumped SNP p-value threshold

--clump-r2 LD threshold for clumping

--clump-kb Physical distance threshold

vcftools LD

Pairwise LD for region

vcftools --vcf data.vcf --geno-r2 --ld-window-bp 100000 --out ld_results

Output: ld_results.geno.ld

Haplotype-based r²

vcftools --vcf data.vcf --hap-r2 --ld-window-bp 100000 --out hap_ld

Complete Workflow

1. Calculate genome-wide LD

plink2 --bfile data --r2 --ld-window-kb 500 --ld-window-r2 0.2 --out ld_genome

2. Generate pruned set for PCA

plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune plink2 --bfile data --extract prune.prune.in --make-bed --out pruned

3. Identify haplotype blocks