Association Testing
GWAS analysis using PLINK 2.0's unified --glm command for case-control and quantitative traits.
PLINK 2.0 Association Testing
Basic Case-Control (Binary Phenotype)
Basic logistic regression
plink2 --bfile data --glm --out results
With phenotype file
plink2 --bfile data --pheno pheno.txt --glm --out results
Quantitative Trait (Continuous Phenotype)
Linear regression for quantitative traits
plink2 --bfile data --pheno pheno.txt --glm --out results
With Covariates
Include covariates (sex, age, PCs)
plink2 --bfile data
--pheno pheno.txt
--covar covariates.txt
--glm --out results
Specify which covariates to use
plink2 --bfile data
--pheno pheno.txt
--covar covariates.txt
--covar-name PC1,PC2,PC3,age,sex
--glm --out results
Covariate Files
Phenotype File Format
pheno.txt: FID IID pheno
For binary: 1=control, 2=case, -9=missing
For quantitative: continuous values
FAM001 IND001 2 FAM002 IND002 1 FAM003 IND003 1.5
Covariate File Format
covariates.txt: FID IID cov1 cov2 ...
FAM001 IND001 0.15 35 1 FAM002 IND002 -0.22 42 2 FAM003 IND003 0.08 28 1
GLM Options
Phenotype Handling
Multiple phenotypes (test all)
plink2 --bfile data --pheno pheno_multi.txt --glm --out results
Specific phenotype column
plink2 --bfile data --pheno pheno_multi.txt --pheno-name trait1 --glm --out results
Missing phenotype handling
plink2 --bfile data --glm allow-no-covars --out results
Model Options
Additive model (default)
plink2 --bfile data --glm --out results
Dominant model
plink2 --bfile data --glm dominant --out results
Recessive model
plink2 --bfile data --glm recessive --out results
Genotypic (2df test)
plink2 --bfile data --glm genotypic --out results
Hide covariates from output (cleaner output)
plink2 --bfile data --covar cov.txt --glm hide-covar --out results
Firth Regression (Rare Variants)
Enable Firth fallback for case-control (default in PLINK 2.0)
plink2 --bfile data --glm firth-fallback --out results
Force Firth regression
plink2 --bfile data --glm firth --out results
Disable Firth
plink2 --bfile data --glm no-firth --out results
Output Format
Output Columns
Default output: results.PHENO1.glm.logistic or results.PHENO1.glm.linear
Columns: CHROM, POS, ID, REF, ALT, A1, FIRTH?, TEST, OBS_CT, OR/BETA, SE, Tstat, P
Custom Output Columns
Add specific columns
plink2 --bfile data --glm cols=+a1freq,+machr2 --out results
Available columns:
+a1freq: A1 allele frequency
+machr2: MaCH R-squared
+ax: Reference allele dosage
+err: Standard errors
Population Stratification Control
Include Principal Components
1. Run PCA
plink2 --bfile data --pca 10 --out pca_results
2. Use PCs as covariates
plink2 --bfile data
--pheno pheno.txt
--covar pca_results.eigenvec
--covar-name PC1,PC2,PC3,PC4,PC5
--glm --out results
Combined Workflow
QC, PCA, and GWAS in sequence
plink2 --bfile raw --maf 0.01 --geno 0.05 --hwe 1e-6 --make-bed --out qc
plink2 --bfile qc --pca 10 --out pca
plink2 --bfile qc
--pheno pheno.txt
--covar pca.eigenvec
--covar-name PC1-PC5
--glm hide-covar --out gwas
Result Filtering
Command Line Filtering
Filter significant results
awk 'NR==1 || $13 < 5e-8' results.PHENO1.glm.logistic > significant.txt
Extract top hits
sort -k13 -g results.PHENO1.glm.logistic | head -100 > top_hits.txt
Python Analysis
import pandas as pd
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
significant = results[results['P'] < 5e-8] print(f'Genome-wide significant hits: {len(significant)}')
suggestive = results[results['P'] < 1e-5] print(f'Suggestive hits: {len(suggestive)}')
Visualization
Manhattan Plot (Python)
import pandas as pd import matplotlib.pyplot as plt import numpy as np
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t') results = results[results['TEST'] == 'ADD'] results['-log10P'] = -np.log10(results['P'])
chrom_colors = ['#1f77b4', '#ff7f0e'] results['color'] = results['#CHROM'].apply(lambda x: chrom_colors[x % 2])
cumulative_pos = [] offset = 0 for chrom in sorted(results['#CHROM'].unique()): chrom_data = results[results['#CHROM'] == chrom] cumulative_pos.extend(chrom_data['POS'] + offset) offset += chrom_data['POS'].max()
results['cumulative_pos'] = cumulative_pos
plt.figure(figsize=(14, 6)) plt.scatter(results['cumulative_pos'], results['-log10P'], c=results['color'], s=1) plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--', label='Genome-wide (5e-8)') plt.axhline(y=-np.log10(1e-5), color='blue', linestyle='--', label='Suggestive (1e-5)') plt.xlabel('Chromosome') plt.ylabel('-log10(P)') plt.legend() plt.savefig('manhattan.png', dpi=150)
QQ Plot (Python)
import pandas as pd import numpy as np import matplotlib.pyplot as plt from scipy import stats
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t') observed_p = results[results['TEST'] == 'ADD']['P'].dropna().sort_values()
n = len(observed_p) expected_p = np.arange(1, n + 1) / (n + 1)
plt.figure(figsize=(6, 6)) plt.scatter(-np.log10(expected_p), -np.log10(observed_p), s=1) plt.plot([0, 8], [0, 8], 'r--') plt.xlabel('Expected -log10(P)') plt.ylabel('Observed -log10(P)')
lambda_gc = np.median(stats.chi2.ppf(1 - observed_p, 1)) / stats.chi2.ppf(0.5, 1) plt.title(f'QQ Plot (λ = {lambda_gc:.3f})') plt.savefig('qqplot.png', dpi=150)
Genomic Inflation
from scipy import stats import numpy as np
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t') pvalues = results[results['TEST'] == 'ADD']['P'].dropna()
chisq = stats.chi2.ppf(1 - pvalues, 1) lambda_gc = np.median(chisq) / stats.chi2.ppf(0.5, 1) print(f'Genomic inflation factor: {lambda_gc:.3f}')
Good: 1.0-1.05, Acceptable: 1.05-1.1, Concerning: >1.1
Related Skills
-
plink-basics - Data preparation and QC
-
population-structure - PCA for stratification control
-
linkage-disequilibrium - LD pruning before analysis