bio-population-genetics-association-testing

GWAS analysis using PLINK 2.0's unified --glm command for case-control and quantitative traits.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "bio-population-genetics-association-testing" with this command: npx skills add gptomics/bioskills/gptomics-bioskills-bio-population-genetics-association-testing

Association Testing

GWAS analysis using PLINK 2.0's unified --glm command for case-control and quantitative traits.

PLINK 2.0 Association Testing

Basic Case-Control (Binary Phenotype)

Basic logistic regression

plink2 --bfile data --glm --out results

With phenotype file

plink2 --bfile data --pheno pheno.txt --glm --out results

Quantitative Trait (Continuous Phenotype)

Linear regression for quantitative traits

plink2 --bfile data --pheno pheno.txt --glm --out results

With Covariates

Include covariates (sex, age, PCs)

plink2 --bfile data
--pheno pheno.txt
--covar covariates.txt
--glm --out results

Specify which covariates to use

plink2 --bfile data
--pheno pheno.txt
--covar covariates.txt
--covar-name PC1,PC2,PC3,age,sex
--glm --out results

Covariate Files

Phenotype File Format

pheno.txt: FID IID pheno

For binary: 1=control, 2=case, -9=missing

For quantitative: continuous values

FAM001 IND001 2 FAM002 IND002 1 FAM003 IND003 1.5

Covariate File Format

covariates.txt: FID IID cov1 cov2 ...

FAM001 IND001 0.15 35 1 FAM002 IND002 -0.22 42 2 FAM003 IND003 0.08 28 1

GLM Options

Phenotype Handling

Multiple phenotypes (test all)

plink2 --bfile data --pheno pheno_multi.txt --glm --out results

Specific phenotype column

plink2 --bfile data --pheno pheno_multi.txt --pheno-name trait1 --glm --out results

Missing phenotype handling

plink2 --bfile data --glm allow-no-covars --out results

Model Options

Additive model (default)

plink2 --bfile data --glm --out results

Dominant model

plink2 --bfile data --glm dominant --out results

Recessive model

plink2 --bfile data --glm recessive --out results

Genotypic (2df test)

plink2 --bfile data --glm genotypic --out results

Hide covariates from output (cleaner output)

plink2 --bfile data --covar cov.txt --glm hide-covar --out results

Firth Regression (Rare Variants)

Enable Firth fallback for case-control (default in PLINK 2.0)

plink2 --bfile data --glm firth-fallback --out results

Force Firth regression

plink2 --bfile data --glm firth --out results

Disable Firth

plink2 --bfile data --glm no-firth --out results

Output Format

Output Columns

Default output: results.PHENO1.glm.logistic or results.PHENO1.glm.linear

Columns: CHROM, POS, ID, REF, ALT, A1, FIRTH?, TEST, OBS_CT, OR/BETA, SE, Tstat, P

Custom Output Columns

Add specific columns

plink2 --bfile data --glm cols=+a1freq,+machr2 --out results

Available columns:

+a1freq: A1 allele frequency

+machr2: MaCH R-squared

+ax: Reference allele dosage

+err: Standard errors

Population Stratification Control

Include Principal Components

1. Run PCA

plink2 --bfile data --pca 10 --out pca_results

2. Use PCs as covariates

plink2 --bfile data
--pheno pheno.txt
--covar pca_results.eigenvec
--covar-name PC1,PC2,PC3,PC4,PC5
--glm --out results

Combined Workflow

QC, PCA, and GWAS in sequence

plink2 --bfile raw --maf 0.01 --geno 0.05 --hwe 1e-6 --make-bed --out qc plink2 --bfile qc --pca 10 --out pca plink2 --bfile qc
--pheno pheno.txt
--covar pca.eigenvec
--covar-name PC1-PC5
--glm hide-covar --out gwas

Result Filtering

Command Line Filtering

Filter significant results

awk 'NR==1 || $13 < 5e-8' results.PHENO1.glm.logistic > significant.txt

Extract top hits

sort -k13 -g results.PHENO1.glm.logistic | head -100 > top_hits.txt

Python Analysis

import pandas as pd

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')

significant = results[results['P'] < 5e-8] print(f'Genome-wide significant hits: {len(significant)}')

suggestive = results[results['P'] < 1e-5] print(f'Suggestive hits: {len(suggestive)}')

Visualization

Manhattan Plot (Python)

import pandas as pd import matplotlib.pyplot as plt import numpy as np

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t') results = results[results['TEST'] == 'ADD'] results['-log10P'] = -np.log10(results['P'])

chrom_colors = ['#1f77b4', '#ff7f0e'] results['color'] = results['#CHROM'].apply(lambda x: chrom_colors[x % 2])

cumulative_pos = [] offset = 0 for chrom in sorted(results['#CHROM'].unique()): chrom_data = results[results['#CHROM'] == chrom] cumulative_pos.extend(chrom_data['POS'] + offset) offset += chrom_data['POS'].max()

results['cumulative_pos'] = cumulative_pos

plt.figure(figsize=(14, 6)) plt.scatter(results['cumulative_pos'], results['-log10P'], c=results['color'], s=1) plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--', label='Genome-wide (5e-8)') plt.axhline(y=-np.log10(1e-5), color='blue', linestyle='--', label='Suggestive (1e-5)') plt.xlabel('Chromosome') plt.ylabel('-log10(P)') plt.legend() plt.savefig('manhattan.png', dpi=150)

QQ Plot (Python)

import pandas as pd import numpy as np import matplotlib.pyplot as plt from scipy import stats

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t') observed_p = results[results['TEST'] == 'ADD']['P'].dropna().sort_values()

n = len(observed_p) expected_p = np.arange(1, n + 1) / (n + 1)

plt.figure(figsize=(6, 6)) plt.scatter(-np.log10(expected_p), -np.log10(observed_p), s=1) plt.plot([0, 8], [0, 8], 'r--') plt.xlabel('Expected -log10(P)') plt.ylabel('Observed -log10(P)')

lambda_gc = np.median(stats.chi2.ppf(1 - observed_p, 1)) / stats.chi2.ppf(0.5, 1) plt.title(f'QQ Plot (λ = {lambda_gc:.3f})') plt.savefig('qqplot.png', dpi=150)

Genomic Inflation

from scipy import stats import numpy as np

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t') pvalues = results[results['TEST'] == 'ADD']['P'].dropna()

chisq = stats.chi2.ppf(1 - pvalues, 1) lambda_gc = np.median(chisq) / stats.chi2.ppf(0.5, 1) print(f'Genomic inflation factor: {lambda_gc:.3f}')

Good: 1.0-1.05, Acceptable: 1.05-1.1, Concerning: >1.1

Related Skills

  • plink-basics - Data preparation and QC

  • population-structure - PCA for stratification control

  • linkage-disequilibrium - LD pruning before analysis

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Research

bio-microbiome-diversity-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

bio-proteomics-dia-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

bio-metabolomics-statistical-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

bio-restriction-fragment-analysis

No summary provided by upstream source.

Repository SourceNeeds Review