Advanced GWAS: LD Pruning, Mixed Models, Fine-mapping

GWAS Concepts and Design

A genome-wide association study (GWAS) tests whether each SNP in the genome is statistically associated with a phenotype of interest. It is essentially a massive regression: for each SNP, regress phenotype on allele count (0, 1, or 2 copies of the minor allele).

Component	What it is	Key decision
Genotype matrix	N individuals x M SNPs; each cell = 0, 1, or 2 copies of minor allele	SNP QC: MAF, missing rate, HWE; LD pruning for PCA
Phenotype	Quantitative (height, yield) or binary (case/control)	Normalise continuous traits; use logistic regression for binary
Covariates	Variables that confound the SNP-phenotype association	Always include: PC1-PC5 (population structure) + any known batch effects
Association model	Regression of phenotype on genotype, corrected for covariates	Linear mixed model (LMM) for structured populations; logistic regression for case/control
Multiple testing	M tests performed; Bonferroni threshold = 0.05/M	Bonferroni for genome-wide; FDR 5% for discovery with replication

Step 1 — GWAS Quality Control and LD Pruning

Bash -- PLINK2 GWAS quality control pipeline

## ================================================================
## GWAS QC with PLINK2
## Dataset: Rice 3K SNP array or WGS VCF
## ================================================================

VCF="rice3k_snps.vcf.gz"

## Step 1: Convert VCF to PLINK binary format
plink2 --vcf ${VCF} \
  --allow-extra-chr --double-id \
  --make-bed --out rice3k_raw

## Step 2: Per-sample QC
plink2 --bfile rice3k_raw \
  --mind 0.05 \       ## remove samples with >5% missing genotypes
  --het \             ## compute heterozygosity (outliers = contamination)
  --out rice3k_sampleqc

## Remove heterozygosity outliers (>3 SD from mean)
Rscript -e '
  het <- read.table("rice3k_sampleqc.het", header=TRUE)
  mean_f <- mean(het$F); sd_f <- sd(het$F)
  fail <- het[abs(het$F - mean_f) > 3 * sd_f, "IID"]
  write.table(data.frame(FID=fail, IID=fail), "het_outliers.txt",
              row.names=FALSE, col.names=FALSE, quote=FALSE)
'
plink2 --bfile rice3k_raw \
  --remove het_outliers.txt \
  --make-bed --out rice3k_sampQC

## Step 3: Per-SNP QC
plink2 --bfile rice3k_sampQC \
  --maf 0.01 \            ## remove MAF < 1%
  --geno 0.05 \           ## remove SNPs with >5% missing rate
  --hwe 1e-6 \            ## remove SNPs failing HWE at p<1e-6
  --snps-only \           ## SNPs only (no indels)
  --make-bed --out rice3k_snpQC

## Step 4: LD pruning for PCA and relatedness (NOT for GWAS itself)
## PCA should be done on LD-pruned SNPs to avoid PC axes being
## dominated by large LD blocks
plink2 --bfile rice3k_snpQC \
  --indep-pairwise 50 10 0.2 \  ## 50-SNP window, step 10, r2<0.2
  --out rice3k_pruning

## Step 5: PCA on LD-pruned SNPs
plink2 --bfile rice3k_snpQC \
  --extract rice3k_pruning.prune.in \
  --pca 20 \   ## top 20 PCs (use PC1-PC5 as GWAS covariates)
  --out rice3k_pca

## Step 6: Relatedness check
plink2 --bfile rice3k_snpQC \
  --extract rice3k_pruning.prune.in \
  --king-cutoff 0.125 \  ## remove one of each pair with kinship > 0.125 (2nd degree)
  --out rice3k_unrelated

echo "Final unrelated samples: $(wc -l < rice3k_unrelated.king.cutoff.in.id)"

Download:

Step 2 — Population Stratification: Why It Matters

Population stratification is the most common source of false positives in GWAS. If allele frequencies differ between subpopulations AND the phenotype also differs between those subpopulations (for any reason), you get spurious associations. The classic solution is to include top principal components (PCs) as covariates.

Genomic Inflation Factor (Lambda)

lambda = median(observed chi2) / 0.4549

0.4549 = median of chi2 distribution with 1 df (expected under null)

lambda = 1.0No inflation -- association test is well-calibrated under the null lambda > 1.1Inflation -- likely cryptic population structure, batch effects, or relatedness. Add more PCs as covariates. lambda > 1.5Severe inflation -- major stratification issue. Check ADMIXTURE plot. May need population-specific GWAS or LMM. lambda < 1.0Deflation -- model is over-correcting (too many PCs). Reduce PC number.

Step 3 — Association Models: GLM vs LMM

Model	Formula	When to use	Tool
Linear regression (GLM)	y = SNP*beta + covariates + epsilon	Unstructured populations; after removing all related individuals and correcting PCs. Fastest.	PLINK2 --linear
Logistic regression	log(p/1-p) = SNP*beta + covariates	Binary phenotypes (case/control). Required for correct p-value calibration with binary traits.	PLINK2 --logistic
Linear mixed model (LMM)	y = SNPbeta + Zu + epsilon; u ~ N(0, sigma2K)	Structured populations (crops, inbred lines, related samples). Accounts for kinship via the genetic relatedness matrix K.	GEMMA, EMMAX, GAPIT3
BGLMM (Bayesian)	Full Bayesian LMM with prior on effect sizes	When effect sizes are of interest; when proportion of causal SNPs is small. Slower but better calibrated.	BGLMM, BayesC, rrBLUP

Bash + R -- GEMMA LMM GWAS + GAPIT3 MLM pipeline

## ================================================================
## GEMMA LMM GWAS -- for structured populations (crops, inbreds)
## GEMMA models population structure via the kinship matrix K
## ================================================================

BFILE="rice3k_snpQC"   ## PLINK binary
PHENO="rice_phenotypes.txt"  ## tab-separated: FID IID trait1 trait2 ...

## Step 1: Compute centered relatedness matrix (kinship)
gemma -bfile ${BFILE} \
  -gk 1 \           ## method 1 = centered kinship (Amin et al.)
  -o rice3k_kinship
## Output: output/rice3k_kinship.cXX.txt

## Step 2: Run LMM GWAS for each trait
## -lmm 4 = output all test statistics (Wald, likelihood ratio, score test)
gemma -bfile ${BFILE} \
  -k output/rice3k_kinship.cXX.txt \
  -lmm 4 \
  -n 1 \            ## column 1 of phenotype file (first trait)
  -o rice3k_trait1_lmm
## Output: output/rice3k_trait1_lmm.assoc.txt

## Step 3: GAPIT3 MLM in R (user-friendly alternative to GEMMA)
## install.packages("devtools")
## devtools::install_github("jiabowang/GAPIT3", force=TRUE)
library(GAPIT3)

myGD <- read.table("rice3k_snps_numeric.txt", header=TRUE) ## numeric genotype matrix
myGM <- read.table("rice3k_snp_map.txt", header=TRUE)      ## SNP map: SNP chr pos
myY  <- read.table("rice3k_phenotypes.txt", header=TRUE)   ## phenotype: taxaID + traits

## Run MLM (Mixed Linear Model) GWAS with PCA covariates
myGAPIT <- GAPIT(
  Y = myY,
  GD = myGD,
  GM = myGM,
  model = "MLM",
  PCA.total = 5,    ## use top 5 PCs as covariates
  SNP.MAF = 0.01,   ## minor allele frequency filter
  file.output = TRUE
)
## Outputs: GAPIT.MLM.*.GWAS.Results.csv -- SNP, Chr, Pos, P.value, MAF, Effect

Download:

Step 4 — Significance Thresholds: Bonferroni vs FDR vs Permutation

Step 5 — Manhattan and QQ Plots

R -- Publication-quality Manhattan and QQ plots with ggplot2

library(ggplot2); library(dplyr)

## Load GEMMA results
gwas <- read.table("output/rice3k_trait1_lmm.assoc.txt",
                   header=TRUE, sep="\t")
gwas$LOG10P <- -log10(gwas$p_lrt)   ## use likelihood ratio test p-value

## Prepare chromosome positions for Manhattan plot
gwas <- gwas %>%
  group_by(chr) %>%
  mutate(chr_max = max(ps)) %>%
  ungroup() %>%
  arrange(chr, ps)

## Cumulative positions
gwas <- gwas %>%
  mutate(chr = as.numeric(chr)) %>%
  group_by(chr) %>%
  mutate(pos_cum = ps + cumsum(c(0, diff(ps)))[1]) %>%
  ungroup()

chr_offsets <- gwas %>%
  group_by(chr) %>%
  summarise(start_cum = min(pos_cum), center = mean(pos_cum), .groups="drop")

## Build cumulative positions across chromosomes
offset <- 0; gwas$pos_plot <- 0
for(ch in sort(unique(gwas$chr))) {
  mask <- gwas$chr == ch
  gwas$pos_plot[mask] <- gwas$ps[mask] + offset
  offset <- offset + max(gwas$ps[mask]) + 5e7
  chr_offsets$center[chr_offsets$chr == ch] <- mean(gwas$pos_plot[mask])
}

bonf <- -log10(0.05 / nrow(gwas))
sugg <- -log10(1 / nrow(gwas))

## Manhattan plot
col_odd  <- "#2166ac"; col_even <- "#74add1"
gwas$colour <- ifelse(gwas$chr %% 2 == 0, col_even, col_odd)
gwas$colour[gwas$LOG10P > bonf] <- "#e31a1c"  ## significant = red

ggplot(gwas, aes(pos_plot, LOG10P, colour=I(colour))) +
  geom_point(size=0.4, alpha=0.8) +
  geom_hline(yintercept=bonf, linetype="dashed", colour="red",   linewidth=0.6) +
  geom_hline(yintercept=sugg, linetype="dotted", colour="orange", linewidth=0.5) +
  scale_x_continuous(breaks=chr_offsets$center,
                     labels=paste0("Chr", chr_offsets$chr)) +
  scale_y_continuous(name="-log10(p)", expand=c(0.02,0)) +
  labs(title="GWAS Manhattan Plot",
       subtitle=paste0(nrow(gwas), " SNPs; red dashed = Bonferroni; orange dotted = suggestive")) +
  theme_minimal(base_size=11) +
  theme(axis.text.x=element_text(angle=45, hjust=1, size=8),
        panel.grid.major.x=element_blank())
ggsave("gwas_manhattan.png", width=14, height=5, dpi=200)

## QQ plot (quantile-quantile) -- diagnoses inflation
observed <- sort(-log10(gwas$p_lrt))
expected <- sort(-log10(ppoints(nrow(gwas))))
lambda   <- round(median(qchisq(gwas$p_lrt, df=1, lower.tail=FALSE), na.rm=TRUE) / 0.4549, 3)

ggplot(data.frame(expected, observed), aes(expected, observed)) +
  geom_point(size=0.5, colour="#2166ac", alpha=0.6) +
  geom_abline(slope=1, intercept=0, colour="red", linewidth=0.7) +
  labs(x="Expected -log10(p)", y="Observed -log10(p)",
       title=paste0("GWAS QQ Plot (lambda = ", lambda, ")"),
       subtitle="Points should lie on the diagonal; deviation above = inflation") +
  theme_bw(base_size=13)
ggsave("gwas_qq_plot.png", width=6, height=6, dpi=200)
cat("Genomic inflation lambda:", lambda, "\n")

Download:

Step 6 — Fine-mapping with SuSiE

Fine-mapping identifies the most likely causal SNP(s) within a GWAS locus. GWAS hits are often in LD with dozens of other SNPs; fine-mapping uses Bayesian models to compute a credible set -- the smallest set of SNPs that contains the causal variant with 95% probability.

R -- SuSiE fine-mapping of a GWAS locus

## ================================================================
## SuSiE (Sum of Single Effects) fine-mapping
## Wang et al. 2020 JRSS-B
## Requires: GWAS summary stats + LD matrix for the locus
## ================================================================
## install.packages("susieR")
library(susieR)
library(dplyr)

## Define locus: top hit region +/- 500 kb
TOP_SNP  <- "S04_12345678"
CHR      <- "chr04"
WIN_HALF <- 5e5  ## 500 kb window

## Load GWAS summary statistics for locus
locus_gwas <- read.table("output/rice3k_trait1_lmm.assoc.txt",
                          header=TRUE, sep="\t") %>%
  filter(chr == CHR,
         abs(ps - 12345678) < WIN_HALF)

cat("SNPs in locus:", nrow(locus_gwas), "\n")

## Compute LD matrix for the locus from individual-level genotypes
## (SuSiE can also accept in-sample LD)
## plink2 --bfile rice3k_snpQC --chr chr04 --from-bp 11845678 --to-bp 12845678
##   --r2-unphased square --out locus_LD
LD_matrix <- as.matrix(read.table("locus_LD.ld", header=FALSE))
rownames(LD_matrix) <- colnames(LD_matrix) <- locus_gwas$rs

## Convert p-values to z-scores for SuSiE-RSS (summary-stat version)
## z = sign(beta) * |qnorm(p/2)|
locus_gwas$z <- sign(locus_gwas$beta) * abs(qnorm(locus_gwas$p_lrt / 2))

## Run SuSiE-RSS
fitted_susie <- susie_rss(
  bhat       = locus_gwas$z,
  shat       = rep(1, nrow(locus_gwas)),   ## z-scores already standardised
  R          = LD_matrix,
  n          = 3000,          ## sample size
  L          = 10,            ## maximum number of causal signals to find
  estimate_residual_variance = TRUE,
  verbose    = TRUE
)

## Extract 95% credible sets
cs <- fitted_susie$sets$cs
cat("\nNumber of credible sets found:", length(cs), "\n")
for(i in seq_along(cs)) {
  cat("CS", i, "-- SNPs:", length(cs[[i]]),
      " -- Covered PIPs:", round(sum(fitted_susie$pip[cs[[i]]]),3), "\n")
  cat("  Top SNP:", locus_gwas$rs[cs[[i]][which.max(fitted_susie$pip[cs[[i]]])]], "\n")
}

## Plot PIP (posterior inclusion probability) for each SNP
pip_df <- data.frame(
  snp  = locus_gwas$rs,
  pos  = locus_gwas$ps,
  pip  = fitted_susie$pip,
  log10p = -log10(locus_gwas$p_lrt)
)

library(ggplot2)
ggplot(pip_df, aes(pos/1e6, pip)) +
  geom_segment(aes(xend=pos/1e6, yend=0), colour="grey70", linewidth=0.4) +
  geom_point(aes(colour=pip > 0.1), size=2) +
  scale_colour_manual(values=c("FALSE"="#6baed6","TRUE"="#e31a1c"),
                      labels=c("PIP < 0.1","PIP > 0.1"), name=NULL) +
  geom_hline(yintercept=0.1, linetype="dashed", colour="red") +
  labs(x="Position (Mb)", y="Posterior inclusion probability (PIP)",
       title=paste0("SuSiE fine-mapping: ", CHR, " locus")) +
  theme_bw(base_size=12)
ggsave("susie_pip_locus.png", width=9, height=4, dpi=200)

Download:

Step 7 — Multi-trait GWAS

Method	What it tests	Power advantage	Tool
MTAG	Multi-trait analysis of GWAS summary statistics	Increases power when traits are genetically correlated; boosts sample size effectively	mtag (Python)
MVMR	Multivariable Mendelian Randomization; instruments multiple traits simultaneously	Disentangles direct vs indirect genetic effects on a phenotype	MendelianRandomization (R)
PHENIX	Jointly models multiple correlated phenotypes in a LMM framework	Power gain proportional to genetic correlation between traits	PHENIX, mvLMM in GEMMA
PheWAS	Scans all phenotypes in a biobank for association with one SNP	Discovers pleiotropic effects; identifies side effects in drug targets	PheWAS (R), UK Biobank PheWeb

R -- GEMMA multivariate LMM (mvLMM) for two correlated traits

## GEMMA multivariate LMM: test association of each SNP
## with two traits simultaneously, accounting for phenotypic correlation
## More powerful than running two separate GWAS when traits are correlated

gemma -bfile rice3k_snpQC \
  -k output/rice3k_kinship.cXX.txt \
  -lmm 4 \
  -n 1 2 \    ## column 1 AND column 2 of phenotype file = two traits
  -o rice3k_mvlmm_trait12
## Output: pnorm test combines evidence across both traits
## A SNP pleiotropically affecting both traits will have a more significant
## mvLMM p-value than either single-trait GWAS alone

Download:

Common Pitfalls

Problem	Symptom	Fix
Spurious hits from population structure	Lambda > 1.2; QQ plot inflated throughout	Add PC1-PC10 as covariates. For highly structured populations (rice indica/japonica) use LMM (GEMMA) instead of GLM. Confirm: run GWAS on simulated random phenotype -- should see lambda = 1.0.
p-value = 0 in output	GEMMA or GAPIT reports p=0 for strongest hits	Numerical underflow. Use -log10(p) directly from the test statistic: -log10(p) = -log10(pchisq(stat, df=1, lower.tail=FALSE)). Or use the log-p output option in GEMMA (-logp flag).
Proximal contamination	Dozens of highly significant SNPs all in the same region	This is normal -- LD block around the causal variant. DO NOT interpret each SNP independently. Clump with PLINK2 --clump or fine-map with SuSiE to identify the credible set.
Sample size too small for rare variants	MAF 1-5% variants never reach significance even with large effects	Aggregate rare variants by gene with SKAT-O or burden test. Minimum N for rare variant GWAS: N > 10,000. For small studies, limit to MAF > 5%.
Phenotype not normally distributed	QQ plot shows deflation in the middle but inflation at the top	Apply inverse-normal transformation before LMM: pheno_norm = qnorm(rank(pheno)/(N+1)). This is standard practice for any skewed quantitative phenotype.

All Tutorials Epigenomics

Advanced GWAS

On this page

GWAS Concepts and Design

Step 1 — GWAS Quality Control and LD Pruning

Step 2 — Population Stratification: Why It Matters

Step 3 — Association Models: GLM vs LMM

Step 4 — Significance Thresholds: Bonferroni vs FDR vs Permutation

Step 5 — Manhattan and QQ Plots

Step 6 — Fine-mapping with SuSiE

Step 7 — Multi-trait GWAS

Common Pitfalls