LD pruning, population stratification, mixed models, Bonferroni vs FDR thresholds, Manhattan and QQ plots, fine-mapping with SuSiE/FINEMAP, and multi-trait GWAS. Every p-value threshold formula shown live.

A genome-wide association study (GWAS) tests whether each SNP in the genome is statistically associated with a phenotype of interest. It is essentially a massive regression: for each SNP, regress phenotype on allele count (0, 1, or 2 copies of the minor allele).
| Component | What it is | Key decision |
|---|---|---|
| Genotype matrix | N individuals x M SNPs; each cell = 0, 1, or 2 copies of minor allele | SNP QC: MAF, missing rate, HWE; LD pruning for PCA |
| Phenotype | Quantitative (height, yield) or binary (case/control) | Normalise continuous traits; use logistic regression for binary |
| Covariates | Variables that confound the SNP-phenotype association | Always include: PC1-PC5 (population structure) + any known batch effects |
| Association model | Regression of phenotype on genotype, corrected for covariates | Linear mixed model (LMM) for structured populations; logistic regression for case/control |
| Multiple testing | M tests performed; Bonferroni threshold = 0.05/M | Bonferroni for genome-wide; FDR 5% for discovery with replication |
## ================================================================
## GWAS QC with PLINK2
## Dataset: Rice 3K SNP array or WGS VCF
## ================================================================
VCF="rice3k_snps.vcf.gz"
## Step 1: Convert VCF to PLINK binary format
plink2 --vcf ${VCF} \
--allow-extra-chr --double-id \
--make-bed --out rice3k_raw
## Step 2: Per-sample QC
plink2 --bfile rice3k_raw \
--mind 0.05 \ ## remove samples with >5% missing genotypes
--het \ ## compute heterozygosity (outliers = contamination)
--out rice3k_sampleqc
## Remove heterozygosity outliers (>3 SD from mean)
Rscript -e '
het <- read.table("rice3k_sampleqc.het", header=TRUE)
mean_f <- mean(het$F); sd_f <- sd(het$F)
fail <- het[abs(het$F - mean_f) > 3 * sd_f, "IID"]
write.table(data.frame(FID=fail, IID=fail), "het_outliers.txt",
row.names=FALSE, col.names=FALSE, quote=FALSE)
'
plink2 --bfile rice3k_raw \
--remove het_outliers.txt \
--make-bed --out rice3k_sampQC
## Step 3: Per-SNP QC
plink2 --bfile rice3k_sampQC \
--maf 0.01 \ ## remove MAF < 1%
--geno 0.05 \ ## remove SNPs with >5% missing rate
--hwe 1e-6 \ ## remove SNPs failing HWE at p<1e-6
--snps-only \ ## SNPs only (no indels)
--make-bed --out rice3k_snpQC
## Step 4: LD pruning for PCA and relatedness (NOT for GWAS itself)
## PCA should be done on LD-pruned SNPs to avoid PC axes being
## dominated by large LD blocks
plink2 --bfile rice3k_snpQC \
--indep-pairwise 50 10 0.2 \ ## 50-SNP window, step 10, r2<0.2
--out rice3k_pruning
## Step 5: PCA on LD-pruned SNPs
plink2 --bfile rice3k_snpQC \
--extract rice3k_pruning.prune.in \
--pca 20 \ ## top 20 PCs (use PC1-PC5 as GWAS covariates)
--out rice3k_pca
## Step 6: Relatedness check
plink2 --bfile rice3k_snpQC \
--extract rice3k_pruning.prune.in \
--king-cutoff 0.125 \ ## remove one of each pair with kinship > 0.125 (2nd degree)
--out rice3k_unrelated
echo "Final unrelated samples: $(wc -l < rice3k_unrelated.king.cutoff.in.id)"
Population stratification is the most common source of false positives in GWAS. If allele frequencies differ between subpopulations AND the phenotype also differs between those subpopulations (for any reason), you get spurious associations. The classic solution is to include top principal components (PCs) as covariates.
| Model | Formula | When to use | Tool |
|---|---|---|---|
| Linear regression (GLM) | y = SNP*beta + covariates + epsilon | Unstructured populations; after removing all related individuals and correcting PCs. Fastest. | PLINK2 --linear |
| Logistic regression | log(p/1-p) = SNP*beta + covariates | Binary phenotypes (case/control). Required for correct p-value calibration with binary traits. | PLINK2 --logistic |
| Linear mixed model (LMM) | y = SNP*beta + Zu + epsilon; u ~ N(0, sigma2*K) | Structured populations (crops, inbred lines, related samples). Accounts for kinship via the genetic relatedness matrix K. | GEMMA, EMMAX, GAPIT3 |
| BGLMM (Bayesian) | Full Bayesian LMM with prior on effect sizes | When effect sizes are of interest; when proportion of causal SNPs is small. Slower but better calibrated. | BGLMM, BayesC, rrBLUP |
## ================================================================
## GEMMA LMM GWAS -- for structured populations (crops, inbreds)
## GEMMA models population structure via the kinship matrix K
## ================================================================
BFILE="rice3k_snpQC" ## PLINK binary
PHENO="rice_phenotypes.txt" ## tab-separated: FID IID trait1 trait2 ...
## Step 1: Compute centered relatedness matrix (kinship)
gemma -bfile ${BFILE} \
-gk 1 \ ## method 1 = centered kinship (Amin et al.)
-o rice3k_kinship
## Output: output/rice3k_kinship.cXX.txt
## Step 2: Run LMM GWAS for each trait
## -lmm 4 = output all test statistics (Wald, likelihood ratio, score test)
gemma -bfile ${BFILE} \
-k output/rice3k_kinship.cXX.txt \
-lmm 4 \
-n 1 \ ## column 1 of phenotype file (first trait)
-o rice3k_trait1_lmm
## Output: output/rice3k_trait1_lmm.assoc.txt
## Step 3: GAPIT3 MLM in R (user-friendly alternative to GEMMA)
## install.packages("devtools")
## devtools::install_github("jiabowang/GAPIT3", force=TRUE)
library(GAPIT3)
myGD <- read.table("rice3k_snps_numeric.txt", header=TRUE) ## numeric genotype matrix
myGM <- read.table("rice3k_snp_map.txt", header=TRUE) ## SNP map: SNP chr pos
myY <- read.table("rice3k_phenotypes.txt", header=TRUE) ## phenotype: taxaID + traits
## Run MLM (Mixed Linear Model) GWAS with PCA covariates
myGAPIT <- GAPIT(
Y = myY,
GD = myGD,
GM = myGM,
model = "MLM",
PCA.total = 5, ## use top 5 PCs as covariates
SNP.MAF = 0.01, ## minor allele frequency filter
file.output = TRUE
)
## Outputs: GAPIT.MLM.*.GWAS.Results.csv -- SNP, Chr, Pos, P.value, MAF, Effect
library(ggplot2); library(dplyr)
## Load GEMMA results
gwas <- read.table("output/rice3k_trait1_lmm.assoc.txt",
header=TRUE, sep="\t")
gwas$LOG10P <- -log10(gwas$p_lrt) ## use likelihood ratio test p-value
## Prepare chromosome positions for Manhattan plot
gwas <- gwas %>%
group_by(chr) %>%
mutate(chr_max = max(ps)) %>%
ungroup() %>%
arrange(chr, ps)
## Cumulative positions
gwas <- gwas %>%
mutate(chr = as.numeric(chr)) %>%
group_by(chr) %>%
mutate(pos_cum = ps + cumsum(c(0, diff(ps)))[1]) %>%
ungroup()
chr_offsets <- gwas %>%
group_by(chr) %>%
summarise(start_cum = min(pos_cum), center = mean(pos_cum), .groups="drop")
## Build cumulative positions across chromosomes
offset <- 0; gwas$pos_plot <- 0
for(ch in sort(unique(gwas$chr))) {
mask <- gwas$chr == ch
gwas$pos_plot[mask] <- gwas$ps[mask] + offset
offset <- offset + max(gwas$ps[mask]) + 5e7
chr_offsets$center[chr_offsets$chr == ch] <- mean(gwas$pos_plot[mask])
}
bonf <- -log10(0.05 / nrow(gwas))
sugg <- -log10(1 / nrow(gwas))
## Manhattan plot
col_odd <- "#2166ac"; col_even <- "#74add1"
gwas$colour <- ifelse(gwas$chr %% 2 == 0, col_even, col_odd)
gwas$colour[gwas$LOG10P > bonf] <- "#e31a1c" ## significant = red
ggplot(gwas, aes(pos_plot, LOG10P, colour=I(colour))) +
geom_point(size=0.4, alpha=0.8) +
geom_hline(yintercept=bonf, linetype="dashed", colour="red", linewidth=0.6) +
geom_hline(yintercept=sugg, linetype="dotted", colour="orange", linewidth=0.5) +
scale_x_continuous(breaks=chr_offsets$center,
labels=paste0("Chr", chr_offsets$chr)) +
scale_y_continuous(name="-log10(p)", expand=c(0.02,0)) +
labs(title="GWAS Manhattan Plot",
subtitle=paste0(nrow(gwas), " SNPs; red dashed = Bonferroni; orange dotted = suggestive")) +
theme_minimal(base_size=11) +
theme(axis.text.x=element_text(angle=45, hjust=1, size=8),
panel.grid.major.x=element_blank())
ggsave("gwas_manhattan.png", width=14, height=5, dpi=200)
## QQ plot (quantile-quantile) -- diagnoses inflation
observed <- sort(-log10(gwas$p_lrt))
expected <- sort(-log10(ppoints(nrow(gwas))))
lambda <- round(median(qchisq(gwas$p_lrt, df=1, lower.tail=FALSE), na.rm=TRUE) / 0.4549, 3)
ggplot(data.frame(expected, observed), aes(expected, observed)) +
geom_point(size=0.5, colour="#2166ac", alpha=0.6) +
geom_abline(slope=1, intercept=0, colour="red", linewidth=0.7) +
labs(x="Expected -log10(p)", y="Observed -log10(p)",
title=paste0("GWAS QQ Plot (lambda = ", lambda, ")"),
subtitle="Points should lie on the diagonal; deviation above = inflation") +
theme_bw(base_size=13)
ggsave("gwas_qq_plot.png", width=6, height=6, dpi=200)
cat("Genomic inflation lambda:", lambda, "\n")
Fine-mapping identifies the most likely causal SNP(s) within a GWAS locus. GWAS hits are often in LD with dozens of other SNPs; fine-mapping uses Bayesian models to compute a credible set -- the smallest set of SNPs that contains the causal variant with 95% probability.
## ================================================================
## SuSiE (Sum of Single Effects) fine-mapping
## Wang et al. 2020 JRSS-B
## Requires: GWAS summary stats + LD matrix for the locus
## ================================================================
## install.packages("susieR")
library(susieR)
library(dplyr)
## Define locus: top hit region +/- 500 kb
TOP_SNP <- "S04_12345678"
CHR <- "chr04"
WIN_HALF <- 5e5 ## 500 kb window
## Load GWAS summary statistics for locus
locus_gwas <- read.table("output/rice3k_trait1_lmm.assoc.txt",
header=TRUE, sep="\t") %>%
filter(chr == CHR,
abs(ps - 12345678) < WIN_HALF)
cat("SNPs in locus:", nrow(locus_gwas), "\n")
## Compute LD matrix for the locus from individual-level genotypes
## (SuSiE can also accept in-sample LD)
## plink2 --bfile rice3k_snpQC --chr chr04 --from-bp 11845678 --to-bp 12845678
## --r2-unphased square --out locus_LD
LD_matrix <- as.matrix(read.table("locus_LD.ld", header=FALSE))
rownames(LD_matrix) <- colnames(LD_matrix) <- locus_gwas$rs
## Convert p-values to z-scores for SuSiE-RSS (summary-stat version)
## z = sign(beta) * |qnorm(p/2)|
locus_gwas$z <- sign(locus_gwas$beta) * abs(qnorm(locus_gwas$p_lrt / 2))
## Run SuSiE-RSS
fitted_susie <- susie_rss(
bhat = locus_gwas$z,
shat = rep(1, nrow(locus_gwas)), ## z-scores already standardised
R = LD_matrix,
n = 3000, ## sample size
L = 10, ## maximum number of causal signals to find
estimate_residual_variance = TRUE,
verbose = TRUE
)
## Extract 95% credible sets
cs <- fitted_susie$sets$cs
cat("\nNumber of credible sets found:", length(cs), "\n")
for(i in seq_along(cs)) {
cat("CS", i, "-- SNPs:", length(cs[[i]]),
" -- Covered PIPs:", round(sum(fitted_susie$pip[cs[[i]]]),3), "\n")
cat(" Top SNP:", locus_gwas$rs[cs[[i]][which.max(fitted_susie$pip[cs[[i]]])]], "\n")
}
## Plot PIP (posterior inclusion probability) for each SNP
pip_df <- data.frame(
snp = locus_gwas$rs,
pos = locus_gwas$ps,
pip = fitted_susie$pip,
log10p = -log10(locus_gwas$p_lrt)
)
library(ggplot2)
ggplot(pip_df, aes(pos/1e6, pip)) +
geom_segment(aes(xend=pos/1e6, yend=0), colour="grey70", linewidth=0.4) +
geom_point(aes(colour=pip > 0.1), size=2) +
scale_colour_manual(values=c("FALSE"="#6baed6","TRUE"="#e31a1c"),
labels=c("PIP < 0.1","PIP > 0.1"), name=NULL) +
geom_hline(yintercept=0.1, linetype="dashed", colour="red") +
labs(x="Position (Mb)", y="Posterior inclusion probability (PIP)",
title=paste0("SuSiE fine-mapping: ", CHR, " locus")) +
theme_bw(base_size=12)
ggsave("susie_pip_locus.png", width=9, height=4, dpi=200)
| Method | What it tests | Power advantage | Tool |
|---|---|---|---|
| MTAG | Multi-trait analysis of GWAS summary statistics | Increases power when traits are genetically correlated; boosts sample size effectively | mtag (Python) |
| MVMR | Multivariable Mendelian Randomization; instruments multiple traits simultaneously | Disentangles direct vs indirect genetic effects on a phenotype | MendelianRandomization (R) |
| PHENIX | Jointly models multiple correlated phenotypes in a LMM framework | Power gain proportional to genetic correlation between traits | PHENIX, mvLMM in GEMMA |
| PheWAS | Scans all phenotypes in a biobank for association with one SNP | Discovers pleiotropic effects; identifies side effects in drug targets | PheWAS (R), UK Biobank PheWeb |
## GEMMA multivariate LMM: test association of each SNP ## with two traits simultaneously, accounting for phenotypic correlation ## More powerful than running two separate GWAS when traits are correlated gemma -bfile rice3k_snpQC \ -k output/rice3k_kinship.cXX.txt \ -lmm 4 \ -n 1 2 \ ## column 1 AND column 2 of phenotype file = two traits -o rice3k_mvlmm_trait12 ## Output: pnorm test combines evidence across both traits ## A SNP pleiotropically affecting both traits will have a more significant ## mvLMM p-value than either single-trait GWAS alone
| Problem | Symptom | Fix |
|---|---|---|
| Spurious hits from population structure | Lambda > 1.2; QQ plot inflated throughout | Add PC1-PC10 as covariates. For highly structured populations (rice indica/japonica) use LMM (GEMMA) instead of GLM. Confirm: run GWAS on simulated random phenotype -- should see lambda = 1.0. |
| p-value = 0 in output | GEMMA or GAPIT reports p=0 for strongest hits | Numerical underflow. Use -log10(p) directly from the test statistic: -log10(p) = -log10(pchisq(stat, df=1, lower.tail=FALSE)). Or use the log-p output option in GEMMA (-logp flag). |
| Proximal contamination | Dozens of highly significant SNPs all in the same region | This is normal -- LD block around the causal variant. DO NOT interpret each SNP independently. Clump with PLINK2 --clump or fine-map with SuSiE to identify the credible set. |
| Sample size too small for rare variants | MAF 1-5% variants never reach significance even with large effects | Aggregate rare variants by gene with SKAT-O or burden test. Minimum N for rare variant GWAS: N > 10,000. For small studies, limit to MAF > 5%. |
| Phenotype not normally distributed | QQ plot shows deflation in the middle but inflation at the top | Apply inverse-normal transformation before LMM: pheno_norm = qnorm(rank(pheno)/(N+1)). This is standard practice for any skewed quantitative phenotype. |