PLINK QC · population stratification · relatedness filtering · GEMMA linear mixed model · Manhattan & QQ plots · LD clumping · SNP annotation
A Genome-Wide Association Study (GWAS) tests millions of SNPs across the genome for statistical association with a phenotypic trait. The key statistical challenge is controlling for population stratification (ancestry differences between cases and controls) and cryptic relatedness (hidden family relationships). The linear mixed model (LMM) approach used in GEMMA solves both problems simultaneously.
We use the Arabidopsis thaliana 1001 Genomes flowering time GWAS — a quantitative trait GWAS in a plant model system. Flowering time (FT10 and FT16 — days to flower at 10°C and 16°C) has well-validated GWAS hits including FLC and FRI.
| Property | Value |
|---|---|
| Organism | Arabidopsis thaliana |
| Phenotype | Flowering time at 10°C (FT10) and 16°C (FT16) |
| Accession | 1001 Genomes / GWAPP — Atwell et al. 2010 Nature |
| Samples | 199 accessions |
| SNPs | ~214,000 SNPs (250K SNP array) |
| Expected top hits | FRI (Chr4), FLC (Chr5) — the two major vernalisation genes |
######################################################################
## STEP 1: Comprehensive QC with PLINK2
## Run QC in stages: first sample QC, then SNP QC
## (remove poor-quality samples first, THEN filter SNPs on the cleaned sample set)
## Reversing this order gives slightly different results because SNP missingness
## changes when poor-quality samples are removed.
######################################################################
module load plink2/2.0
module load plink/1.9 # needed for some legacy commands
## --- Download Arabidopsis 1001G GWAS data ---
wget -O 0_raw/arabidopsis_1001G.vcf.gz \
https://1001genomes.org/data/GMI-MPI/releases/v3.1/intersection_snp_short_indel_vcf/all_chromosomes_except_mitochondria_chloroplast.vcf.gz
## Convert VCF to PLINK binary format
plink2 \
--vcf 0_raw/arabidopsis_1001G.vcf.gz \
--make-bed \
--out 1_qc/arabidopsis_raw \
--allow-extra-chr # needed for non-human chromosome names
## ================================================================
## STAGE 1: SAMPLE QC
## ================================================================
## --- 1a: Sample missingness ---
## --missing : compute per-sample and per-SNP missingness rates
## Creates: .smiss (sample) and .vmiss (variant) missingness files
plink2 \
--bfile 1_qc/arabidopsis_raw \
--missing \
--allow-extra-chr \
--out 1_qc/arabidopsis_missing
## Remove samples with > 5% missing genotypes
## --mind 0.05 : exclude samples with genotype call rate < 95%
plink2 \
--bfile 1_qc/arabidopsis_raw \
--mind 0.05 \
--allow-extra-chr \
--make-bed \
--out 1_qc/arabidopsis_sampqc
## --- 1b: Heterozygosity outliers ---
## Inbreeding F statistic: measures excess homozygosity
## For inbred organisms (Arabidopsis is selfing, F > 0.9 expected),
## use a wider cutoff; for outbred humans, remove F > 0.2 or F < -0.2
## --het : compute observed vs expected heterozygosity per sample
plink --bfile 1_qc/arabidopsis_sampqc \
--het --allow-no-sex --allow-extra-chr \
--out 1_qc/arabidopsis_het
## Plot heterozygosity in R and remove outliers (> 3 SD from mean)
## This is done in the R visualisation step below
## ================================================================
## STAGE 2: SNP QC
## ================================================================
## Run all SNP filters in one command (order matters inside PLINK):
## --geno 0.05 : remove SNPs with > 5% missing genotypes across samples
## (monomorphic variants and array failures tend to have high missingness)
## --maf 0.01 : remove SNPs with minor allele frequency < 1%
## (very rare variants have very low statistical power and are often errors)
## --hwe 1e-6 : remove SNPs deviating from Hardy-Weinberg Equilibrium
## (HWE p < 1e-6 suggests genotyping error or strong selection/inbreeding)
## For highly inbred organisms (Arabidopsis), use very permissive cutoff (1e-10)
## or skip HWE filtering entirely
## --snps-only just-acgt : keep only SNPs (no indels), only ACGT alleles
plink2 \
--bfile 1_qc/arabidopsis_sampqc \
--geno 0.05 \
--maf 0.01 \
--hwe 1e-10 \
--snps-only just-acgt \
--allow-extra-chr \
--make-bed \
--out 1_qc/arabidopsis_snpqc
## --- Report QC summary ---
echo "=== QC SUMMARY ==="
echo "Raw samples: $(wc -l < 1_qc/arabidopsis_raw.fam)"
echo "Raw SNPs: $(wc -l < 1_qc/arabidopsis_raw.bim)"
echo "After QC samples: $(wc -l < 1_qc/arabidopsis_snpqc.fam)"
echo "After QC SNPs: $(wc -l < 1_qc/arabidopsis_snpqc.bim)"
| 📊--maf 0.01 | SNPs with minor allele frequency below 1% have very low statistical power in typical GWAS sample sizes. Even if associated, they are hard to replicate. Also, rare alleles are more likely to be genotyping errors that are hard to detect. Remove them for the main scan; separate rare-variant analyses (SKAT, burden tests) handle rare variants differently. |
| 🎯--hwe 1e-10 | Hardy-Weinberg Equilibrium deviation can indicate genotyping error. The standard human GWAS cutoff is 1e-6 (controls only). For Arabidopsis (a selfing, near-completely homozygous species), nearly all SNPs deviate from HWE because heterozygosity is intrinsically near-zero. Use a very permissive threshold or skip HWE filtering for inbred organisms. |
######################################################################
## STEP 2: Assess population stratification using PCA
## Principal components (PCs) summarise ancestry variation.
## In GEMMA LMM: the kinship matrix accounts for stratification implicitly.
## But visualising PCA first helps you understand your sample composition
## and identify any outliers or unexpected clustering patterns.
######################################################################
## --- LD prune before PCA ---
## PCA on all SNPs is dominated by LD blocks (correlated SNPs count multiple times)
## LD pruning ensures each PCA component reflects independent genomic regions
plink2 \
--bfile 1_qc/arabidopsis_snpqc \
--indep-pairwise 50 10 0.1 \
--allow-extra-chr \
--out 1_qc/ld_prune
plink2 \
--bfile 1_qc/arabidopsis_snpqc \
--extract 1_qc/ld_prune.prune.in \
--allow-extra-chr \
--make-bed \
--out 1_qc/arabidopsis_ldpruned
## --- Compute PCA (top 20 PCs) ---
## --pca 20 : compute top 20 principal components
## Output: .eigenvec (sample coordinates) and .eigenval (variance explained)
plink2 \
--bfile 1_qc/arabidopsis_ldpruned \
--pca 20 \
--allow-extra-chr \
--out 2_pca/arabidopsis_pca
## --- Plot PCA in R ---
## This section runs in R (save as a separate .R script and run with Rscript)
cat << 'RSCRIPT' > 2_pca/plot_pca.R
library(tidyverse)
## Read PCA eigenvectors
pca <- read.table("2_pca/arabidopsis_pca.eigenvec",
header=TRUE)
## Read eigenvalues (variance explained by each PC)
eigenval <- scan("2_pca/arabidopsis_pca.eigenval")
pct_var <- round(eigenval / sum(eigenval) * 100, 1)
## Read sample metadata (geographic origin)
## Download from 1001genomes.org metadata table
meta <- read.csv("0_raw/1001G_metadata.csv") # sample ID + geographic region
pca <- left_join(pca, meta, by=c("#IID"="sample_id"))
## PC1 vs PC2 plot coloured by geographic region
ggplot(pca, aes(PC1, PC2, color=region)) +
geom_point(size=2, alpha=0.8) +
labs(x=paste0("PC1 (", pct_var[1], "% variance)"),
y=paste0("PC2 (", pct_var[2], "% variance)"),
title="Arabidopsis 1001G — population structure (PCA)",
color="Geographic region") +
theme_bw(base_size=12)
ggsave("2_pca/pca_population_structure.png", width=9, height=6, dpi=150)
RSCRIPT
Rscript 2_pca/plot_pca.R
######################################################################
## STEP 3: Compute the genetic relatedness (kinship) matrix
## The kinship matrix captures all pairwise genetic similarities.
## GEMMA uses this matrix as a random effect in the linear mixed model,
## simultaneously controlling for both population structure AND
## all cryptic relatedness between samples.
######################################################################
module load gemma/0.98.5
## First, convert PLINK files to BIMBAM format (GEMMA's preferred input)
## OR use PLINK binary format directly with GEMMA's -bfile flag
## --- Option A: Use PLINK binary files directly with GEMMA ---
## Prepare phenotype file: one row per sample, one column per phenotype
## Format: mean-centred phenotype values, one per line (same order as .fam)
## Extract FT10 phenotype from metadata and match sample order in .fam
cut -d' ' -f2 1_qc/arabidopsis_snpqc.fam > 3_gemma/sample_order.txt
## Create phenotype file from 1001G metadata (FT10 = days to flower at 10°C)
## Match sample order to .fam file order
## --- Compute centered relatedness matrix ---
## gemma -gk 1 : compute centered genetic relatedness matrix (GRM)
## Centered GRM: subtract column means from the genotype matrix before computing
## This is equivalent to GCTA's GRM; recommended for mixed models
## gemma -gk 2 : compute standardized GRM
## Standardized: also divide by SD per SNP (accounts for MAF differences)
## -bfile : PLINK binary file prefix
## -o : output file prefix
## -outdir: output directory
gemma \
-bfile 1_qc/arabidopsis_snpqc \
-gk 1 \
-o arabidopsis_kinship \
-outdir 3_gemma/
## Output: 3_gemma/arabidopsis_kinship.cXX.txt
## This is a (n × n) matrix where n = number of samples
## Diagonal = estimated relatedness of each individual to itself (~1.0)
## Off-diagonal = estimated pairwise relatedness (range: -1 to 1)
echo "Kinship matrix computed."
echo "Dimensions: $(head -1 3_gemma/arabidopsis_kinship.cXX.txt | wc -w) × $(wc -l < 3_gemma/arabidopsis_kinship.cXX.txt) samples"
######################################################################
## STEP 4: Run GEMMA linear mixed model GWAS
## GEMMA -lm : standard linear model (no kinship correction; for comparison)
## GEMMA -lmm : linear mixed model (uses kinship matrix to control confounding)
## We run both to demonstrate the benefit of the LMM approach.
######################################################################
## --- Prepare phenotype file ---
## GEMMA phenotype file: one numeric value per line (same order as .fam)
## NA or -9 = missing phenotype
## Replace PLINK's -9 missing value with NA
awk '{print ($6 == -9) ? "NA" : $6}' 1_qc/arabidopsis_snpqc.fam > 3_gemma/phenotype_FT10.txt
## --- Run linear mixed model GWAS ---
## -bfile : PLINK binary input files
## -k : kinship matrix computed in Step 3
## -lmm 4 : LMM test type:
## 1 = Wald test
## 2 = Likelihood ratio test
## 3 = Score test
## 4 = ALL three tests (recommended: use Wald p-value for final results)
## -pheno : phenotype file (one trait per column)
## -n 1 : use the 1st phenotype column (if file has multiple traits, specify here)
## -o : output prefix
## -outdir : output directory
## -maf 0.01 : additional MAF filter within GEMMA (redundant but safe)
gemma \
-bfile 1_qc/arabidopsis_snpqc \
-k 3_gemma/arabidopsis_kinship.cXX.txt \
-lmm 4 \
-pheno 3_gemma/phenotype_FT10.txt \
-n 1 \
-o arabidopsis_FT10_lmm \
-outdir 3_gemma/ \
-maf 0.01
## --- Inspect GEMMA output ---
## Columns: chr, rs, ps, n_miss, allele1, allele0, af, beta, se, logl_H1, l_remle, l_mle, p_wald, p_lrt, p_score
## p_wald = Wald test p-value (use this as the primary p-value)
## beta = effect size (change in phenotype per copy of allele1)
## se = standard error of beta
head 3_gemma/arabidopsis_FT10_lmm.assoc.txt
echo "Total SNPs tested: $(wc -l < 3_gemma/arabidopsis_FT10_lmm.assoc.txt)"
## Bonferroni significance threshold for this dataset
## = 0.05 / number of SNPs tested
N_SNPS=$(wc -l < 3_gemma/arabidopsis_FT10_lmm.assoc.txt)
echo "Bonferroni threshold: $(python3 -c "print(0.05/${N_SNPS})")"
## Count significant hits
awk -v thresh=$(python3 -c "print(0.05/${N_SNPS})") \
'NR>1 && $13 < thresh {print}' \
3_gemma/arabidopsis_FT10_lmm.assoc.txt | wc -l
| 📈Why LMM over standard linear regression | Standard linear regression (PLINK --assoc) treats all samples as independent. In reality, related individuals share alleles due to common ancestry, creating correlation in their phenotypes. The LMM models this correlation as a random effect using the kinship matrix — it “subtracts out” the genetic similarity between samples before testing each SNP. This dramatically reduces false positives from population structure. |
| 🎯Bonferroni vs FDR | GWAS typically uses genome-wide significance of p < 5×10⁻⁸ for humans (based on ~1 million independent SNPs after LD). For Arabidopsis with ~214k SNPs, the Bonferroni threshold is ~2.3×10⁻⁷. FDR correction is also valid and more powerful when many true signals exist (e.g. highly polygenic traits). |
######################################################################
## STEP 5: Visualise GWAS results
## Manhattan plot: -log10(p-value) for each SNP across chromosomes
## QQ plot: observed vs expected p-values under null hypothesis
## Points deviating above the diagonal = true associations (or inflation)
## Genomic inflation factor (lambda): measures test statistic inflation
## lambda = 1.0 : no inflation (well-controlled GWAS)
## lambda > 1.05 : inflation (population stratification not fully corrected)
######################################################################
library(tidyverse)
library(qqman) # Manhattan and QQ plots for GWAS
## --- Load GEMMA results ---
gwas <- read.table("3_gemma/arabidopsis_FT10_lmm.assoc.txt",
header=TRUE, sep="\t") %>%
rename(CHR=chr, BP=ps, SNP=rs, P=p_wald) %>%
filter(!is.na(P), P > 0) # remove missing/zero p-values
## --- Genomic inflation factor (lambda) ---
## lambda: ratio of observed median chi-squared statistic to expected (0.456)
## chi-squared = qchisq(1-p, df=1) for each SNP
chisq <- qchisq(1 - gwas$P, df=1)
lambda <- median(chisq, na.rm=TRUE) / 0.456
cat("Genomic inflation factor (lambda):", round(lambda, 3), "\n")
## lambda < 1.05 = well-controlled; > 1.1 = significant inflation
## --- Manhattan plot ---
## Bonferroni threshold for this dataset
bonf_threshold <- -log10(0.05 / nrow(gwas))
png("4_plots/manhattan_FT10.png", width=1600, height=700, res=120)
manhattan(
gwas,
chr = "CHR", # chromosome column name
bp = "BP", # base pair position column name
snp = "SNP", # SNP ID column name
p = "P", # p-value column name
suggestiveline = FALSE,
genomewideline = bonf_threshold, # Bonferroni threshold line
col = c("#4575b4","#d73027"), # alternating chromosome colours
cex = 0.5, # point size
main = "Arabidopsis FT10 GWAS (GEMMA LMM)"
)
dev.off()
## --- QQ plot ---
png("4_plots/qqplot_FT10.png", width=700, height=700, res=120)
qq(gwas$P, main=paste0("QQ plot (lambda = ", round(lambda,3), ")"))
dev.off()
## --- Top hits table ---
sig_snps <- gwas %>%
filter(P < 0.05 / nrow(gwas)) %>%
arrange(P) %>%
head(20)
cat("\nTop 20 significant SNPs:\n")
print(sig_snps[, c("CHR","BP","SNP","P")])
write.csv(sig_snps, "4_plots/top_hits_FT10.csv", row.names=FALSE)
######################################################################
## STEP 6: LD clumping and functional annotation of significant SNPs
## LD clumping: among all significant SNPs, identify independent signals
## (SNPs in high LD are not independent hits — they reflect the same signal)
## Annotation: link lead SNPs to nearby genes using biomaRt
######################################################################
## --- LD clumping with PLINK ---
## --clump : take a GWAS results file and identify independent signal clusters
## --clump-p1 : primary significance threshold (lead SNP p-value)
## --clump-p2 : secondary threshold (SNPs clumped around lead SNP)
## --clump-r2 : r2 threshold — SNPs with r2 > 0.1 to lead SNP are clumped together
## --clump-kb : window size — only clump SNPs within 500 kb of the lead SNP
## --clump-field : column name for p-values in the GWAS result file
plink \
--bfile 1_qc/arabidopsis_snpqc \
--clump 3_gemma/arabidopsis_FT10_lmm.assoc.txt \
--clump-p1 $(python3 -c "print(0.05/214000)") \
--clump-p2 0.01 \
--clump-r2 0.1 \
--clump-kb 500 \
--clump-field p_wald \
--clump-snp-field rs \
--allow-extra-chr \
--out 4_plots/arabidopsis_clumped
## View independent signals
echo "Independent GWAS signals after LD clumping:"
cat 4_plots/arabidopsis_clumped.clumped
## --- Annotate lead SNPs with biomaRt ---
## Find the nearest gene to each lead SNP
cat << 'RSCRIPT' > 4_plots/annotate_snps.R
library(tidyverse)
library(biomaRt)
## Read clumped lead SNPs
lead_snps <- read.table("4_plots/arabidopsis_clumped.clumped",
header=TRUE) %>%
dplyr::select(CHR, BP, SNP, P)
## Connect to Ensembl Biomart for Arabidopsis
## dataset = "athaliana_eg_gene" : A. thaliana TAIR10 genes
mart <- useMart("plants_mart",
dataset = "athaliana_eg_gene",
host = "https://plants.ensembl.org")
## For each lead SNP: find genes within ±50 kb
results <- lapply(seq_len(nrow(lead_snps)), function(i) {
row <- lead_snps[i, ]
genes <- getBM(
attributes = c("ensembl_gene_id","external_gene_name",
"chromosome_name","start_position","end_position",
"description"),
filters = c("chromosome_name","start","end"),
values = list(row$CHR, max(1, row$BP - 50000), row$BP + 50000),
mart = mart
)
if(nrow(genes) > 0) mutate(genes, lead_snp=row$SNP, snp_p=row$P)
else NULL
}) %>% bind_rows()
cat("Annotated lead SNPs:\n")
print(results[, c("lead_snp","snp_p","external_gene_name","description")] %>%
arrange(snp_p))
write.csv(results, "4_plots/lead_snp_gene_annotation.csv", row.names=FALSE)
RSCRIPT
Rscript 4_plots/annotate_snps.R