GWAS: Genome-Wide Association Study

PLINK QC · population stratification · relatedness filtering · GEMMA linear mixed model · Manhattan & QQ plots · LD clumping · SNP annotation

~150 min Advanced Bash / R
Byte

Overview

A Genome-Wide Association Study (GWAS) tests millions of SNPs across the genome for statistical association with a phenotypic trait. The key statistical challenge is controlling for population stratification (ancestry differences between cases and controls) and cryptic relatedness (hidden family relationships). The linear mixed model (LMM) approach used in GEMMA solves both problems simultaneously.

Byte
The GWAS QC pipeline — why each step matters:
  1. Sample QC: Remove samples with high missingness, sex discordance, outlier heterozygosity, or relatedness — these inflate false positives
  2. SNP QC: Remove SNPs with high missingness, low MAF, or deviation from Hardy-Weinberg equilibrium — these are likely genotyping errors
  3. Population stratification: Samples from different ancestries have different allele frequencies. If cases are enriched for one ancestry, thousands of SNPs will appear associated due to ancestry alone
  4. Mixed model: Models genetic relatedness as a random effect, simultaneously controlling for population structure AND cryptic relatedness

Dataset

We use the Arabidopsis thaliana 1001 Genomes flowering time GWAS — a quantitative trait GWAS in a plant model system. Flowering time (FT10 and FT16 — days to flower at 10°C and 16°C) has well-validated GWAS hits including FLC and FRI.

PropertyValue
OrganismArabidopsis thaliana
PhenotypeFlowering time at 10°C (FT10) and 16°C (FT16)
Accession1001 Genomes / GWAPP — Atwell et al. 2010 Nature
Samples199 accessions
SNPs~214,000 SNPs (250K SNP array)
Expected top hitsFRI (Chr4), FLC (Chr5) — the two major vernalisation genes

Step 1 — PLINK QC Filters

Bash — PLINK sample and SNP quality control
Download:
QC thresholds decoded
📊--maf 0.01SNPs with minor allele frequency below 1% have very low statistical power in typical GWAS sample sizes. Even if associated, they are hard to replicate. Also, rare alleles are more likely to be genotyping errors that are hard to detect. Remove them for the main scan; separate rare-variant analyses (SKAT, burden tests) handle rare variants differently.
🎯--hwe 1e-10Hardy-Weinberg Equilibrium deviation can indicate genotyping error. The standard human GWAS cutoff is 1e-6 (controls only). For Arabidopsis (a selfing, near-completely homozygous species), nearly all SNPs deviate from HWE because heterozygosity is intrinsically near-zero. Use a very permissive threshold or skip HWE filtering for inbred organisms.

Step 2 — Population Stratification with PCA

Bash + R — PCA for ancestry inference and stratification control
######################################################################
## STEP 2: Assess population stratification using PCA
## Principal components (PCs) summarise ancestry variation.
## In GEMMA LMM: the kinship matrix accounts for stratification implicitly.
## But visualising PCA first helps you understand your sample composition
## and identify any outliers or unexpected clustering patterns.
######################################################################

## --- LD prune before PCA ---
## PCA on all SNPs is dominated by LD blocks (correlated SNPs count multiple times)
## LD pruning ensures each PCA component reflects independent genomic regions
plink2 \
    --bfile 1_qc/arabidopsis_snpqc \
    --indep-pairwise 50 10 0.1 \
    --allow-extra-chr \
    --out 1_qc/ld_prune

plink2 \
    --bfile 1_qc/arabidopsis_snpqc \
    --extract 1_qc/ld_prune.prune.in \
    --allow-extra-chr \
    --make-bed \
    --out 1_qc/arabidopsis_ldpruned

## --- Compute PCA (top 20 PCs) ---
## --pca 20 : compute top 20 principal components
## Output: .eigenvec (sample coordinates) and .eigenval (variance explained)
plink2 \
    --bfile 1_qc/arabidopsis_ldpruned \
    --pca 20 \
    --allow-extra-chr \
    --out 2_pca/arabidopsis_pca

## --- Plot PCA in R ---
## This section runs in R (save as a separate .R script and run with Rscript)
cat << 'RSCRIPT' > 2_pca/plot_pca.R
library(tidyverse)

## Read PCA eigenvectors
pca <- read.table("2_pca/arabidopsis_pca.eigenvec",
                  header=TRUE)
## Read eigenvalues (variance explained by each PC)
eigenval <- scan("2_pca/arabidopsis_pca.eigenval")
pct_var  <- round(eigenval / sum(eigenval) * 100, 1)

## Read sample metadata (geographic origin)
## Download from 1001genomes.org metadata table
meta <- read.csv("0_raw/1001G_metadata.csv")   # sample ID + geographic region
pca  <- left_join(pca, meta, by=c("#IID"="sample_id"))

## PC1 vs PC2 plot coloured by geographic region
ggplot(pca, aes(PC1, PC2, color=region)) +
    geom_point(size=2, alpha=0.8) +
    labs(x=paste0("PC1 (", pct_var[1], "% variance)"),
         y=paste0("PC2 (", pct_var[2], "% variance)"),
         title="Arabidopsis 1001G — population structure (PCA)",
         color="Geographic region") +
    theme_bw(base_size=12)
ggsave("2_pca/pca_population_structure.png", width=9, height=6, dpi=150)
RSCRIPT

Rscript 2_pca/plot_pca.R
Download:

Step 3 — Relatedness & Kinship Matrix

Bash — Compute kinship matrix with GEMMA for mixed model
######################################################################
## STEP 3: Compute the genetic relatedness (kinship) matrix
## The kinship matrix captures all pairwise genetic similarities.
## GEMMA uses this matrix as a random effect in the linear mixed model,
## simultaneously controlling for both population structure AND
## all cryptic relatedness between samples.
######################################################################

module load gemma/0.98.5

## First, convert PLINK files to BIMBAM format (GEMMA's preferred input)
## OR use PLINK binary format directly with GEMMA's -bfile flag

## --- Option A: Use PLINK binary files directly with GEMMA ---
## Prepare phenotype file: one row per sample, one column per phenotype
## Format: mean-centred phenotype values, one per line (same order as .fam)
## Extract FT10 phenotype from metadata and match sample order in .fam
cut -d' ' -f2 1_qc/arabidopsis_snpqc.fam > 3_gemma/sample_order.txt
## Create phenotype file from 1001G metadata (FT10 = days to flower at 10°C)
## Match sample order to .fam file order

## --- Compute centered relatedness matrix ---
## gemma -gk 1 : compute centered genetic relatedness matrix (GRM)
##   Centered GRM: subtract column means from the genotype matrix before computing
##   This is equivalent to GCTA's GRM; recommended for mixed models
## gemma -gk 2 : compute standardized GRM
##   Standardized: also divide by SD per SNP (accounts for MAF differences)
## -bfile : PLINK binary file prefix
## -o     : output file prefix
## -outdir: output directory
gemma \
    -bfile 1_qc/arabidopsis_snpqc \
    -gk 1 \
    -o arabidopsis_kinship \
    -outdir 3_gemma/

## Output: 3_gemma/arabidopsis_kinship.cXX.txt
## This is a (n × n) matrix where n = number of samples
## Diagonal = estimated relatedness of each individual to itself (~1.0)
## Off-diagonal = estimated pairwise relatedness (range: -1 to 1)

echo "Kinship matrix computed."
echo "Dimensions: $(head -1 3_gemma/arabidopsis_kinship.cXX.txt | wc -w) × $(wc -l < 3_gemma/arabidopsis_kinship.cXX.txt) samples"
Download:

Step 4 — GEMMA Linear Mixed Model Association

Bash — GEMMA BSLMM and univariate LMM for GWAS
######################################################################
## STEP 4: Run GEMMA linear mixed model GWAS
## GEMMA -lm : standard linear model (no kinship correction; for comparison)
## GEMMA -lmm : linear mixed model (uses kinship matrix to control confounding)
## We run both to demonstrate the benefit of the LMM approach.
######################################################################

## --- Prepare phenotype file ---
## GEMMA phenotype file: one numeric value per line (same order as .fam)
## NA or -9 = missing phenotype
## Replace PLINK's -9 missing value with NA
awk '{print ($6 == -9) ? "NA" : $6}' 1_qc/arabidopsis_snpqc.fam > 3_gemma/phenotype_FT10.txt

## --- Run linear mixed model GWAS ---
## -bfile  : PLINK binary input files
## -k      : kinship matrix computed in Step 3
## -lmm 4  : LMM test type:
##   1 = Wald test
##   2 = Likelihood ratio test
##   3 = Score test
##   4 = ALL three tests (recommended: use Wald p-value for final results)
## -pheno  : phenotype file (one trait per column)
## -n 1    : use the 1st phenotype column (if file has multiple traits, specify here)
## -o      : output prefix
## -outdir : output directory
## -maf 0.01 : additional MAF filter within GEMMA (redundant but safe)
gemma \
    -bfile 1_qc/arabidopsis_snpqc \
    -k 3_gemma/arabidopsis_kinship.cXX.txt \
    -lmm 4 \
    -pheno 3_gemma/phenotype_FT10.txt \
    -n 1 \
    -o arabidopsis_FT10_lmm \
    -outdir 3_gemma/ \
    -maf 0.01

## --- Inspect GEMMA output ---
## Columns: chr, rs, ps, n_miss, allele1, allele0, af, beta, se, logl_H1, l_remle, l_mle, p_wald, p_lrt, p_score
## p_wald = Wald test p-value (use this as the primary p-value)
## beta   = effect size (change in phenotype per copy of allele1)
## se     = standard error of beta
head 3_gemma/arabidopsis_FT10_lmm.assoc.txt
echo "Total SNPs tested: $(wc -l < 3_gemma/arabidopsis_FT10_lmm.assoc.txt)"

## Bonferroni significance threshold for this dataset
## = 0.05 / number of SNPs tested
N_SNPS=$(wc -l < 3_gemma/arabidopsis_FT10_lmm.assoc.txt)
echo "Bonferroni threshold: $(python3 -c "print(0.05/${N_SNPS})")"

## Count significant hits
awk -v thresh=$(python3 -c "print(0.05/${N_SNPS})") \
    'NR>1 && $13 < thresh {print}' \
    3_gemma/arabidopsis_FT10_lmm.assoc.txt | wc -l
Download:
GEMMA LMM decoded
📈Why LMM over standard linear regressionStandard linear regression (PLINK --assoc) treats all samples as independent. In reality, related individuals share alleles due to common ancestry, creating correlation in their phenotypes. The LMM models this correlation as a random effect using the kinship matrix — it “subtracts out” the genetic similarity between samples before testing each SNP. This dramatically reduces false positives from population structure.
🎯Bonferroni vs FDRGWAS typically uses genome-wide significance of p < 5×10⁻⁸ for humans (based on ~1 million independent SNPs after LD). For Arabidopsis with ~214k SNPs, the Bonferroni threshold is ~2.3×10⁻⁷. FDR correction is also valid and more powerful when many true signals exist (e.g. highly polygenic traits).

Step 5 — Manhattan & QQ Plots

R — Manhattan plot, QQ plot, and genomic inflation factor
######################################################################
## STEP 5: Visualise GWAS results
## Manhattan plot: -log10(p-value) for each SNP across chromosomes
## QQ plot: observed vs expected p-values under null hypothesis
##   Points deviating above the diagonal = true associations (or inflation)
## Genomic inflation factor (lambda): measures test statistic inflation
##   lambda = 1.0 : no inflation (well-controlled GWAS)
##   lambda > 1.05 : inflation (population stratification not fully corrected)
######################################################################

library(tidyverse)
library(qqman)     # Manhattan and QQ plots for GWAS

## --- Load GEMMA results ---
gwas <- read.table("3_gemma/arabidopsis_FT10_lmm.assoc.txt",
                   header=TRUE, sep="\t") %>%
    rename(CHR=chr, BP=ps, SNP=rs, P=p_wald) %>%
    filter(!is.na(P), P > 0)   # remove missing/zero p-values

## --- Genomic inflation factor (lambda) ---
## lambda: ratio of observed median chi-squared statistic to expected (0.456)
## chi-squared = qchisq(1-p, df=1) for each SNP
chisq     <- qchisq(1 - gwas$P, df=1)
lambda    <- median(chisq, na.rm=TRUE) / 0.456
cat("Genomic inflation factor (lambda):", round(lambda, 3), "\n")
## lambda < 1.05 = well-controlled; > 1.1 = significant inflation

## --- Manhattan plot ---
## Bonferroni threshold for this dataset
bonf_threshold <- -log10(0.05 / nrow(gwas))

png("4_plots/manhattan_FT10.png", width=1600, height=700, res=120)
manhattan(
    gwas,
    chr   = "CHR",        # chromosome column name
    bp    = "BP",         # base pair position column name
    snp   = "SNP",        # SNP ID column name
    p     = "P",          # p-value column name
    suggestiveline = FALSE,
    genomewideline = bonf_threshold,   # Bonferroni threshold line
    col   = c("#4575b4","#d73027"),    # alternating chromosome colours
    cex   = 0.5,          # point size
    main  = "Arabidopsis FT10 GWAS (GEMMA LMM)"
)
dev.off()

## --- QQ plot ---
png("4_plots/qqplot_FT10.png", width=700, height=700, res=120)
qq(gwas$P, main=paste0("QQ plot (lambda = ", round(lambda,3), ")"))
dev.off()

## --- Top hits table ---
sig_snps <- gwas %>%
    filter(P < 0.05 / nrow(gwas)) %>%
    arrange(P) %>%
    head(20)

cat("\nTop 20 significant SNPs:\n")
print(sig_snps[, c("CHR","BP","SNP","P")])
write.csv(sig_snps, "4_plots/top_hits_FT10.csv", row.names=FALSE)
Download:
🍀
Expected Arabidopsis result: The Manhattan plot for FT10 should show a very strong peak on Chromosome 4 at FRI (FRIGIDA) and Chromosome 5 at FLC (FLOWERING LOCUS C) — the two master regulators of vernalisation-dependent flowering. These are the most replicated plant GWAS hits and were published in Atwell et al. 2010 Nature.

Step 6 — LD Clumping & SNP Annotation

Bash + R — LD clumping to identify independent signals, annotation
######################################################################
## STEP 6: LD clumping and functional annotation of significant SNPs
## LD clumping: among all significant SNPs, identify independent signals
## (SNPs in high LD are not independent hits — they reflect the same signal)
## Annotation: link lead SNPs to nearby genes using biomaRt
######################################################################

## --- LD clumping with PLINK ---
## --clump : take a GWAS results file and identify independent signal clusters
## --clump-p1 : primary significance threshold (lead SNP p-value)
## --clump-p2 : secondary threshold (SNPs clumped around lead SNP)
## --clump-r2 : r2 threshold — SNPs with r2 > 0.1 to lead SNP are clumped together
## --clump-kb : window size — only clump SNPs within 500 kb of the lead SNP
## --clump-field : column name for p-values in the GWAS result file
plink \
    --bfile 1_qc/arabidopsis_snpqc \
    --clump 3_gemma/arabidopsis_FT10_lmm.assoc.txt \
    --clump-p1 $(python3 -c "print(0.05/214000)") \
    --clump-p2 0.01 \
    --clump-r2 0.1 \
    --clump-kb 500 \
    --clump-field p_wald \
    --clump-snp-field rs \
    --allow-extra-chr \
    --out 4_plots/arabidopsis_clumped

## View independent signals
echo "Independent GWAS signals after LD clumping:"
cat 4_plots/arabidopsis_clumped.clumped

## --- Annotate lead SNPs with biomaRt ---
## Find the nearest gene to each lead SNP
cat << 'RSCRIPT' > 4_plots/annotate_snps.R
library(tidyverse)
library(biomaRt)

## Read clumped lead SNPs
lead_snps <- read.table("4_plots/arabidopsis_clumped.clumped",
                        header=TRUE) %>%
    dplyr::select(CHR, BP, SNP, P)

## Connect to Ensembl Biomart for Arabidopsis
## dataset = "athaliana_eg_gene" : A. thaliana TAIR10 genes
mart <- useMart("plants_mart",
                dataset = "athaliana_eg_gene",
                host    = "https://plants.ensembl.org")

## For each lead SNP: find genes within ±50 kb
results <- lapply(seq_len(nrow(lead_snps)), function(i) {
    row <- lead_snps[i, ]
    genes <- getBM(
        attributes = c("ensembl_gene_id","external_gene_name",
                       "chromosome_name","start_position","end_position",
                       "description"),
        filters    = c("chromosome_name","start","end"),
        values     = list(row$CHR, max(1, row$BP - 50000), row$BP + 50000),
        mart       = mart
    )
    if(nrow(genes) > 0) mutate(genes, lead_snp=row$SNP, snp_p=row$P)
    else NULL
}) %>% bind_rows()

cat("Annotated lead SNPs:\n")
print(results[, c("lead_snp","snp_p","external_gene_name","description")] %>%
        arrange(snp_p))
write.csv(results, "4_plots/lead_snp_gene_annotation.csv", row.names=FALSE)
RSCRIPT

Rscript 4_plots/annotate_snps.R
Download: