QTL Mapping Across Populations

F2 · RIL · NAM · GWAS populations · R/qtl2 interval mapping · composite interval mapping · LOD threshold permutation · raw data in Excel · GAPIT mixed model · multi-environment QTL

~150 min Intermediate R / Excel
Byte

Concepts: Mapping Population Types

QTL mapping connects genotype (which allele at each marker) to phenotype (the trait value). The power and resolution depends entirely on your mapping population. Choosing the right population design determines everything that follows.

PopulationHow madeRecombinationsResolutionBest for
F2 Cross two inbred parents → F1 → self to get F2 1–2 generations of recombination Low (10–30 cM confidence intervals) Quick pilot QTL study; dominant/additive/epistasis all detectable in one cross
RIL (Recombinant Inbred Lines) F2 selfed for 6–10 generations until near-homozygous 6–10 generations of accumulated recombinations Moderate (5–20 cM). Better than F2. Permanent population — can be replicated. Better QTL resolution than F2. Most common in plants.
BC (Backcross) F1 × recurrent parent 1 generation Low — similar to F2 Introgression breeding; transferring traits from wild relative to elite line
NAM (Nested Association Mapping) Many RIL families each crossing a common reference parent with diverse founders Many generations + diverse haplotype backgrounds High (1–5 cM) — combines linkage + LD Dissecting complex traits with rare alleles; maize, wheat, sorghum NAM panels
GWAS Panel Diverse unrelated accessions/varieties Thousands of generations of historical recombination Very high (<100 kb) but requires many markers Fine mapping; identifying causative SNPs; human disease genetics
Byte
Why does population type affect analysis method?
  • F2 and BC: Each individual has 3 genotype classes at each marker (AA, AB, BB for F2; AA, AB for BC). Test whether mean phenotype differs across genotype classes using ANOVA/regression.
  • RIL: Nearly homozygous (only AA or BB). Simpler 2-class model. Because they can be replicated (same genotype, multiple plants), phenotyping is more accurate.
  • GWAS: Population structure (relatedness between accessions) inflates false positives. MUST use a mixed model (MLM) with a kinship/relatedness matrix to control for structure. R/qtl2 not needed — use GAPIT or GEMMA.

Raw Data: What QTL Input Data Looks Like

Before writing code, it's critical to understand the structure of QTL input data. There are two tables you always need:

Table 1: Genotype Matrix
IndividualM1_chr1_10MbM2_chr1_25MbM3_chr2_5Mb
RIL_001AAAABB
RIL_002BBABAA
RIL_003AABBBB
RIL_004BBBBAA
Table 2: Phenotype Data
Individualdays_to_flowerplant_height_cmyield_g
RIL_00162.498.234.1
RIL_00271.0112.528.7
RIL_00358.889.640.2
RIL_00479.5125.322.4

You also need a genetic map — a table that tells the software where each marker sits in the genome:

CSV — Genetic map format (R/qtl2 compatible)
## Genetic map: three columns
## marker  = marker name (must match column headers in genotype matrix)
## chr     = chromosome number (use integers, not "chr1")
## pos     = genetic position in centiMorgans (cM)
##           OR physical position in Mbp (use pos_type="Mbp" in R/qtl2)
## NOTE: markers within a chromosome must be in ORDER of increasing position

marker,chr,pos
M1_chr1_10Mb,1,0.0
M2_chr1_25Mb,1,18.4
M3_chr2_5Mb,2,0.0
M4_chr2_18Mb,2,12.1
M5_chr3_8Mb,3,0.0

## If you have SNP positions in bp but no genetic map:
## Use a physical map (bp or Mbp) and set pos_type="Mbp" in R/qtl2
## OR estimate genetic positions from recombination frequencies between markers
## (1 cM ≈ 1 Mbp in many plants, but varies widely across the genome)

## For SNP arrays (e.g. Illumina), genetic positions are usually provided
## in the annotation file that comes with the array.
## For GBS/whole-genome sequencing, use physical positions in Mbp.

Excel: Understanding QTL Logic Without Code

Before using R/qtl2, understanding QTL analysis in Excel builds intuition for what the software is actually doing mathematically. A QTL test at a single marker is just a one-way ANOVA comparing phenotype means between genotype classes.

Excel walkthrough — single-marker QTL test by hand
## ================================================================
## EXCEL WALKTHROUGH: Single-marker QTL test for one marker
## ================================================================
## We want to test: "Does genotype at marker M2 affect days_to_flowering?"
##
## STEP 1: Organize your data in Excel
##
## Column A: Individual ID
## Column B: Marker M2 genotype (AA or BB for RIL; AA/AB/BB for F2)
## Column C: Phenotype (days_to_flowering)
##
##   A           B         C
##   RIL_001     AA        62.4
##   RIL_002     BB        71.0
##   RIL_003     AA        58.8
##   RIL_004     BB        79.5
##   RIL_005     AA        60.1
##   ...         ...       ...
##
## STEP 2: Calculate genotype class means
## In a new area of the sheet:
##
##   Genotype   Mean DTF           N
##   AA         =AVERAGEIF(B:B,"AA",C:C)    =COUNTIF(B:B,"AA")
##   BB         =AVERAGEIF(B:B,"BB",C:C)    =COUNTIF(B:B,"BB")
##
## STEP 3: Run a t-test (for 2 classes: AA vs BB in RIL)
## 1. Data tab → Data Analysis → t-Test: Two-Sample Assuming Unequal Variances
## 2. Variable 1 Range: select all phenotype values where genotype = AA
##    (sort by column B first to group AA and BB together)
## 3. Variable 2 Range: select all phenotype values where genotype = BB
## 4. Alpha: 0.05
##
## The t-test p-value is the probability that the observed difference
## in means between AA and BB occurred by chance.
## p < 0.05 = marker is significantly associated with the trait at that marker.
##
## STEP 4: Additive effect = (Mean_BB - Mean_AA) / 2
##   This is the allele substitution effect: how much does replacing
##   one A allele with one B allele change the phenotype (on average)?
##   Positive = BB allele increases phenotype
##   Negative = AA allele increases phenotype
##
## STEP 5: R² (variance explained) = proportion of phenotypic variance
## explained by the marker genotype
##   =RSQ(C:C, IF(B:B="AA",0,1))
##   (enter as array formula: Ctrl+Shift+Enter)
##   OR use CORREL(phenotype, genotype_numeric)^2
##   where genotype_numeric: AA=0, BB=1 (or -1/+1 for centered coding)
##
## STEP 6: LOD score
## LOD = -log10(p_value)
## For p=0.001: LOD = 3.0
## For p=0.0001: LOD = 4.0
## A LOD score of 3.0 is approximately equivalent to p=0.001
## The genome-wide significance threshold is typically LOD=3.0 for plants
## (but must be validated by permutation test — see Step 3 in R/qtl2 section)
##
## =================================================================
## PRACTICAL EXCEL FORMULA FOR LOD SCORE:
## =================================================================
## In Excel, if cell E2 has your t-test p-value:
##   =-LOG10(E2)
## This gives you the LOD score directly from the p-value.
##
## Do this for EVERY marker → plot LOD vs marker position → QTL peaks!
## This is exactly what R/qtl2 does, but for all markers simultaneously
## and with interval mapping (tests BETWEEN markers too).

## ================================================================
## DOWNLOAD TEMPLATE:
## https://github.com/kbroman/qtl2/blob/main/inst/extdata/
## The R/qtl2 package provides example datasets you can open in Excel:
## install.packages("qtl2")
## library(qtl2)
## iron <- read_cross2(system.file("extdata/iron.zip", package="qtl2"))
## To save as CSV for Excel:
## write.csv(iron$pheno, "iron_phenotypes.csv")
## write.csv(as.data.frame(iron$gmap), "iron_genetic_map.csv")
## ================================================================

Try the live interactive spreadsheet below — it runs the full Excel LOD formula in your browser. Edit any yellow cell and all results update instantly:

📊
The Excel approach reveals the math: R/qtl2’s interval mapping tests “virtual markers” between each pair of real markers using the probability that the true QTL genotype at a given position is AA or BB, given the flanking observed marker genotypes. This is just a weighted regression — but doing it in Excel first for a single marker makes the LOD score and additive effect completely transparent before you run 5,000 marker regressions in R.

Step 1 — F2 Population Setup in R/qtl2

R — Load F2 cross data, check quality, visualise genotype frequencies
######################################################################
## STEP 1: Load and inspect a F2 mapping population in R/qtl2
## Dataset: The "hyper" dataset from R/qtl2 — F2 cross in mice
## studying blood pressure (hypertension QTL). 250 F2 individuals,
## 174 markers across 19 chromosomes.
## This is the classic QTL textbook dataset from Karl Broman.
######################################################################

## Install packages
install.packages(c("qtl2", "ggplot2", "tidyverse"))
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
## No Bioconductor packages required for basic R/qtl2

library(qtl2)
library(ggplot2)
library(tidyverse)

## ================================================================
## LOAD THE CROSS OBJECT
## ================================================================
## R/qtl2 uses a "cross2" object that contains:
##   geno  : list of genotype matrices (one per chromosome)
##   pheno : phenotype data frame
##   gmap  : genetic map (cM positions)
##   cross_info : population type information
##
## For the classic "hyper" dataset (F2 backcross in mice):
## Download from R/qtl2 example data:
hyper <- read_cross2(system.file("extdata/hyper.zip", package="qtl2"))
## Or for the iron dataset (F2 intercross, more complex):
## iron <- read_cross2(system.file("extdata/iron.zip", package="qtl2"))

## Quick summary of the cross object
print(hyper)
## Output: F2 cross with 250 individuals, 174 markers on 19 chromosomes, 1 phenotype

## ================================================================
## UNDERSTAND CROSS TYPES IN R/qtl2
## ================================================================
## cross_type matters for genotype probability calculations:
##   "f2"     : F2 intercross — 3 classes (AA, AB, BB)
##   "riself" : RIL by selfing — 2 classes (AA, BB)
##   "risib"  : RIL by sibling mating
##   "bc"     : Backcross — 2 classes (AA, AB)
##   "dh"     : Doubled haploid — 2 classes
##   "bcsft"  : BC + selfing; "ail" : Advanced intercross lines
print(hyper$crosstype)   ## "f2"

## ================================================================
## QUALITY CHECK: Genotype frequency plots
## ================================================================
## Expected genotype frequencies for F2 intercross:
##   AA : 0.25, AB : 0.50, BB : 0.25 (Mendelian segregation)
## Deviations = segregation distortion (selection, transmission ratio distortion)
##
## expected_geno_freqs: check if genotype frequencies match Mendelian expectations
gfreq <- geno_freq(hyper)  ## returns list (one matrix per chr) of genotype frequencies

## Convert to data frame for ggplot
gfreq_df <- do.call(rbind, lapply(names(gfreq), function(chr) {
    mat <- gfreq[[chr]]
    data.frame(chr=chr, marker=rownames(mat), as.data.frame(mat))
}))
colnames(gfreq_df)[3:5] <- c("AA", "AB", "BB")

ggplot(gfreq_df %>% pivot_longer(c(AA,AB,BB), names_to="geno", values_to="freq"),
       aes(marker, freq, color=geno, group=geno)) +
    geom_line() +
    geom_hline(yintercept=c(0.25, 0.50), linetype=2, color="grey50") +
    facet_wrap(~chr, scales="free_x", nrow=4) +
    labs(title="Genotype frequencies per marker (F2, expected: AA=BB=0.25, AB=0.5)",
         x=NULL, y="Frequency") +
    theme_bw(base_size=10) +
    theme(axis.text.x=element_blank())
ggsave("genotype_frequencies.png", width=14, height=8, dpi=120)

## ================================================================
## PHENOTYPE DISTRIBUTION
## ================================================================
pheno_df <- as.data.frame(hyper$pheno)
ggplot(pheno_df, aes(x=bp)) +
    geom_histogram(bins=30, fill="steelblue", color="white") +
    labs(title="Phenotype distribution: blood pressure (hyper F2 cross)",
         x="Blood pressure (mmHg)", y="Count") +
    theme_bw(base_size=13)
ggsave("phenotype_distribution.png", width=7, height=4, dpi=150)
Download:

Step 2 — Interval Mapping & LOD Score Profiles

R — Genotype probabilities and interval mapping with R/qtl2
######################################################################
## STEP 2: Calculate genotype probabilities and run interval mapping
## Interval mapping tests "virtual markers" BETWEEN real markers.
## At each genomic position, R/qtl2 calculates P(genotype | flanking markers)
## using the genetic map and a hidden Markov model.
## Then it fits a regression: phenotype ~ P(AA) + P(AB) [for F2]
## The fit quality (compared to null model) becomes the LOD score.
##
## HIGHER LOD = marker/interval explains MORE variance = stronger QTL evidence
######################################################################

library(qtl2)

## ================================================================
## STEP 2a: Calculate genotype probabilities
## ================================================================
## calc_genoprob: uses genetic map distances to calculate Pr(true genotype | observed markers)
## for each individual at each position in the genome.
## error_prob = 0.002 : allow for 0.2% genotyping error (typical for SNP arrays)
## map_function = "kosambi": Kosambi mapping function (standard for most organisms)
##   Alternative: "haldane" (no interference; used for some animals)
##   Alternative: "morgan" (linear; rarely used)
## step = 1 : evaluate genotype probabilities every 1 cM across the genome
##   step=0 tests only at observed marker positions (faster, lower resolution)
##   step=1 adds virtual positions between markers (recommended for QTL intervals)
gprob <- calc_genoprob(
    cross      = hyper,
    map        = hyper$gmap,
    error_prob = 0.002,
    map_function = "kosambi",
    step       = 1          # test every 1 cM between markers
)

## ================================================================
## STEP 2b: Run interval mapping (genome scan)
## ================================================================
## scan1: runs a genome scan using genotype probabilities
## model = "normal" : linear regression (for normally distributed phenotypes)
## model = "binary" : logistic regression (for presence/absence traits)
## For covariates (sex, treatment, batch): use addcovar= argument

## Without covariates:
lod <- scan1(genoprobs=gprob, pheno=hyper$pheno[,"bp", drop=FALSE])

## With additive covariate (sex):
## sex_covar <- as.matrix(hyper$covar[,"sex",drop=FALSE])
## lod_sex <- scan1(genoprobs=gprob, pheno=hyper$pheno, addcovar=sex_covar)

## ================================================================
## STEP 2c: Plot the LOD score profile (QTL map)
## ================================================================
## The LOD profile is the key QTL plot: X axis = chromosome position, Y axis = LOD score
## Peaks above the significance threshold = QTL locations

## Plot with R/qtl2 built-in function:
png("qtl_lod_profile.png", width=1400, height=600, res=120)
plot_scan1(lod, map=hyper$gmap,
           main="QTL scan — blood pressure (hyper F2 cross)",
           ylim=c(0,8))
abline(h=3.0, col="red", lty=2)    # approximate LOD threshold (3.0 = p≈0.05 genome-wide)
dev.off()

## ================================================================
## STEP 2d: Identify significant QTL peaks
## ================================================================
## find_peaks: extracts QTL peak locations above a LOD threshold
## threshold = 3.0 (genome-wide LOD 3.0 is traditional; use permutation for rigor)
## peakdrop = 1.5 : a peak must drop at least 1.5 LOD below adjacent peaks to count separately
## expand2markers = TRUE : expand interval to nearest typed markers
qtl_peaks <- find_peaks(
    lod,
    map       = hyper$gmap,
    threshold = 3.0,
    peakdrop  = 1.5,
    expand2markers = FALSE
)
print(qtl_peaks)
## Output: chr, pos (cM), lod, trait name
## Expected: major QTL on chr 4 (LOD~4.5) and chr 1 for blood pressure

## ================================================================
## STEP 2e: Additive and dominance effects at the major QTL
## ================================================================
## est_effects: estimates additive (a) and dominance (d) effects at the QTL peak
## For F2: a = (mean_BB - mean_AA)/2 ; d = mean_AB - (mean_AA + mean_BB)/2
## For RIL: a = (mean_BB - mean_AA)/2 ; d should be near 0 (lines are homozygous)
major_qtl_chr <- qtl_peaks$chr[which.max(qtl_peaks$lod)]
major_qtl_pos <- qtl_peaks$pos[which.max(qtl_peaks$lod)]

effects <- est_effects(gprob[[major_qtl_chr]], pheno=hyper$pheno[,"bp", drop=FALSE])
cat("\n=== Allele effects at major QTL (chr", major_qtl_chr, ") ===\n")
print(effects$coef["a",])    # additive effect
print(effects$coef["d",])    # dominance effect
Download:
LOD score decoded
📊What is LOD?LOD (Logarithm of Odds) = log10(likelihood of QTL model / likelihood of no-QTL model). LOD 3.0 means the QTL model is 103 = 1000× more likely than no QTL. LOD 4.0 = 10,000× more likely. It is NOT a p-value, but LOD ≥3 roughly corresponds to p<0.05 after genome-wide correction for most populations. Use permutation (Step 3) for the exact threshold for your data.
🎯Additive vs dominant effectAdditive effect (a): each substitution of one B allele for one A allele changes the phenotype by a units. Completely additive: heterozygote (AB) = midpoint between AA and BB. Dominance (d): deviation of the heterozygote from the midpoint. d=+a means complete dominance of B; d=0 means perfect additivity. Most QTL for complex traits are partially additive.

Step 3 — LOD Threshold by Permutation Test

R — Permutation test for genome-wide LOD significance threshold
######################################################################
## STEP 3: Permutation test for LOD significance threshold
## The LOD threshold (e.g. LOD=3.0) is an APPROXIMATION.
## The correct threshold for your specific population must be estimated
## by permutation: shuffle phenotype labels, run genome scan, record max LOD.
## After 1000 permutations, the 95th percentile of max LOD values = threshold.
##
## This accounts for:
##   - Number of markers (more markers = stricter threshold)
##   - Genetic map length (longer map = stricter threshold)
##   - Population size (smaller N = more variable = sometimes stricter)
##   - Distribution of your specific phenotype
######################################################################

## --- Run permutation test ---
## n_perm = 1000 : standard number; 10000 for publication
## cores = 4 : use parallel processing (crucial — 1000 perms takes hours on 1 core)
## This may take 30-60 minutes depending on population size and n_perm
set.seed(42)
perms <- scan1perm(
    genoprobs = gprob,
    pheno     = hyper$pheno[,"bp", drop=FALSE],
    n_perm    = 1000,
    cores     = 4      # use 4 parallel cores
)

## --- Extract significance thresholds ---
## alpha = 0.05: genome-wide 5% significance threshold
## alpha = 0.20: suggestive threshold (commonly reported alongside significant)
thresholds <- summary(perms, alpha=c(0.05, 0.10, 0.20))
print(thresholds)
## Example output:
##       LOD (bp)
## 5%    3.46
## 10%   3.12
## 20%   2.78
cat("Genome-wide LOD threshold at α=0.05:", thresholds[1,"bp"], "\n")

## --- Save permutation results ---
saveRDS(perms, "permutation_results.rds")

## --- Final QTL peaks at permutation-corrected threshold ---
lod_threshold <- thresholds[1,"bp"]    # 5% threshold

qtl_significant <- find_peaks(
    lod,
    map       = hyper$gmap,
    threshold = lod_threshold,
    peakdrop  = 1.5
)
cat("\n=== Significant QTL at LOD >=", round(lod_threshold, 2), "===\n")
print(qtl_significant)

## --- Confidence intervals for each QTL ---
## 1.5 LOD support interval: the genomic region where LOD drops by at most 1.5 units from peak
## This is the standard confidence interval for QTL location in R/qtl
## Bayesian credible interval is an alternative (use bayes_ci in R/qtl2)
for (i in seq_len(nrow(qtl_significant))) {
    chr_i <- qtl_significant$chr[i]
    ci <- lod_int(lod, map=hyper$gmap, chr=chr_i, drop=1.5)
    cat("QTL on chr", chr_i, ": peak at", qtl_significant$pos[i], "cM,",
        "1.5-LOD interval:", ci[1,"lower"], "–", ci[1,"upper"], "cM\n")
}

## --- Re-plot with permutation threshold line ---
png("qtl_lod_profile_final.png", width=1400, height=600, res=120)
plot_scan1(lod, map=hyper$gmap,
           main=paste0("QTL scan — LOD threshold at α=0.05 (", round(lod_threshold,2), ")"))
abline(h=lod_threshold, col="red", lty=2, lwd=2)
legend("topright", lty=2, col="red", lwd=2,
       legend=paste0("α=0.05 threshold (LOD=", round(lod_threshold,2), ")"))
dev.off()
Download:
🤔
Why is my permutation threshold different from LOD=3? LOD=3 was derived theoretically for a population of ~200 individuals with a typical plant genome (~1500 cM). Your permutation threshold may differ if: you have many more/fewer individuals, a larger/smaller genome, or a very non-normal phenotype distribution. For mice (~1400 cM total): expect ~3.4–3.6. For plants with large genomes (>3000 cM): expect ~3.8–4.2. Always report your actual permutation threshold, not LOD=3.

Step 4 — RIL Population & Composite Interval Mapping (CIM)

R — RIL population setup, composite interval mapping, multiple QTL model
######################################################################
## STEP 4: RIL population and Composite Interval Mapping (CIM)
## CIM improves QTL resolution and power by including background markers
## as cofactors in the regression model.
## This controls for the effects of other QTL throughout the genome,
## reducing the "ghost QTL" problem where a QTL appears between two
## linked QTL even if no QTL is present there.
##
## R/qtl2 does not have a built-in CIM function.
## We use the qtl package (R/qtl, the predecessor) for CIM,
## then switch back to qtl2 for the final multiple-QTL model.
######################################################################

## R/qtl (older package) has CIM and stepwiseqtl:
## install.packages("qtl")
library(qtl)
library(qtl2)
library(ggplot2)

## ================================================================
## Load a RIL dataset (sorghum heading date, from Mace et al.)
## This is a publicly available RIL QTL dataset in R/qtl format
## ================================================================
## Download the sorghum MAGIC population data or use the built-in listeria example
## (listeria uses F2; we convert it to simulate a RIL for illustration)
data(listeria)    # F2 cross in R/qtl — using for illustration of CIM steps

## Convert to RIL (illustration — normally you'd load real RIL data)
## For real RIL: read.cross("csv", file="genos.csv", genfile="phenos.csv")
## RIL cross type: "riself" in R/qtl2 or "riself" in R/qtl

## ================================================================
## SIMPLE INTERVAL MAPPING (for comparison with CIM)
## ================================================================
listeria <- calc.genoprob(listeria, step=2, error.prob=0.001)
lod_im   <- scanone(listeria, pheno.col="T264", model="normal")

## ================================================================
## COMPOSITE INTERVAL MAPPING (CIM)
## ================================================================
## Cofactor selection: choose background markers to control for
## n.marcovar = 3 : use 3 background markers as cofactors
## Method "hk" : Haley-Knott regression (fast; near-identical to EM algorithm)
## Method "em" : EM algorithm (slower but slightly more accurate)
lod_cim <- cim(listeria,
               pheno.col = "T264",
               n.marcovar = 3,     # number of cofactor markers
               method     = "hk",  # regression method
               window     = 10)    # window size: exclude markers within 10 cM of test position

## ================================================================
## COMPARE IM vs CIM
## ================================================================
## Plot both on the same graph
png("im_vs_cim.png", width=1400, height=600, res=120)
plot(lod_im, lod_cim, col=c("blue","red"),
     main="Interval Mapping vs Composite Interval Mapping")
legend("topright", c("IM","CIM"), col=c("blue","red"), lwd=2)
dev.off()
## CIM peaks are sharper and may reveal new QTL hidden by linkage in IM

## ================================================================
## MULTIPLE QTL MODEL (R/qtl2 approach)
## ================================================================
## After identifying QTL peaks, fit a joint model:
## This properly estimates effects of each QTL while controlling for others

## Using R/qtl2 iron dataset for multiple QTL model (has well-known QTL)
iron <- read_cross2(system.file("extdata/iron.zip", package="qtl2"))
iron_probs <- calc_genoprob(iron, map=iron$gmap, error_prob=0.002)

## Fit a multiple QTL model: sex + QTL on chr 7 + QTL on chr 15
## Additive model (no interactions between QTL):
## sex is a known covariate that must be included
sex_covar <- as.matrix(iron$covar[,"sex",drop=FALSE])

fit_multi <- fit1(
    genoprobs = subset(iron_probs, chr=c("7","15")),   # test two QTL
    pheno     = iron$pheno[,"liver",drop=FALSE],
    addcovar  = sex_covar
)
cat("\n=== Multiple QTL model LOD:", fit_multi$lod, "\n")
cat("R² =", fit_multi$lod_fv1/fit_multi$lod_bv1, "\n")
## R² = proportion of total phenotypic variance explained by the multiple QTL model
Download:

Step 5 — GWAS with GAPIT (Diverse Panel / NAM)

R — GAPIT3 MLM GWAS, population structure control, Manhattan plot
######################################################################
## STEP 5: GWAS with GAPIT3
## GAPIT (Genome Association and Prediction Integrated Tool) implements
## multiple mixed-model GWAS methods:
##   MLM  : Mixed Linear Model (Yu et al. 2006) — uses Q+K structure control
##   BLINK: Bayesian-information and Linkage-disequilibrium Iteratively
##          Nested Keyway — finds causal SNPs without iterative masking bias
##   FarmCPU: Fixed and random model Circulating Probability Unification
##            Uses candidate SNPs as covariates — very powerful for large panels
##
## KEY DIFFERENCE FROM R/qtl2: You MUST control for population structure!
## Without K matrix: systematic false positives from population structure
## With Q+K model: controls both stratification (Q = PCA) and kinship (K matrix)
######################################################################

## Install GAPIT3
if (!requireNamespace("GAPIT3", quietly=TRUE)) {
    install.packages("devtools")
    devtools::install_github("jiabowang/GAPIT3")
}
library(GAPIT3)
library(tidyverse)

## ================================================================
## LOAD DATA (Arabidopsis 1001G example — same as GWAS tutorial)
## ================================================================
## Genotype data: numeric coded (0=AA, 1=AB, 2=BB) or HapMap format
## GAPIT uses HapMap format: rows=markers, columns=individuals
## Download Arabidopsis 1001G data from 1001genomes.org

## For this tutorial we use a small subset already in numeric format
## (download from GAPIT3 example data on GitHub):
## https://github.com/jiabowang/GAPIT3/tree/master/data

## Read genotype (hapmap format)
## myG <- read.delim("ATH_geno.hmp.txt", header=FALSE, check.names=FALSE)
## Read phenotype
## myY <- read.csv("ATH_pheno.csv", header=TRUE, check.names=FALSE)

## For illustration, use GAPIT3 built-in demo data:
data("mdp_numeric")   # numeric genotype matrix
data("mdp_traits")    # phenotype data

## ================================================================
## RUN GAPIT3 MLM
## ================================================================
## GAPIT model selection guide:
##   MLM   : classic, widely used, conservative (few false positives)
##   BLINK : recommended for large panels (>500 accessions); faster
##   FarmCPU: similar to BLINK; best for polygenic traits
## Run ALL three and compare — consistent hits across models = high confidence
setwd("gwas_output/")
dir.create("gwas_output/", showWarnings=FALSE)

myGAPIT_MLM <- GAPIT(
    Y            = myY[,1:2],       # first 2 columns: ID + first trait
    G            = myG,             # genotype in HapMap format
    model        = "MLM",           # mixed linear model
    PCA.total    = 5,               # use top 5 PCs to control population stratification
    ## PCA.total: how many PCs to include as fixed-effect covariates
    ## Too few: inflation due to uncontrolled structure
    ## Too many: overfit; lose power
    ## Use 3-10 for most populations; check Q-Q plot to select
    kinship.algorithm = "VanRaden", # kinship matrix method (standard)
    SNP.fraction = 1,               # use all SNPs for kinship matrix (set to 0.5 for large datasets)
    file.output  = TRUE             ## saves all plots and tables to files
)

## ================================================================
## INTERPRET GAPIT OUTPUT
## ================================================================
## GAPIT produces automatically:
##   GAPIT.Association.GWAS_Results.MLM.csv  : full results table
##   GAPIT.Association.Manhattan.MLM.png     : Manhattan plot
##   GAPIT.Association.QQ.MLM.png            : Q-Q plot (check for inflation)
##   GAPIT.PCA.csv                           : principal components

## Read results
results <- read.csv("gwas_output/GAPIT.Association.GWAS_Results.MLM.csv")
colnames(results)
## Key columns: SNP, Chr, Pos, P.value, maf, FDR_Adj_P.values (Bonferroni or BH)

## Bonferroni correction for genome-wide significance:
## p_threshold = 0.05 / number_of_tests
p_bonf <- 0.05 / nrow(results)
cat("Bonferroni threshold: p <", formatC(p_bonf, format="e", digits=2), "\n")

## Count significant hits
sig_hits <- results %>% filter(P.value < p_bonf)
cat("Genome-wide significant SNPs:", nrow(sig_hits), "\n")
cat("Significant loci (approx, removing proximal duplicates):\n")
print(sig_hits %>% select(SNP, Chr, Pos, P.value, maf) %>% head(20))

## ================================================================
## Q-Q PLOT INTERPRETATION
## ================================================================
## The Q-Q plot compares observed -log10(p) values to expected under the null.
## Points above the diagonal = true association signals
## ALL points inflated (λ > 1.1): model is NOT controlling structure correctly
##   → Add more PCs or use a different kinship algorithm
## λ (genomic inflation factor):
lambda <- median(qchisq(1 - results$P.value, 1), na.rm=TRUE) / qchisq(0.5, 1)
cat("Genomic inflation factor λ =", round(lambda, 3), "\n")
## λ < 1.05: well-controlled. λ > 1.1: structure inflation.
Download:

Step 6 — Multi-Environment QTL Analysis

R — QTL × Environment interaction using multi-environment scan
######################################################################
## STEP 6: Multi-environment QTL — test for QTL × environment interaction
## When a trait is measured in multiple environments (locations, years),
## you want to know:
##   1. Which QTL are stable across environments? (most useful for MAS)
##   2. Which QTL show genotype × environment (G×E) interaction?
##      (QTL effect changes sign or magnitude across environments)
##
## Method: include environment as an interactive covariate in scan1
## The interaction term tests: does the QTL effect differ between environments?
######################################################################

library(qtl2)

## Assume pheno has columns: trait_env1, trait_env2, trait_env3
## (same trait measured in 3 environments)
## Build environment covariate matrix:
n_ind <- 200
env_labels <- rep(1:3, each=n_ind/3)
int_covar  <- cbind(env=env_labels)   # numeric environment codes

## -- Scan with environment interaction --
## intcovar: QTL × environment interaction term
## This tests whether the QTL effect VARIES across environments
## lod_interaction = LOD(full: QTL + QTL:env) - LOD(additive: QTL + env)
## A significant interaction LOD = QTL effect depends on environment

## For real data: run with your actual multi-environment phenotype data
## lod_multi <- scan1(genoprobs = gprob,
##                    pheno = pheno_matrix,   ## columns: trait_env1, trait_env2, trait_env3
##                    addcovar = env_covar,   ## main effect of environment (additive)
##                    intcovar = env_covar)   ## QTL×environment interaction

## --- Alternative: calculate BLUPs per QTL genotype class across environments ---
## A cleaner approach for presentation: extract mean phenotype by (QTL genotype × environment)
## and visualise as a crossover interaction plot
## (example with simulated data below):
set.seed(42)
n_ind <- 150
sim_data <- data.frame(
    individual = 1:n_ind,
    qtl_geno   = sample(c("AA","BB"), n_ind, replace=TRUE),
    env        = sample(c("E1","E2","E3"), n_ind, replace=TRUE),
    phenotype  = 50 + rnorm(n_ind, 0, 8)
)
## Add QTL effect that reverses in E3 (classic crossover G×E)
sim_data$phenotype[sim_data$qtl_geno=="BB" & sim_data$env=="E1"] <- 
    sim_data$phenotype[sim_data$qtl_geno=="BB" & sim_data$env=="E1"] + 8
sim_data$phenotype[sim_data$qtl_geno=="BB" & sim_data$env=="E2"] <- 
    sim_data$phenotype[sim_data$qtl_geno=="BB" & sim_data$env=="E2"] + 4
sim_data$phenotype[sim_data$qtl_geno=="BB" & sim_data$env=="E3"] <- 
    sim_data$phenotype[sim_data$qtl_geno=="BB" & sim_data$env=="E3"] - 5

means <- sim_data %>%
    group_by(qtl_geno, env) %>%
    summarise(mean_pheno = mean(phenotype), se=sd(phenotype)/sqrt(n()), .groups="drop")

library(ggplot2)
ggplot(means, aes(x=env, y=mean_pheno, color=qtl_geno, group=qtl_geno)) +
    geom_line(linewidth=1.2) +
    geom_point(size=3) +
    geom_errorbar(aes(ymin=mean_pheno-se, ymax=mean_pheno+se), width=0.1) +
    labs(title="QTL × Environment interaction (crossover in E3)",
         subtitle="BB allele increases phenotype in E1, decreases in E3 — unstable QTL",
         x="Environment", y="Mean phenotype", color="QTL genotype") +
    theme_bw(base_size=13)
ggsave("qtl_by_environment.png", width=8, height=5, dpi=150)
Download:

Step 7 — Marker-Assisted Selection (MAS) Summary

R — Extract flanking markers for MAS, calculate expected breeding values
######################################################################
## STEP 7: Marker-Assisted Selection — practical output from QTL analysis
## MAS uses the QTL markers as proxies for selecting favourable alleles
## without phenotyping every plant. The key outputs breeders need:
##   1. Best flanking markers for each QTL (most tightly linked)
##   2. Favourable allele at each QTL (which parent contributes the benefit?)
##   3. Additive effect size (how much yield/quality improvement per QTL?)
##   4. Breeding value prediction (expected performance of crosses)
######################################################################

library(qtl2)
library(tidyverse)

## Reload results from previous steps
hyper      <- read_cross2(system.file("extdata/hyper.zip", package="qtl2"))
gprob      <- calc_genoprob(hyper, map=hyper$gmap, error_prob=0.002, step=1)
lod        <- scan1(genoprobs=gprob, pheno=hyper$pheno[,"bp",drop=FALSE])

## --- Step 7a: Extract QTL peak positions ---
qtl_peaks <- find_peaks(lod, map=hyper$gmap, threshold=3.0, peakdrop=1.5)
cat("QTL summary for MAS:\n")
print(qtl_peaks)

## --- Step 7b: Find nearest typed markers to each QTL peak ---
## These are the markers you would use for MAS genotyping in the field
find_nearest_marker <- function(cross, chr, pos_cM) {
    gmap_chr <- cross$gmap[[chr]]
    nearest  <- names(which.min(abs(gmap_chr - pos_cM)))
    return(nearest)
}

mas_markers <- qtl_peaks %>%
    rowwise() %>%
    mutate(
        mas_marker   = find_nearest_marker(hyper, chr, pos),
        marker_pos   = hyper$gmap[[chr]][mas_marker]
    ) %>%
    ungroup()
cat("\n=== MAS Marker Recommendations ===\n")
print(mas_markers %>% select(chr, pos, lod, mas_marker, marker_pos))

## --- Step 7c: Favourable allele identification ---
## For each QTL, determine which parental allele is beneficial:
## Positive additive effect = BB allele increases phenotype (use B parent as donor)
## Negative additive effect = AA allele increases phenotype (use A parent as donor)

## Calculate means by genotype at each MAS marker
for (i in seq_len(nrow(mas_markers))) {
    chr_i <- mas_markers$chr[i]
    marker_i <- mas_markers$mas_marker[i]
    
    ## Get genotype at MAS marker for each individual
    geno_marker <- maxmarg(gprob[[chr_i]])[, marker_i]  ## most likely genotype class
    
    ## Calculate mean phenotype per genotype class
    means_by_geno <- tapply(hyper$pheno[,"bp"], geno_marker, mean, na.rm=TRUE)
    
    cat("\nQTL on chr", chr_i, "(MAS marker:", marker_i, ")\n")
    cat("  Mean phenotype by genotype:\n")
    for (geno in names(means_by_geno)) {
        cat("   Genotype class", geno, ":", round(means_by_geno[geno], 2), "\n")
    }
    
    ## Determine direction: which allele is beneficial
    if (length(means_by_geno) == 2) {
        diff <- means_by_geno[2] - means_by_geno[1]
        if (diff > 0) {
            cat("  --> B allele INCREASES phenotype by", round(diff, 2), "\n")
        } else {
            cat("  --> A allele INCREASES phenotype by", round(-diff, 2), "\n")
        }
    }
}

## --- Export MAS guide table ---
write.csv(mas_markers, "MAS_marker_guide.csv", row.names=FALSE)
cat("\nMAS marker guide saved: MAS_marker_guide.csv\n")
cat("Use these markers for genotyping selection in breeding program.\n")
Download: