Linkage Map Construction

What Is a Genetic Linkage Map?

A genetic linkage map is a diagram of markers arranged in order along each chromosome, with distances measured in centiMorgans (cM) — a unit proportional to the probability of recombination between two points. Unlike a physical map (which measures base pairs), a genetic map measures biological crossing-over frequency.

Why build a linkage map?

QTL mapping: a linkage map is the coordinate system for all QTL analyses. Markers without known positions cannot be used for interval mapping.
Marker-assisted selection: knowing the cM distance between a marker and a gene determines how often they will be separated by recombination in breeding.
Genome assembly: linkage maps are used to order and orient scaffolds into pseudochromosomes.
Comparative genomics: genetic distances reveal recombination hotspots and coldspots invisible from physical sequence.

Term	Definition	Units
Recombination frequency (r)	Proportion of offspring with a recombinant genotype between two markers	Fraction 0–0.5
LOD score	Logarithm of odds that two markers are linked vs. unlinked (r=0.5)	Dimensionless; threshold ≥ 3.0
centiMorgan (cM)	Genetic distance unit: 1 cM = 1% chance of recombination between two markers per meiosis. Calculated from r via a mapping function.	cM (or Morgans, M = cM/100)
Mapping function	Formula that converts r to cM, correcting for double crossovers (Haldane, Kosambi)	—
Linkage group	Set of markers in which every marker shows pairwise LOD ≥ 3 with at least one other marker in the group; ideally one linkage group corresponds to one chromosome	—

Step 1 — Recombination Frequency

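The starting point for any linkage analysis is estimating the recombination frequency (r) between every pair of markers. For a RIL population, this amounts to counting genotype combinations in the offspring. For an F2, heterozygous genotypes make some recombinants ambiguous, so software estimates r by maximum likelihood instead. Strictly, the proportion counted in RILs reflects recombination accumulated over several generations of inbreeding, and mapping software converts it to a per-meiosis r; the simple count is used here to keep the arithmetic transparent.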

Formula: Recombination Frequency

r = (N_recombinant) / (N_total)

N_recombinant: Number of individuals with a recombinant genotype between the two markers. In a RIL context, where parent 1 = AA and parent 2 = BB, a recombinant is any individual that has different parental alleles at the two markers (AB or BA).
N_total: Total number of individuals (excluding those with missing data at either marker)
r range: 0 = completely linked (no recombination detected); 0.5 = unlinked (independent assortment, different chromosomes)
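
The counting rule above can be sketched in a few lines of R. This is a minimal illustration, not part of any mapping package: the helper name `recomb_freq` and the ten-line toy genotypes are invented for the example, and genotypes are assumed to be coded "A"/"B" with NA for missing calls.

```r
# Hypothetical helper: recombination frequency between two RIL markers.
# m1, m2: character vectors of genotype codes ("A", "B", NA = missing).
recomb_freq <- function(m1, m2) {
  keep <- !is.na(m1) & !is.na(m2)          # drop individuals missing at either marker
  n_total <- sum(keep)
  n_recomb <- sum(m1[keep] != m2[keep])    # different parental alleles = recombinant
  list(n_recombinant = n_recomb, n_total = n_total, r = n_recomb / n_total)
}

# Toy data (invented): 10 lines, one missing call at marker 2
marker1 <- c("A", "A", "B", "B", "A", "B", "A", "B", "A", "B")
marker2 <- c("A", "A", "B", "A", "A", "B", NA,  "B", "B", "B")

res <- recomb_freq(marker1, marker2)
print(res)
```

With these toy vectors, nine individuals are informative and two of them are recombinant, so r = 2/9 ≈ 0.22.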

Why can't r exceed 0.5? A crossover involves only two of the four chromatids present at meiosis, so a single crossover between two markers makes half of the resulting gametes recombinant and leaves the other half parental. When several crossovers occur between two distant markers, an even number of exchanges on a chromatid restores the parental combination and an odd number produces a recombinant, so the recombinant fraction levels off at r=0.5 instead of continuing to rise. This is the upper limit of recombination frequency: markers with r=0.5 behave as if they are unlinked, even if they are on the same chromosome. This is why distant markers must be bridged by intervening markers when building linkage maps.
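
The plateau at 0.5 can be seen numerically. The sketch below assumes crossovers occur independently (the no-interference model behind the Haldane function in Step 3) and uses its inverse, r = 0.5 × (1 − e^(−2d/100)) with d in cM. The function name `expected_r` and the chosen distances are illustrative only.

```r
# Expected recombination frequency for a map distance d (in cM),
# assuming independent crossovers (inverse of the Haldane mapping function).
expected_r <- function(d_cM) 0.5 * (1 - exp(-2 * d_cM / 100))

d <- c(1, 10, 50, 100, 200, 400)
saturation <- data.frame(distance_cM = d, r = round(expected_r(d), 4))
print(saturation)
```

At short distances r tracks the distance closely (1 cM gives r ≈ 0.01), but it flattens as distance grows and never reaches 0.5 for the distances shown.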

Step 2 — LOD Score for Linkage

The LOD score tests whether two markers are significantly linked (r < 0.5) versus independent (r = 0.5). It is the log₁₀ ratio of two likelihoods: the probability of observing the data if the markers are linked at frequency r, versus the probability if they are unlinked.

Formula: LOD Score for Linkage (RIL population)

LOD = log₁₀[ L(r) / L(r=0.5) ]

LOD = N_NR × log₁₀(1−r) + N_R × log₁₀(r) − N_total × log₁₀(0.5)

N_NR: Number of non-recombinant individuals (parental genotype classes: AA/AA + BB/BB)
N_R: Number of recombinant individuals (AA/BB + BB/AA)
N_total: Total individuals = N_NR + N_R
r: Estimated recombination frequency = N_R / N_total
LOD ≥ 3: Conventional threshold for declaring significant linkage (corresponds to ~1000:1 odds in favour of linkage). Some software uses LOD ≥ 2.5 for dense marker sets.
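
As a worked instance of the formula above, the following R sketch computes the LOD score from the two counts. The function name `lod_linkage` and the counts (90 non-recombinant, 10 recombinant) are made up for illustration. Two assumptions are built in and marked in the comments: estimates of r above 0.5 are capped at 0.5, and 0 × log₁₀(0) is taken as 0 so that complete linkage gives a finite score.

```r
# Hypothetical helper: two-point LOD score from recombinant / non-recombinant counts.
lod_linkage <- function(n_nr, n_r) {
  n_total <- n_nr + n_r
  r <- min(n_r / n_total, 0.5)   # assumption: estimates above 0.5 are treated as unlinked
  # By convention 0 * log10(0) = 0, so n_r = 0 (complete linkage) stays finite
  term_r <- if (n_r == 0) 0 else n_r * log10(r)
  n_nr * log10(1 - r) + term_r - n_total * log10(0.5)
}

lod_example  <- lod_linkage(n_nr = 90, n_r = 10)   # r = 0.10
lod_unlinked <- lod_linkage(n_nr = 50, n_r = 50)   # r = 0.50
lod_complete <- lod_linkage(n_nr = 100, n_r = 0)   # r = 0
print(round(c(lod_example, lod_unlinked, lod_complete), 2))
```

With 100 individuals, r = 0.10 gives a LOD of about 16, r = 0.5 gives exactly 0, and complete linkage gives 100 × log₁₀(2) ≈ 30.1, the maximum attainable for this sample size.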

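LOD score is NOT a p-value. LOD = 3.0 means the observed data are 10³ = 1000 times more probable under the hypothesis that the two markers are linked at recombination frequency r than under the null hypothesis that they are unlinked (r=0.5). To convert a LOD to an approximate p-value, use the likelihood-ratio statistic: 2 × ln(10) × LOD ≈ 4.6 × LOD follows approximately a chi-square distribution with 1 degree of freedom under the null, and the p-value is halved because only r < 0.5 counts as evidence for linkage. For LOD=3 the statistic is ≈ 13.8 and p ≈ 0.0001. Such a stringent pointwise threshold is needed because most marker pairs in a genome are unlinked and many pairs are tested; LOD=3 is the conventional level for keeping false declarations of linkage to roughly 5%.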
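
For readers who want a pointwise p-value, one common large-sample approximation treats 2 × ln(10) × LOD as a chi-square statistic with 1 degree of freedom and halves the tail probability because the test is one-sided (r < 0.5). The sketch below applies that approximation; the function name `lod_to_p` is hypothetical, and the result is a rough guide rather than an exact probability.

```r
# Approximate one-sided p-value for a two-point LOD score
# (large-sample chi-square approximation, 1 degree of freedom).
lod_to_p <- function(lod) {
  chisq <- 2 * log(10) * lod                       # likelihood-ratio statistic
  0.5 * pchisq(chisq, df = 1, lower.tail = FALSE)  # halve: one-sided test
}

lods <- c(1, 2, 3, 4)
p_table <- data.frame(LOD = lods, approx_p = signif(lod_to_p(lods), 2))
print(p_table)
```

Under this approximation a LOD of 3 corresponds to a pointwise p-value of about 1 in 10,000.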

Step 3 — Converting r to cM: Mapping Functions

Recombination frequency r is not linearly proportional to map distance because double crossovers between two markers cancel each other out and go undetected. Two mapping functions correct for this:

Haldane (1919): assumes crossovers occur independently (no interference). Gives longer distances than Kosambi for the same r, and overestimates map distance when interference is present; the gap widens as r increases. Used for organisms where interference is not established.
Kosambi (1944): assumes positive interference (one crossover reduces the probability of a nearby second crossover). More realistic for most eukaryotes. The default in JoinMap and most plant mapping software.

Mapping Functions: r to cM

Haldane: d = −50 × ln(1 − 2r) [cM]

Kosambi: d = 25 × ln((1+2r)/(1−2r)) [cM]

r: Recombination frequency (0 to 0.5)
d: Genetic distance in centiMorgans (cM)
ln: Natural logarithm (LN in Excel, log in R)
Relationship: 1 cM ≈ 1% recombination only when r is very small (<0.05). At r=0.1, Kosambi gives d=10.1 cM; Haldane gives d=11.2 cM. At r=0.3, Kosambi=34.7 cM; Haldane=45.8 cM, so the two functions diverge substantially at large distances.
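
The two formulas translate directly into R. The sketch below is a plain transcription of the Haldane and Kosambi equations above (the function names `haldane_cM` and `kosambi_cM` are chosen for this example), evaluated over a range of r so the divergence at larger distances is visible.

```r
# Mapping functions: recombination frequency r (0 <= r < 0.5) to distance in cM.
haldane_cM <- function(r) -50 * log(1 - 2 * r)                   # no interference
kosambi_cM <- function(r) 25 * log((1 + 2 * r) / (1 - 2 * r))    # positive interference

r <- c(0.01, 0.05, 0.10, 0.20, 0.30, 0.40)
map_table <- data.frame(
  r       = r,
  haldane = round(haldane_cM(r), 1),
  kosambi = round(kosambi_cM(r), 1)
)
print(map_table)
```

Note that `log()` in R is the natural logarithm. Both functions go to infinity as r approaches 0.5, which is why marker pairs with r near 0.5 carry almost no distance information.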

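Map function key insight: Map distance is additive by definition, so d_AC = d_AB + d_BC when B lies between A and C. Recombination frequencies are not additive, because double crossovers go undetected. A mapping function recovers additive distances only to the extent that its interference assumption holds: Haldane distances add up exactly when there is no interference, Kosambi distances when interference follows Kosambi's model. Kosambi is usually preferred because positive interference is the norm in most eukaryotes, so distances computed for adjacent marker intervals and summed give a more realistic total chromosome length. Haldane distances applied where interference is positive tend to inflate long intervals.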
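
A small numerical check makes the additivity point concrete. The sketch below takes three markers in the order A, B, C with invented recombination frequencies, derives the expected r between A and C under each model (r_AC = r_AB + r_BC − 2·r_AB·r_BC without interference; r_AC = (r_AB + r_BC) / (1 + 4·r_AB·r_BC) under Kosambi's model), and confirms that the matching mapping function gives distances that sum exactly, while the raw r values do not.

```r
haldane_cM <- function(r) -50 * log(1 - 2 * r)
kosambi_cM <- function(r) 25 * log((1 + 2 * r) / (1 - 2 * r))

# Invented example: marker order A - B - C
r_ab <- 0.10
r_bc <- 0.15

# Expected r between the flanking markers under each interference model
r_ac_haldane <- r_ab + r_bc - 2 * r_ab * r_bc          # no interference
r_ac_kosambi <- (r_ab + r_bc) / (1 + 4 * r_ab * r_bc)  # Kosambi interference

additivity <- rbind(
  raw_r   = c(sum_of_intervals = r_ab + r_bc, direct = r_ac_haldane),
  haldane = c(haldane_cM(r_ab) + haldane_cM(r_bc), haldane_cM(r_ac_haldane)),
  kosambi = c(kosambi_cM(r_ab) + kosambi_cM(r_bc), kosambi_cM(r_ac_kosambi))
)
print(round(additivity, 3))
```

In the printed table the two columns agree for the Haldane and Kosambi rows but not for the raw recombination frequencies, where the sum of the intervals (0.25) exceeds the directly expected value (0.22).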

Step 4 — Linkage Group Assignment

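Before ordering markers, you must assign them to linkage groups (ideally one per chromosome). Two markers are treated as linked if their LOD score meets the threshold (typically LOD ≥ 3) and their recombination frequency is below a maximum (typically r ≤ 0.35–0.40). Grouping is transitive: a marker joins a group as soon as it is linked to at least one marker already in it, so markers at opposite ends of a chromosome end up in the same group through the markers between them. The threshold combination LOD=3 / r=0.35 is the JoinMap default.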
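
The grouping rule can be sketched as a search for connected components: markers are nodes, and an edge joins any pair that passes both thresholds. The code below is a minimal base-R illustration under that reading, not the procedure of any particular package. The function name `assign_groups`, the helper `set_pair`, and the five-marker LOD and r matrices are all invented for the example.

```r
# Hypothetical helper: single-linkage grouping from pairwise LOD and r matrices.
assign_groups <- function(lod, rf, min_lod = 3, max_rf = 0.35) {
  n <- nrow(lod)
  linked <- lod >= min_lod & rf <= max_rf   # pairs passing both thresholds
  diag(linked) <- FALSE
  group <- rep(NA_integer_, n)
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(group[start])) next          # already placed in a group
    current <- current + 1L
    group[start] <- current
    queue <- start
    while (length(queue) > 0) {             # breadth-first walk over linked markers
      m <- queue[1]
      queue <- queue[-1]
      nb <- which(linked[m, ] & is.na(group))
      group[nb] <- current
      queue <- c(queue, nb)
    }
  }
  setNames(group, rownames(lod))
}

# Toy pairwise statistics (invented values)
markers <- paste0("M", 1:5)
lod <- matrix(0,   5, 5, dimnames = list(markers, markers))
rf  <- matrix(0.5, 5, 5, dimnames = list(markers, markers))
set_pair <- function(mat, a, b, value) { mat[a, b] <- value; mat[b, a] <- value; mat }
lod <- set_pair(lod, "M1", "M2", 12);  rf <- set_pair(rf, "M1", "M2", 0.08)
lod <- set_pair(lod, "M2", "M3", 9);   rf <- set_pair(rf, "M2", "M3", 0.12)
lod <- set_pair(lod, "M1", "M3", 2.5); rf <- set_pair(rf, "M1", "M3", 0.30)  # fails LOD >= 3
lod <- set_pair(lod, "M4", "M5", 15);  rf <- set_pair(rf, "M4", "M5", 0.05)

groups <- assign_groups(lod, rf)
print(groups)
```

M1 and M3 do not pass the threshold as a pair, yet they land in the same group because both are linked to M2. M4 and M5 form a second group.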

Step 5 — Marker Ordering: Sum of Adjacent Recombination Frequencies

Once markers are assigned to linkage groups, they must be ordered along the chromosome. The simplest criterion is to choose the order of markers that minimises the sum of adjacent recombination frequencies (SARF) or, closely related, the total map length in cM.

Objective: Minimise Total Map Length

Total length = ∑ d(M_i, M_i+1) for all adjacent pairs

d(M_i, M_i+1): Kosambi distance between consecutive markers i and i+1
Optimal order: The permutation of markers that gives the smallest total length. For small numbers of markers (<12), all permutations can be tried. For larger sets, heuristic algorithms (ripple, simulated annealing) are used by JoinMap and CarthaGene.
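
For a handful of markers the exhaustive search is easy to write out. The sketch below scores every permutation by its sum of adjacent recombination frequencies and returns the best one. It is only practical for small groups, and everything in it is illustrative: the helper names (`all_perms`, `sarf`, `best_order`) are hypothetical, and the toy recombination matrix is generated from invented positions assuming no interference.

```r
# All permutations of a vector (recursive; fine for small n only)
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in all_perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Sum of adjacent recombination frequencies for a given marker order
sarf <- function(ord, rf) sum(rf[cbind(ord[-length(ord)], ord[-1])])

best_order <- function(rf) {
  perms <- all_perms(seq_len(nrow(rf)))
  # An order and its reverse describe the same map, so keep one of each pair
  perms <- Filter(function(p) p[1] < p[length(p)], perms)
  scores <- vapply(perms, sarf, numeric(1), rf = rf)
  rownames(rf)[perms[[which.min(scores)]]]
}

# Toy data (invented): true positions in cM, listed in scrambled order
pos <- c(M3 = 25, M1 = 0, M4 = 45, M2 = 10)
rf <- 0.5 * (1 - exp(-2 * abs(outer(pos, pos, "-")) / 100))
dimnames(rf) <- list(names(pos), names(pos))

ordered_markers <- best_order(rf)
print(ordered_markers)
```

The search recovers the order M1, M2, M3, M4 from the scrambled input. The number of permutations grows factorially, which is why this approach stops being feasible beyond roughly a dozen markers.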

The marker ordering problem is NP-hard! For n markers there are n!/2 possible orders. With 20 markers that is about 1.2 × 10¹⁸ possibilities, far too many to enumerate. JoinMap uses a nearest-neighbour heuristic seeded by the best 3-marker triple, then refines it with a ripple algorithm that swaps adjacent markers and keeps improvements. CarthaGene uses simulated annealing and branch-and-bound. For >200 markers per group, the ordering is always heuristic and may not be globally optimal. This is why replicate mapping populations are valuable: consistent ordering across independent maps validates the result.
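
The idea of "swap adjacent markers and keep improvements" can be shown with a deliberately simplified sketch. This is not JoinMap's implementation: the function `ripple_adjacent`, the SARF scoring, and the four-marker toy matrix (generated from invented positions assuming no interference) are all assumptions made for illustration. The second call shows why such local searches can stop at an order that is not the best one.

```r
# Toy recombination matrix (invented): markers at 0, 10, 25 and 45 cM
pos <- c(M1 = 0, M2 = 10, M3 = 25, M4 = 45)
rf <- 0.5 * (1 - exp(-2 * abs(outer(pos, pos, "-")) / 100))

# Sum of adjacent recombination frequencies for a marker order
sarf <- function(ord, rf) sum(rf[cbind(ord[-length(ord)], ord[-1])])

# Simplified ripple: try swapping each adjacent pair, keep any swap that
# lowers SARF, and repeat until a full pass brings no improvement.
ripple_adjacent <- function(ord, rf) {
  improved <- TRUE
  while (improved) {
    improved <- FALSE
    for (i in seq_len(length(ord) - 1)) {
      cand <- ord
      cand[c(i, i + 1)] <- cand[c(i + 1, i)]
      if (sarf(cand, rf) < sarf(ord, rf)) {
        ord <- cand
        improved <- TRUE
      }
    }
  }
  ord
}

good_start  <- ripple_adjacent(c(2, 1, 3, 4), rf)  # one swap away from the true order
stuck_start <- ripple_adjacent(c(1, 3, 2, 4), rf)  # ends in a local optimum

print(names(pos)[good_start])
print(names(pos)[stuck_start])
print(round(c(sarf(good_start, rf), sarf(stuck_start, rf)), 3))
```

The first start converges on the true order. The second accepts two swaps that each lower SARF and then stalls at M3, M2, M1, M4, whose score is higher than the optimum. Real implementations reduce this risk with larger rearrangement windows, multiple starting orders, or stochastic search.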

Step 6 — R/qtl + ASMap Pipeline

R — Build linkage map from raw genotype matrix using R/qtl + ASMap

######################################################################
## STEP 6: R pipeline for linkage map construction
## Tools: R/qtl (read cross, check data) + ASMap (fast MSTmap algorithm)
## ASMap (https://cran.r-project.org/package=ASMap) wraps the MSTmap
## algorithm — the fastest and most accurate marker ordering method
## for large SNP datasets (handles thousands of markers).
##
## Install: install.packages(c("qtl", "ASMap"))
######################################################################

library(qtl)
library(ASMap)
library(ggplot2)

## ================================================================
## FORMAT: R/qtl CSV format
## ================================================================
## Row 1: marker names (columns) and phenotype column names
## Row 2: chromosome assignment (use NA until map is built)
## Row 3: position in cM (use NA until map is built)
## Rows 4+: individual ID + genotype codes
##
## Genotype codes for RIL:
##   A = homozygous parent 1 (AA)
##   B = homozygous parent 2 (BB)
##   - = missing data
##
## Genotype codes for F2:
##   A = AA, H = AB (heterozygous), B = BB, - = missing

## Example: read a RIL cross from CSV
## cross <- read.cross("csv",
##     file = "myRIL_genotypes.csv",
##     genfile = NULL,
##     na.strings = c("-", "NA"),
##     genotypes = c("A","H","B"))

## Example data for illustration: mapDH, the doubled-haploid example cross
## bundled with ASMap. It is stored under the object name `soybean` that the
## rest of this script uses; substitute your own cross from read.cross().
data(mapDH, package = "ASMap")
soybean <- mapDH
print(soybean)
## If your own data are an F2 (genotypes include the H class), convert the
## cross to the "bcsft" class before running ASMap, for example:
## soybean <- convert2bcsft(soybean, BC.gen=0, F.gen=2)

## ================================================================
## STEP 6a: Data quality checks before mapping
## ================================================================
## 1. Check for genotyping error: markers with > 20% missing data
##    are unreliable and should be removed.
## 2. Check for duplicate individuals (identical genotype vectors)
## 3. Check for segregation distortion (marker deviated from expected
##    50:50 or 25:50:25 ratio — may indicate selection)

## Per-marker genotyping rate (% of individuals with a genotype call)
geno_freq <- 100 * ntyped(soybean, "mar") / nind(soybean)
low_geno <- names(geno_freq[geno_freq < 80])     ## < 80% genotyped = problematic
cat("Markers with >20% missing data:", length(low_geno), "\n")
if (length(low_geno) > 0) print(low_geno)

## Remove problematic markers
if (length(low_geno) > 0) {
    soybean <- drop.markers(soybean, low_geno)
    cat("Cleaned cross:", totmar(soybean), "markers remaining\n")
}

## Check for duplicate markers (same genotype vector = remove one)
## dup_marks <- findDupMarkers(soybean, exact.only=TRUE)
## soybean   <- drop.markers(soybean, unlist(dup_marks))

## Segregation distortion chi-sq test
gt  <- geno.table(soybean)
seg_dist <- gt[gt$P.value < 0.001,]     ## markers distorted at p < 0.001
cat("Markers with segregation distortion (p<0.001):", nrow(seg_dist), "\n")

## ================================================================
## STEP 6b: Linkage group formation using ASMap
## ================================================================
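## mstmap: fast minimum spanning tree marker ordering
## p.value : threshold used when clustering markers into linkage groups;
##           smaller = stricter. ASMap's default is 1e-6; larger values
##           (e.g. 1e-3) are more permissive and can suit smaller populations.
##           LOD equivalents sometimes quoted (1e-6 ~ LOD 6, 1e-3 ~ LOD 3)
##           are rough guides only, not exact conversions.
## dist.fun = "kosambi" : Kosambi mapping function (default; recommended)
## trace = TRUE    : keep detailed progress output from the algorithm
## id : phenotype column holding individual IDs (ASMap default: "Genotype");
##      set it explicitly if your cross uses a different column name

soybean_map <- mstmap(
    soybean,
    bychr       = FALSE,       ## FALSE: let ASMap find linkage groups automatically
    trace       = TRUE,
    p.value     = 1e-4,        ## grouping threshold between the strict and permissive settings
    noMap.dist  = 15,          ## distance (cM) beyond which isolated markers are split off
    noMap.size  = 0,           ## maximum size of such isolated groups (0 = option not used)
    anchor      = FALSE,
    detectBadData = TRUE,      ## flag likely genotyping errors automatically
    dist.fun    = "kosambi"    ## mapping function
)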

## ================================================================
## STEP 6c: Examine the genetic map
## ================================================================
## summary: number of chromosomes, markers per LG, total length
summary(soybean_map)

## Total map length
total_len <- sum(chrlen(soybean_map))
cat("Total map length:", round(total_len,1), "cM\n")
cat("Number of linkage groups:", nchr(soybean_map), "\n")
cat("Markers per linkage group:\n")
print(nmar(soybean_map))

## ================================================================
## STEP 6d: Visualise the linkage map
## ================================================================
## plotMap: standard R/qtl map plot
png("linkage_map.png", width=1200, height=600, res=120)
plotMap(soybean_map,
        main = "Genetic Linkage Map",
        show.marker.names = FALSE)    ## set TRUE if few markers
dev.off()

## ================================================================
## STEP 6e: Export map positions for downstream QTL analysis
## ================================================================
## Save the genetic map as a CSV
## pull.map() returns a list with one named position vector per linkage group
map_list <- pull.map(soybean_map)
map_positions <- do.call(rbind, lapply(names(map_list), function(chr) {
    positions <- map_list[[chr]]
    data.frame(marker=names(positions), chr=chr,
               pos_cM=round(as.numeric(positions),3), stringsAsFactors=FALSE)
}))
write.csv(map_positions, "genetic_map_positions.csv", row.names=FALSE)
cat("Map saved: genetic_map_positions.csv\n")
head(map_positions, 10)

Step 7 — Map Visualisation in R

R — Publication-quality linkage map with ggplot2

######################################################################
## STEP 7: Publication-quality linkage map visualisation
## Creates a chromosome-stick plot with marker tick marks,
## chromosome lengths, and grouped by linkage group
######################################################################

library(ggplot2)
library(dplyr)

## Load the exported map positions (or use from soybean_map above)
map_df <- read.csv("genetic_map_positions.csv")

## Ensure chromosomes are ordered numerically
## (uses the gtools package: install.packages("gtools"))
map_df$chr <- factor(map_df$chr,
                     levels=gtools::mixedsort(unique(as.character(map_df$chr))))

## Per-chromosome summary for drawing chromosome sticks
chr_summary <- map_df %>%
    group_by(chr) %>%
    summarise(
        total_cM   = max(pos_cM),
        n_markers  = n(),
        .groups = "drop"
    )
cat("Chromosome summary:\n")
print(chr_summary)

## ----------------------------------------------------------------
## Plot: chromosome sticks with marker ticks
## ----------------------------------------------------------------
ggplot() +
    ## Chromosome backbone (thick vertical line)
    geom_segment(data=chr_summary,
                 aes(x=chr, xend=chr, y=0, yend=total_cM),
                 linewidth=3, color="#2c3e50", lineend="round") +
    ## Marker tick marks
    geom_segment(data=map_df,
                 aes(x=as.numeric(chr)-0.18,
                     xend=as.numeric(chr)+0.18,
                     y=pos_cM, yend=pos_cM),
                 linewidth=0.35, color="#e74c3c", alpha=0.8) +
    ## Chromosome labels at top
    geom_text(data=chr_summary,
              aes(x=chr, y=-3, label=paste0("LG", chr)),
              size=3, fontface="bold", vjust=1) +
    ## Chromosome length label at bottom
    geom_text(data=chr_summary,
              aes(x=chr, y=total_cM+3,
                  label=paste0(round(total_cM,0), " cM")),
              size=2.5, color="#7f8c8d", vjust=0) +
    scale_y_reverse(name="Map position (cM)") +
    scale_x_discrete(name=NULL) +
    labs(
        title    = "Genetic Linkage Map",
        subtitle = paste0(nrow(map_df), " markers — ",
                          nlevels(map_df$chr), " linkage groups — ",
                          round(sum(chr_summary$total_cM),0), " cM total")
    ) +
    theme_minimal(base_size=12) +
    theme(
        axis.text.x  = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid   = element_blank(),
        plot.title   = element_text(face="bold")
    )
ggsave("linkage_map_publication.png", width=10, height=7, dpi=200)
cat("Saved: linkage_map_publication.png\n")