Haplotype Phasing

Resolve diploid genotypes into phased haplotypes using SHAPEIT4 (statistical), WhatsHap (read-backed), and long-read methods. Covers haplotype block detection, switch error evaluation, and applications in selection scans and breeding.

~120 min Intermediate SHAPEIT4 / WhatsHap / R
Byte

What is Phasing and Why Does It Matter?

When you genotype a diploid individual, you get unphased genotypes: at a heterozygous site you know the individual has alleles A and B, but not which chromosome each allele is on. Phasing resolves this ambiguity, assigning each allele to its parental haplotype (chromosome copy).

UnphasedPhased (two haplotypes)Biological meaning
SNP1: A/B, SNP2: C/DHap1: A-C   Hap2: B-DAlleles A and C are inherited together on the same chromosome copy
Same genotypesHap1: A-D   Hap2: B-CAlternative phase -- A inherited with D on one chromosome
Byte
Why phasing matters in practice:
  • Selection scans (iHS/XP-EHH): REQUIRE phased haplotypes. Without phasing these statistics cannot be computed.
  • Compound heterozygosity: Two mutations in trans (on different chromosomes) = loss-of-function. Two mutations in cis (same chromosome) = one functional copy remains. Clinically critical distinction.
  • Haplotype-based breeding: Marker-assisted selection and genomic selection work better with haplotype alleles than individual SNP alleles.
  • Imputation reference panels: All major imputation tools (Beagle, IMPUTE5) require a phased reference panel.

Phasing Methods Compared

MethodData neededAccuracyBest use caseTool
Statistical (population)Large cohort of unrelated individuals (N > 100 ideal)High for common variants; degrades for rare variantsPopulation studies, GWAS, imputation panelsSHAPEIT4, BEAGLE5
Read-backedShort-read WGS (same sample BAM); works with N=1Limited to heterozygous sites on the same read pair (~500 bp)Clinical samples, single samples, refine statistical phaseWhatsHap
Long-readPacBio HiFi or Oxford Nanopore reads (>10 kb)Excellent; resolves haplotypes across entire chromosomesDe novo assemblies, structural variants, complex regionsWhatsHap, HapCUT2, HiFiasm
Trio-basedParents + offspring sequencingEssentially perfect where parental genotypes differFamily studies; creates ground-truth phase for benchmarkingWhatsHap --ped
Genetic mapDense reference genetic mapDepends on map quality and LD structureImproves all statistical phasing methods; reduces switch errorsSHAPEIT4 --map

Step 1 — Statistical Phasing with SHAPEIT4

SHAPEIT4 is the current gold standard for statistical phasing of large cohorts. It uses a MCMC algorithm with a Li-Stephens haplotype copying model: each individual's haplotype is modelled as a mosaic of haplotypes from the reference panel (other individuals in the cohort or an external panel).

Li-Stephens Haplotype Copying Model (concept)
P(h | H_ref) = product over sites of P(h_i | h_i copied from some ref haplotype k)
hTarget haplotype being phased (the unknown we are inferring) H_refReference panel -- all other phased haplotypes in the cohort (or an external panel like 1000G) Copying probabilityAt each site, the target haplotype is probabilistically "copied" from the most similar reference haplotype, with occasional copying switches (= recombination events). The model learns which reference haplotypes best explain the observed heterozygous genotype pattern.
Bash -- SHAPEIT4 statistical phasing pipeline
## ================================================================
## SHAPEIT4 statistical phasing
## Prerequisites: SHAPEIT4 v4.2+, bcftools, tabix
## Dataset: Rice 3K chr04 multi-sample VCF
## ================================================================

## Step 0: Pre-flight checks
## SHAPEIT4 requires:
##   1. Multi-sample VCF (biallelic SNPs, genotyped calls -- not missing)
##   2. Genetic map in PLINK format (chr pos_bp cM pos_bp)
##   3. Optionally: a pre-phased reference panel for scaffold phasing

VCF="rice3k_chr04_filtered.vcf.gz"
MAP="chr04_genetic_map_plink.txt"   ## 4-column: chr snpID cM bp
OUTVCF="rice3k_chr04_phased.vcf.gz"

## Step 1: Filter to high-quality biallelic SNPs with <10% missing
bcftools view ${VCF} \
  --min-alleles 2 --max-alleles 2 \
  --type snps \
  --min-ac 1:minor | \
bcftools filter \
  -e 'F_MISSING > 0.1' \
  -o rice3k_chr04_forPhasing.vcf.gz \
  -O z
tabix -p vcf rice3k_chr04_forPhasing.vcf.gz

## Step 2: Run SHAPEIT4
shapeit4.2 \
  --input rice3k_chr04_forPhasing.vcf.gz \
  --map ${MAP} \
  --region chr04 \
  --output ${OUTVCF} \
  --thread 16 \
  --mcmc-iterations 5b,1p,1b,1p,1b,1p,5m \  ## burn-in+pruning+main iterations
  --pbwt-depth 4 \                             ## number of PBWT neighbours to condition on (4-8 typical)
  --pbwt-mac 2 \                               ## min MAC for PBWT conditioning
  --log shapeit4_chr04.log

tabix -p vcf ${OUTVCF}

## Step 3: Verify phasing -- check PS (phase set) tags and switch to | notation
bcftools view -h ${OUTVCF} | grep "FORMAT"
## Phased genotypes are coded as 0|1 or 1|0 (pipe = phased)
## Unphased remaining genotypes are 0/1 (slash = unphased)

## Count phased vs unphased sites
bcftools stats ${OUTVCF} | grep "^PSC"  ## per-sample phasing stats
bcftools view ${OUTVCF} | grep -v "^#" | \
  awk '{for(i=10;i<=NF;i++) if($i~/\|/) phased++; else unphased++}
  END {print "Phased:", phased, "Unphased:", unphased,
       "Rate:", phased/(phased+unphased)}'

## Step 4: Scaffold with reference panel (optional -- improves rare variant phasing)
## Use 1000G phased reference or in-house phased panel as scaffold
shapeit4.2 \
  --input rice3k_chr04_forPhasing.vcf.gz \
  --map ${MAP} \
  --region chr04 \
  --reference 1kg_phased_reference_chr04.vcf.gz \  ## pre-phased reference
  --output rice3k_chr04_phased_scaffolded.vcf.gz \
  --thread 16
Download:

Step 2 — Read-backed Phasing with WhatsHap

WhatsHap phases heterozygous variants that are co-covered by the same sequencing reads. It works on a single individual's BAM + VCF -- no population panel needed. It is ideal for: clinical samples, de novo assemblies, and refining statistical phase in regions of low LD.

WhatsHap Phase Quality (PHRED-scaled)
Phase quality (PQ) = -10 * log10(P_error)
P_errorPosterior probability that the inferred phase assignment is incorrect, given the read evidence PQ = 0No phasing information -- site was not covered by any informative read pair PQ >= 20Phase assignment has <1% error probability -- high confidence PS tagPhase set ID in the VCF FORMAT field -- all sites with the same PS are connected in a haplotype block by read evidence
Bash -- WhatsHap read-backed phasing + polyphase for polyploids
## ================================================================
## WhatsHap read-backed phasing
## Works with short reads, PacBio HiFi, Nanopore, and trio data
## ================================================================

BAM="sample01_sorted_markdup.bam"
VCF="sample01_genotyped.vcf.gz"
REF="IRGSP1.0_genome.fasta"
OUT="sample01_phased.vcf.gz"

## Step 1: Basic read-backed phasing (single individual)
whatshap phase \
  --output ${OUT} \
  --reference ${REF} \
  --indels \            ## also phase insertions/deletions
  --distrust-genotypes \ ## re-genotype using read evidence (improves accuracy)
  ${VCF} ${BAM}
bgzip ${OUT}; tabix -p vcf ${OUT}.gz

## Step 2: Phasing statistics
whatshap stats \
  --gtf phase_blocks.gtf \   ## optional: output phase blocks as GTF for IGV
  ${OUT}.gz

## Step 3: Trio phasing (parents remove switch errors completely)
## Requires: proband VCF, father BAM, mother BAM
whatshap phase \
  --ped pedigree.ped \       ## pedigree file: child father mother sex
  --output trio_phased.vcf.gz \
  --reference ${REF} \
  child.vcf.gz child.bam father.bam mother.bam

## Step 4: Haplotag BAM (tag each read with its inferred haplotype HP:i:1 or HP:i:2)
## Required for haplotype-specific methylation, allele-specific expression
whatshap haplotag \
  --output sample01_haplotagged.bam \
  --reference ${REF} \
  ${OUT}.gz ${BAM}
samtools index sample01_haplotagged.bam
## Now reads carry HP:1 or HP:2 tags -- usable in methylation and ASE analyses

## Step 5: Combine statistical + read-backed phasing
## Best practice: run SHAPEIT4 first (population-level), then refine with WhatsHap
## This gives chromosome-scale blocks with read-level accuracy within blocks
shapeit4.2 --input sample01.vcf.gz --map chr04.map \
  --region chr04 --output sample01_stat_phased.vcf.gz --thread 8

whatshap phase \
  --reference ${REF} \
  --chromosome chr04 \
  --output sample01_combined_phased.vcf.gz \
  sample01_stat_phased.vcf.gz ${BAM}  ## WhatsHap refines the SHAPEIT4 phase

## Step 6: Long-read phasing with HiFi (PacBio)
## No genetic map needed; reads are long enough to bridge most het sites
whatshap phase \
  --output sample01_hifi_phased.vcf.gz \
  --reference ${REF} \
  --indels \
  sample01.vcf.gz sample01_hifi.bam   ## HiFi BAM replaces short-read BAM
Download:

Step 3 — Long-read and Hi-C Phasing

TechnologyRead lengthPhase block N50Pipeline
PacBio HiFi (CCS)10-25 kbChromosome-scale (>10 Mb typical)HiFiasm, WhatsHap, DeepVariant+WhatsHap
Oxford Nanopore (ultra-long)Up to 4 MbChromosome-scalePEPPER-Margin-DeepVariant, Clair3+WhatsHap
Hi-C proximity ligationShort reads (150 bp) but captures long-range contactsChromosome-scaleHapCUT2+HiCPhase, 3D-DNA
10x Genomics Linked ReadsShort reads from the same DNA molecule (60 kb)1-4 MbLongRanger, STEP
Bash -- PacBio HiFi complete assembly phasing with HiFiasm
## ================================================================
## PacBio HiFi chromosome-scale phasing
## HiFiasm produces a phased assembly directly from HiFi reads
## ================================================================

HIFI_READS="sample01_hifi.fastq.gz"
HIC_R1="sample01_hic_R1.fastq.gz"
HIC_R2="sample01_hic_R2.fastq.gz"

## Option A: HiFi-only assembly (haplotype-resolved if heterozygosity >0.5%)
hifiasm -o sample01 -t 32 ${HIFI_READS}
## Outputs: sample01.hic.hap1.p_ctg.gfa, sample01.hic.hap2.p_ctg.gfa

## Option B: Hi-C integrated phasing (chromosome-scale haplotypes)
hifiasm -o sample01_hic -t 32 \
  --h1 ${HIC_R1} \
  --h2 ${HIC_R2} \
  ${HIFI_READS}
## Convert GFA to FASTA
awk '/^S/{print ">"$2"\n"$3}' sample01_hic.hic.hap1.p_ctg.gfa > hap1.fasta
awk '/^S/{print ">"$2"\n"$3}' sample01_hic.hic.hap2.p_ctg.gfa > hap2.fasta

## Evaluate phasing quality: QUAST + BUSCO per haplotype
quast.py hap1.fasta hap2.fasta \
  --reference reference.fasta \
  --threads 8 \
  --output-dir quast_haplotypes
busco -i hap1.fasta -o busco_hap1 -l embryophyta_odb10 -m genome -c 16
busco -i hap2.fasta -o busco_hap2 -l embryophyta_odb10 -m genome -c 16

## Map HiFi reads back to determine switch errors vs reference
## Use yak to calculate QV and switch error rate
yak count -b37 -t16 -o pat.yak pat_reads.fastq.gz  ## paternal k-mers
yak count -b37 -t16 -o mat.yak mat_reads.fastq.gz  ## maternal k-mers
yak trioeval pat.yak mat.yak hap1.fasta hap2.fasta
Download:

Step 4 — Haplotype Block Detection

Haplotype blocks (also called LD blocks) are chromosomal regions where recombination is suppressed and alleles are inherited as a unit. Detecting them helps identify haplotype-tag SNPs, understand population history, and design marker panels.

R -- Haplotype block detection with LDBlockShow and ggplot2
## ================================================================
## Haplotype block detection using PLINK and LDBlockShow
## ================================================================
library(dplyr); library(ggplot2)

## Step 1: Compute pairwise LD with PLINK2
## plink2 --vcf phased.vcf.gz --r2-phased --ld-window-kb 500 \
##   --ld-window-r2 0.05 --out chr04_ld_phased --allow-extra-chr --double-id

## Step 2: Define blocks using Gabriel 2002 method in PLINK1.9
## plink --vcf phased.vcf.gz --blocks no-pheno-req --blocks-max-kb 500 \
##   --out chr04_blocks --allow-extra-chr --double-id
## Output: chr04_blocks.blocks.det -- block start, end, n_SNPs, kb

## Step 3: Visualise block structure in R
blocks <- read.table("chr04_blocks.blocks.det", header=TRUE)
blocks$size_kb <- (blocks$BP2 - blocks$BP1) / 1000

## Block size distribution
ggplot(blocks, aes(size_kb)) +
  geom_histogram(bins=60, fill="#2171b5", colour="white", linewidth=0.2) +
  scale_x_log10() +
  labs(x="Block size (kb, log scale)", y="Count",
       title=paste0("Haplotype block size distribution (n=", nrow(blocks), " blocks)")) +
  theme_bw(base_size=12)
ggsave("block_size_distribution.png", width=7, height=5, dpi=200)

cat("Block summary:\n")
cat("  N blocks:", nrow(blocks), "\n")
cat("  Median size (kb):", round(median(blocks$size_kb),1), "\n")
cat("  Max size (kb):", round(max(blocks$size_kb),1), "\n")
cat("  Total covered (Mb):", round(sum(blocks$size_kb)/1000, 1), "\n")

## Step 4: LD heatmap for a candidate region (e.g. Fst peak)
## Use LDheatmap package for focused visualisation
## library(LDheatmap); library(genetics)
## genos <- read.table("candidate_region_genotypes.txt", header=TRUE)
## ldh <- LDheatmap(genos, distances=positions, color=heat.colors(20))
Download:

Step 5 — Phase Quality Evaluation (Switch Error Rate)

Phase switch errors occur when the inferred haplotype assignment switches between the two true haplotypes. Switch error rate (SER) is the key quality metric -- measured by comparing against a known-truth phase (trio, long reads, or a population reference).

Evaluation methodGold standardToolExpected SER
Trio comparisonTrio-phased VCF (parent-offspring)WhatsHap compareSHAPEIT4 on N=1000: ~0.3-0.5% SER
Long-read comparisonHiFi-phased VCF same individualWhatsHap compareSHAPEIT4 on large cohort: ~0.1% SER
Cross-validationLeave-one-out from panelSHAPEIT4 --checkPopulation size matters most
Bash -- Phase switch error evaluation with WhatsHap compare
## Evaluate statistical phase against trio ground truth
whatshap compare \
  --sample SAMPLE01 \
  --names shapeit,trio \
  --tsv-pairwise switch_errors.tsv \
  shapeit4_phased.vcf.gz trio_phased_truth.vcf.gz

## Key output metrics:
## - switch_rate: fraction of consecutive het sites with wrong phase order
## - hamming_rate: fraction of all het sites on the wrong haplotype
## - block_n50: N50 of correct-phase runs

## Parse results
awk -F'\t' 'NR>1 {
  print "Switch error rate:", $6
  print "Hamming error rate:", $7
  print "Phase block N50:", $8
}' switch_errors.tsv
Download:

Step 6 — Applications of Phased Data

ApplicationWhy phasing helpsDownstream tool
Selection scans (iHS, XP-EHH)These statistics are computed from haplotype frequencies -- require phased inputselscan, rehh
ImputationImputation accuracy depends on phased reference panel qualityBeagle5, IMPUTE5, Minimac4
Genomic selectionHaplotype-based prediction outperforms single-SNP models in some scenariosBGLR (haplotype block effects)
Allele-specific expressionAssign RNA-seq reads to parental haplotypes; requires haplotagged BAM from WhatsHapASEReadCounter, WASP
Compound het detectionDistinguish trans (LOF) from cis (one functional copy) mutations in clinical geneticsPharmCAT, manual VCF inspection
Haplotype-based GWASTest haplotype alleles instead of individual SNPs -- more power in low-diversity regionsPLINK --hap, GEMMA

Common Pitfalls

ProblemSymptomFix
Small cohort = high switch errorSHAPEIT4 SER > 2% with N < 50Add a larger reference panel (1000G, HRC) as a phasing scaffold. Use --reference flag in SHAPEIT4. Or use read-backed phasing (WhatsHap) which does not require a large cohort.
Missing genotypes fragment phase blocksMany short phase blocks; low N50Pre-impute missing genotypes with Beagle --impute before phasing with SHAPEIT4. Or filter to <5% missingness per site.
Inbred samples confuse statistical phasingAll sites appear homozygous; near-zero heterozygosityStatistical phasing is nearly irrelevant for highly inbred lines (RILs, inbred strains) where most sites are already homozygous. Focus on the few heterozygous regions using read-backed phasing.
Phasing across centromeres unreliablePhase blocks never span centromereThis is expected and correct -- centromeres suppress recombination so phase connections are rare. Split analyses by chromosome arm if needed.
WhatsHap PQ = 0 for most sitesVery few reads bridge heterozygous pairsRead depth too low (<10x) or insert size too short. Increase to 30x+ for short reads. For sparse variants, long reads are required.