Haplotype Phasing: SHAPEIT4, WhatsHap and Haplotype Blocks

What is Phasing and Why Does It Matter?

When you genotype a diploid individual, you get unphased genotypes: at a heterozygous site you know the individual has alleles A and B, but not which chromosome each allele is on. Phasing resolves this ambiguity, assigning each allele to its parental haplotype (chromosome copy).

Unphased	Phased (two haplotypes)	Biological meaning
SNP1: A/B, SNP2: C/D	Hap1: A-C Hap2: B-D	Alleles A and C are inherited together on the same chromosome copy
Same genotypes	Hap1: A-D Hap2: B-C	Alternative phase -- A inherited with D on one chromosome

Why phasing matters in practice:

Selection scans (iHS/XP-EHH): REQUIRE phased haplotypes. Without phasing these statistics cannot be computed.
Compound heterozygosity: Two mutations in trans (on different chromosomes) = loss-of-function. Two mutations in cis (same chromosome) = one functional copy remains. Clinically critical distinction.
Haplotype-based breeding: Marker-assisted selection and genomic selection work better with haplotype alleles than individual SNP alleles.
Imputation reference panels: All major imputation tools (Beagle, IMPUTE5) require a phased reference panel.

Phasing Methods Compared

Method	Data needed	Accuracy	Best use case	Tool
Statistical (population)	Large cohort of unrelated individuals (N > 100 ideal)	High for common variants; degrades for rare variants	Population studies, GWAS, imputation panels	SHAPEIT4, BEAGLE5
Read-backed	Short-read WGS (same sample BAM); works with N=1	Limited to heterozygous sites on the same read pair (~500 bp)	Clinical samples, single samples, refine statistical phase	WhatsHap
Long-read	PacBio HiFi or Oxford Nanopore reads (>10 kb)	Excellent; resolves haplotypes across entire chromosomes	De novo assemblies, structural variants, complex regions	WhatsHap, HapCUT2, HiFiasm
Trio-based	Parents + offspring sequencing	Essentially perfect where parental genotypes differ	Family studies; creates ground-truth phase for benchmarking	WhatsHap --ped
Genetic map	Dense reference genetic map	Depends on map quality and LD structure	Improves all statistical phasing methods; reduces switch errors	SHAPEIT4 --map

Step 1 — Statistical Phasing with SHAPEIT4

SHAPEIT4 is the current gold standard for statistical phasing of large cohorts. It uses a MCMC algorithm with a Li-Stephens haplotype copying model: each individual's haplotype is modelled as a mosaic of haplotypes from the reference panel (other individuals in the cohort or an external panel).

Li-Stephens Haplotype Copying Model (concept)

P(h | H_ref) = product over sites of P(h_i | h_i copied from some ref haplotype k)

hTarget haplotype being phased (the unknown we are inferring) H_refReference panel -- all other phased haplotypes in the cohort (or an external panel like 1000G) Copying probabilityAt each site, the target haplotype is probabilistically "copied" from the most similar reference haplotype, with occasional copying switches (= recombination events). The model learns which reference haplotypes best explain the observed heterozygous genotype pattern.

Bash -- SHAPEIT4 statistical phasing pipeline

## ================================================================
## SHAPEIT4 statistical phasing
## Prerequisites: SHAPEIT4 v4.2+, bcftools, tabix
## Dataset: Rice 3K chr04 multi-sample VCF
## ================================================================

## Step 0: Pre-flight checks
## SHAPEIT4 requires:
##   1. Multi-sample VCF (biallelic SNPs, genotyped calls -- not missing)
##   2. Genetic map in PLINK format (chr pos_bp cM pos_bp)
##   3. Optionally: a pre-phased reference panel for scaffold phasing

VCF="rice3k_chr04_filtered.vcf.gz"
MAP="chr04_genetic_map_plink.txt"   ## 4-column: chr snpID cM bp
OUTVCF="rice3k_chr04_phased.vcf.gz"

## Step 1: Filter to high-quality biallelic SNPs with <10% missing
bcftools view ${VCF} \
  --min-alleles 2 --max-alleles 2 \
  --type snps \
  --min-ac 1:minor | \
bcftools filter \
  -e 'F_MISSING > 0.1' \
  -o rice3k_chr04_forPhasing.vcf.gz \
  -O z
tabix -p vcf rice3k_chr04_forPhasing.vcf.gz

## Step 2: Run SHAPEIT4
shapeit4.2 \
  --input rice3k_chr04_forPhasing.vcf.gz \
  --map ${MAP} \
  --region chr04 \
  --output ${OUTVCF} \
  --thread 16 \
  --mcmc-iterations 5b,1p,1b,1p,1b,1p,5m \  ## burn-in+pruning+main iterations
  --pbwt-depth 4 \                             ## number of PBWT neighbours to condition on (4-8 typical)
  --pbwt-mac 2 \                               ## min MAC for PBWT conditioning
  --log shapeit4_chr04.log

tabix -p vcf ${OUTVCF}

## Step 3: Verify phasing -- check PS (phase set) tags and switch to | notation
bcftools view -h ${OUTVCF} | grep "FORMAT"
## Phased genotypes are coded as 0|1 or 1|0 (pipe = phased)
## Unphased remaining genotypes are 0/1 (slash = unphased)

## Count phased vs unphased sites
bcftools stats ${OUTVCF} | grep "^PSC"  ## per-sample phasing stats
bcftools view ${OUTVCF} | grep -v "^#" | \
  awk '{for(i=10;i<=NF;i++) if($i~/\|/) phased++; else unphased++}
  END {print "Phased:", phased, "Unphased:", unphased,
       "Rate:", phased/(phased+unphased)}'

## Step 4: Scaffold with reference panel (optional -- improves rare variant phasing)
## Use 1000G phased reference or in-house phased panel as scaffold
shapeit4.2 \
  --input rice3k_chr04_forPhasing.vcf.gz \
  --map ${MAP} \
  --region chr04 \
  --reference 1kg_phased_reference_chr04.vcf.gz \  ## pre-phased reference
  --output rice3k_chr04_phased_scaffolded.vcf.gz \
  --thread 16

Download:

Step 2 — Read-backed Phasing with WhatsHap

WhatsHap phases heterozygous variants that are co-covered by the same sequencing reads. It works on a single individual's BAM + VCF -- no population panel needed. It is ideal for: clinical samples, de novo assemblies, and refining statistical phase in regions of low LD.

WhatsHap Phase Quality (PHRED-scaled)

Phase quality (PQ) = -10 * log10(P_error)

P_errorPosterior probability that the inferred phase assignment is incorrect, given the read evidence PQ = 0No phasing information -- site was not covered by any informative read pair PQ >= 20Phase assignment has <1% error probability -- high confidence PS tagPhase set ID in the VCF FORMAT field -- all sites with the same PS are connected in a haplotype block by read evidence

Bash -- WhatsHap read-backed phasing + polyphase for polyploids

## ================================================================
## WhatsHap read-backed phasing
## Works with short reads, PacBio HiFi, Nanopore, and trio data
## ================================================================

BAM="sample01_sorted_markdup.bam"
VCF="sample01_genotyped.vcf.gz"
REF="IRGSP1.0_genome.fasta"
OUT="sample01_phased.vcf.gz"

## Step 1: Basic read-backed phasing (single individual)
whatshap phase \
  --output ${OUT} \
  --reference ${REF} \
  --indels \            ## also phase insertions/deletions
  --distrust-genotypes \ ## re-genotype using read evidence (improves accuracy)
  ${VCF} ${BAM}
bgzip ${OUT}; tabix -p vcf ${OUT}.gz

## Step 2: Phasing statistics
whatshap stats \
  --gtf phase_blocks.gtf \   ## optional: output phase blocks as GTF for IGV
  ${OUT}.gz

## Step 3: Trio phasing (parents remove switch errors completely)
## Requires: proband VCF, father BAM, mother BAM
whatshap phase \
  --ped pedigree.ped \       ## pedigree file: child father mother sex
  --output trio_phased.vcf.gz \
  --reference ${REF} \
  child.vcf.gz child.bam father.bam mother.bam

## Step 4: Haplotag BAM (tag each read with its inferred haplotype HP:i:1 or HP:i:2)
## Required for haplotype-specific methylation, allele-specific expression
whatshap haplotag \
  --output sample01_haplotagged.bam \
  --reference ${REF} \
  ${OUT}.gz ${BAM}
samtools index sample01_haplotagged.bam
## Now reads carry HP:1 or HP:2 tags -- usable in methylation and ASE analyses

## Step 5: Combine statistical + read-backed phasing
## Best practice: run SHAPEIT4 first (population-level), then refine with WhatsHap
## This gives chromosome-scale blocks with read-level accuracy within blocks
shapeit4.2 --input sample01.vcf.gz --map chr04.map \
  --region chr04 --output sample01_stat_phased.vcf.gz --thread 8

whatshap phase \
  --reference ${REF} \
  --chromosome chr04 \
  --output sample01_combined_phased.vcf.gz \
  sample01_stat_phased.vcf.gz ${BAM}  ## WhatsHap refines the SHAPEIT4 phase

## Step 6: Long-read phasing with HiFi (PacBio)
## No genetic map needed; reads are long enough to bridge most het sites
whatshap phase \
  --output sample01_hifi_phased.vcf.gz \
  --reference ${REF} \
  --indels \
  sample01.vcf.gz sample01_hifi.bam   ## HiFi BAM replaces short-read BAM

Download:

Step 3 — Long-read and Hi-C Phasing

Technology	Read length	Phase block N50	Pipeline
PacBio HiFi (CCS)	10-25 kb	Chromosome-scale (>10 Mb typical)	HiFiasm, WhatsHap, DeepVariant+WhatsHap
Oxford Nanopore (ultra-long)	Up to 4 Mb	Chromosome-scale	PEPPER-Margin-DeepVariant, Clair3+WhatsHap
Hi-C proximity ligation	Short reads (150 bp) but captures long-range contacts	Chromosome-scale	HapCUT2+HiCPhase, 3D-DNA
10x Genomics Linked Reads	Short reads from the same DNA molecule (60 kb)	1-4 Mb	LongRanger, STEP

Bash -- PacBio HiFi complete assembly phasing with HiFiasm

## ================================================================
## PacBio HiFi chromosome-scale phasing
## HiFiasm produces a phased assembly directly from HiFi reads
## ================================================================

HIFI_READS="sample01_hifi.fastq.gz"
HIC_R1="sample01_hic_R1.fastq.gz"
HIC_R2="sample01_hic_R2.fastq.gz"

## Option A: HiFi-only assembly (haplotype-resolved if heterozygosity >0.5%)
hifiasm -o sample01 -t 32 ${HIFI_READS}
## Outputs: sample01.hic.hap1.p_ctg.gfa, sample01.hic.hap2.p_ctg.gfa

## Option B: Hi-C integrated phasing (chromosome-scale haplotypes)
hifiasm -o sample01_hic -t 32 \
  --h1 ${HIC_R1} \
  --h2 ${HIC_R2} \
  ${HIFI_READS}
## Convert GFA to FASTA
awk '/^S/{print ">"$2"\n"$3}' sample01_hic.hic.hap1.p_ctg.gfa > hap1.fasta
awk '/^S/{print ">"$2"\n"$3}' sample01_hic.hic.hap2.p_ctg.gfa > hap2.fasta

## Evaluate phasing quality: QUAST + BUSCO per haplotype
quast.py hap1.fasta hap2.fasta \
  --reference reference.fasta \
  --threads 8 \
  --output-dir quast_haplotypes
busco -i hap1.fasta -o busco_hap1 -l embryophyta_odb10 -m genome -c 16
busco -i hap2.fasta -o busco_hap2 -l embryophyta_odb10 -m genome -c 16

## Map HiFi reads back to determine switch errors vs reference
## Use yak to calculate QV and switch error rate
yak count -b37 -t16 -o pat.yak pat_reads.fastq.gz  ## paternal k-mers
yak count -b37 -t16 -o mat.yak mat_reads.fastq.gz  ## maternal k-mers
yak trioeval pat.yak mat.yak hap1.fasta hap2.fasta

Download:

Step 4 — Haplotype Block Detection

Haplotype blocks (also called LD blocks) are chromosomal regions where recombination is suppressed and alleles are inherited as a unit. Detecting them helps identify haplotype-tag SNPs, understand population history, and design marker panels.

R -- Haplotype block detection with LDBlockShow and ggplot2

## ================================================================
## Haplotype block detection using PLINK and LDBlockShow
## ================================================================
library(dplyr); library(ggplot2)

## Step 1: Compute pairwise LD with PLINK2
## plink2 --vcf phased.vcf.gz --r2-phased --ld-window-kb 500 \
##   --ld-window-r2 0.05 --out chr04_ld_phased --allow-extra-chr --double-id

## Step 2: Define blocks using Gabriel 2002 method in PLINK1.9
## plink --vcf phased.vcf.gz --blocks no-pheno-req --blocks-max-kb 500 \
##   --out chr04_blocks --allow-extra-chr --double-id
## Output: chr04_blocks.blocks.det -- block start, end, n_SNPs, kb

## Step 3: Visualise block structure in R
blocks <- read.table("chr04_blocks.blocks.det", header=TRUE)
blocks$size_kb <- (blocks$BP2 - blocks$BP1) / 1000

## Block size distribution
ggplot(blocks, aes(size_kb)) +
  geom_histogram(bins=60, fill="#2171b5", colour="white", linewidth=0.2) +
  scale_x_log10() +
  labs(x="Block size (kb, log scale)", y="Count",
       title=paste0("Haplotype block size distribution (n=", nrow(blocks), " blocks)")) +
  theme_bw(base_size=12)
ggsave("block_size_distribution.png", width=7, height=5, dpi=200)

cat("Block summary:\n")
cat("  N blocks:", nrow(blocks), "\n")
cat("  Median size (kb):", round(median(blocks$size_kb),1), "\n")
cat("  Max size (kb):", round(max(blocks$size_kb),1), "\n")
cat("  Total covered (Mb):", round(sum(blocks$size_kb)/1000, 1), "\n")

## Step 4: LD heatmap for a candidate region (e.g. Fst peak)
## Use LDheatmap package for focused visualisation
## library(LDheatmap); library(genetics)
## genos <- read.table("candidate_region_genotypes.txt", header=TRUE)
## ldh <- LDheatmap(genos, distances=positions, color=heat.colors(20))

Download:

Step 5 — Phase Quality Evaluation (Switch Error Rate)

Phase switch errors occur when the inferred haplotype assignment switches between the two true haplotypes. Switch error rate (SER) is the key quality metric -- measured by comparing against a known-truth phase (trio, long reads, or a population reference).

Evaluation method	Gold standard	Tool	Expected SER
Trio comparison	Trio-phased VCF (parent-offspring)	WhatsHap compare	SHAPEIT4 on N=1000: ~0.3-0.5% SER
Long-read comparison	HiFi-phased VCF same individual	WhatsHap compare	SHAPEIT4 on large cohort: ~0.1% SER
Cross-validation	Leave-one-out from panel	SHAPEIT4 --check	Population size matters most

Bash -- Phase switch error evaluation with WhatsHap compare

## Evaluate statistical phase against trio ground truth
whatshap compare \
  --sample SAMPLE01 \
  --names shapeit,trio \
  --tsv-pairwise switch_errors.tsv \
  shapeit4_phased.vcf.gz trio_phased_truth.vcf.gz

## Key output metrics:
## - switch_rate: fraction of consecutive het sites with wrong phase order
## - hamming_rate: fraction of all het sites on the wrong haplotype
## - block_n50: N50 of correct-phase runs

## Parse results
awk -F'\t' 'NR>1 {
  print "Switch error rate:", $6
  print "Hamming error rate:", $7
  print "Phase block N50:", $8
}' switch_errors.tsv

Download:

Step 6 — Applications of Phased Data

Application	Why phasing helps	Downstream tool
Selection scans (iHS, XP-EHH)	These statistics are computed from haplotype frequencies -- require phased input	selscan, rehh
Imputation	Imputation accuracy depends on phased reference panel quality	Beagle5, IMPUTE5, Minimac4
Genomic selection	Haplotype-based prediction outperforms single-SNP models in some scenarios	BGLR (haplotype block effects)
Allele-specific expression	Assign RNA-seq reads to parental haplotypes; requires haplotagged BAM from WhatsHap	ASEReadCounter, WASP
Compound het detection	Distinguish trans (LOF) from cis (one functional copy) mutations in clinical genetics	PharmCAT, manual VCF inspection
Haplotype-based GWAS	Test haplotype alleles instead of individual SNPs -- more power in low-diversity regions	PLINK --hap, GEMMA

Common Pitfalls

Problem	Symptom	Fix
Small cohort = high switch error	SHAPEIT4 SER > 2% with N < 50	Add a larger reference panel (1000G, HRC) as a phasing scaffold. Use --reference flag in SHAPEIT4. Or use read-backed phasing (WhatsHap) which does not require a large cohort.
Missing genotypes fragment phase blocks	Many short phase blocks; low N50	Pre-impute missing genotypes with Beagle --impute before phasing with SHAPEIT4. Or filter to <5% missingness per site.
Inbred samples confuse statistical phasing	All sites appear homozygous; near-zero heterozygosity	Statistical phasing is nearly irrelevant for highly inbred lines (RILs, inbred strains) where most sites are already homozygous. Focus on the few heterozygous regions using read-backed phasing.
Phasing across centromeres unreliable	Phase blocks never span centromere	This is expected and correct -- centromeres suppress recombination so phase connections are rare. Split analyses by chromosome arm if needed.
WhatsHap PQ = 0 for most sites	Very few reads bridge heterozygous pairs	Read depth too low (<10x) or insert size too short. Increase to 30x+ for short reads. For sparse variants, long reads are required.

All Tutorials CNV Detection

Haplotype Phasing

On this page

What is Phasing and Why Does It Matter?

Phasing Methods Compared

Step 1 — Statistical Phasing with SHAPEIT4

Step 2 — Read-backed Phasing with WhatsHap

Step 3 — Long-read and Hi-C Phasing

Step 4 — Haplotype Block Detection

Step 5 — Phase Quality Evaluation (Switch Error Rate)

Step 6 — Applications of Phased Data

Common Pitfalls