Resolve diploid genotypes into phased haplotypes using SHAPEIT4 (statistical), WhatsHap (read-backed), and long-read methods. Covers haplotype block detection, switch error evaluation, and applications in selection scans and breeding.

When you genotype a diploid individual, you get unphased genotypes: at a heterozygous site you know the individual has alleles A and B, but not which chromosome each allele is on. Phasing resolves this ambiguity, assigning each allele to its parental haplotype (chromosome copy).
| Unphased | Phased (two haplotypes) | Biological meaning |
|---|---|---|
| SNP1: A/B, SNP2: C/D | Hap1: A-C Hap2: B-D | Alleles A and C are inherited together on the same chromosome copy |
| Same genotypes | Hap1: A-D Hap2: B-C | Alternative phase -- A inherited with D on one chromosome |
| Method | Data needed | Accuracy | Best use case | Tool |
|---|---|---|---|---|
| Statistical (population) | Large cohort of unrelated individuals (N > 100 ideal) | High for common variants; degrades for rare variants | Population studies, GWAS, imputation panels | SHAPEIT4, BEAGLE5 |
| Read-backed | Short-read WGS (same sample BAM); works with N=1 | Limited to heterozygous sites on the same read pair (~500 bp) | Clinical samples, single samples, refine statistical phase | WhatsHap |
| Long-read | PacBio HiFi or Oxford Nanopore reads (>10 kb) | Excellent; resolves haplotypes across entire chromosomes | De novo assemblies, structural variants, complex regions | WhatsHap, HapCUT2, HiFiasm |
| Trio-based | Parents + offspring sequencing | Essentially perfect where parental genotypes differ | Family studies; creates ground-truth phase for benchmarking | WhatsHap --ped |
| Genetic map | Dense reference genetic map | Depends on map quality and LD structure | Improves all statistical phasing methods; reduces switch errors | SHAPEIT4 --map |
SHAPEIT4 is the current gold standard for statistical phasing of large cohorts. It uses a MCMC algorithm with a Li-Stephens haplotype copying model: each individual's haplotype is modelled as a mosaic of haplotypes from the reference panel (other individuals in the cohort or an external panel).
## ================================================================
## SHAPEIT4 statistical phasing
## Prerequisites: SHAPEIT4 v4.2+, bcftools, tabix
## Dataset: Rice 3K chr04 multi-sample VCF
## ================================================================
## Step 0: Pre-flight checks
## SHAPEIT4 requires:
## 1. Multi-sample VCF (biallelic SNPs, genotyped calls -- not missing)
## 2. Genetic map in PLINK format (chr pos_bp cM pos_bp)
## 3. Optionally: a pre-phased reference panel for scaffold phasing
VCF="rice3k_chr04_filtered.vcf.gz"
MAP="chr04_genetic_map_plink.txt" ## 4-column: chr snpID cM bp
OUTVCF="rice3k_chr04_phased.vcf.gz"
## Step 1: Filter to high-quality biallelic SNPs with <10% missing
bcftools view ${VCF} \
--min-alleles 2 --max-alleles 2 \
--type snps \
--min-ac 1:minor | \
bcftools filter \
-e 'F_MISSING > 0.1' \
-o rice3k_chr04_forPhasing.vcf.gz \
-O z
tabix -p vcf rice3k_chr04_forPhasing.vcf.gz
## Step 2: Run SHAPEIT4
shapeit4.2 \
--input rice3k_chr04_forPhasing.vcf.gz \
--map ${MAP} \
--region chr04 \
--output ${OUTVCF} \
--thread 16 \
--mcmc-iterations 5b,1p,1b,1p,1b,1p,5m \ ## burn-in+pruning+main iterations
--pbwt-depth 4 \ ## number of PBWT neighbours to condition on (4-8 typical)
--pbwt-mac 2 \ ## min MAC for PBWT conditioning
--log shapeit4_chr04.log
tabix -p vcf ${OUTVCF}
## Step 3: Verify phasing -- check PS (phase set) tags and switch to | notation
bcftools view -h ${OUTVCF} | grep "FORMAT"
## Phased genotypes are coded as 0|1 or 1|0 (pipe = phased)
## Unphased remaining genotypes are 0/1 (slash = unphased)
## Count phased vs unphased sites
bcftools stats ${OUTVCF} | grep "^PSC" ## per-sample phasing stats
bcftools view ${OUTVCF} | grep -v "^#" | \
awk '{for(i=10;i<=NF;i++) if($i~/\|/) phased++; else unphased++}
END {print "Phased:", phased, "Unphased:", unphased,
"Rate:", phased/(phased+unphased)}'
## Step 4: Scaffold with reference panel (optional -- improves rare variant phasing)
## Use 1000G phased reference or in-house phased panel as scaffold
shapeit4.2 \
--input rice3k_chr04_forPhasing.vcf.gz \
--map ${MAP} \
--region chr04 \
--reference 1kg_phased_reference_chr04.vcf.gz \ ## pre-phased reference
--output rice3k_chr04_phased_scaffolded.vcf.gz \
--thread 16
WhatsHap phases heterozygous variants that are co-covered by the same sequencing reads. It works on a single individual's BAM + VCF -- no population panel needed. It is ideal for: clinical samples, de novo assemblies, and refining statistical phase in regions of low LD.
## ================================================================
## WhatsHap read-backed phasing
## Works with short reads, PacBio HiFi, Nanopore, and trio data
## ================================================================
BAM="sample01_sorted_markdup.bam"
VCF="sample01_genotyped.vcf.gz"
REF="IRGSP1.0_genome.fasta"
OUT="sample01_phased.vcf.gz"
## Step 1: Basic read-backed phasing (single individual)
whatshap phase \
--output ${OUT} \
--reference ${REF} \
--indels \ ## also phase insertions/deletions
--distrust-genotypes \ ## re-genotype using read evidence (improves accuracy)
${VCF} ${BAM}
bgzip ${OUT}; tabix -p vcf ${OUT}.gz
## Step 2: Phasing statistics
whatshap stats \
--gtf phase_blocks.gtf \ ## optional: output phase blocks as GTF for IGV
${OUT}.gz
## Step 3: Trio phasing (parents remove switch errors completely)
## Requires: proband VCF, father BAM, mother BAM
whatshap phase \
--ped pedigree.ped \ ## pedigree file: child father mother sex
--output trio_phased.vcf.gz \
--reference ${REF} \
child.vcf.gz child.bam father.bam mother.bam
## Step 4: Haplotag BAM (tag each read with its inferred haplotype HP:i:1 or HP:i:2)
## Required for haplotype-specific methylation, allele-specific expression
whatshap haplotag \
--output sample01_haplotagged.bam \
--reference ${REF} \
${OUT}.gz ${BAM}
samtools index sample01_haplotagged.bam
## Now reads carry HP:1 or HP:2 tags -- usable in methylation and ASE analyses
## Step 5: Combine statistical + read-backed phasing
## Best practice: run SHAPEIT4 first (population-level), then refine with WhatsHap
## This gives chromosome-scale blocks with read-level accuracy within blocks
shapeit4.2 --input sample01.vcf.gz --map chr04.map \
--region chr04 --output sample01_stat_phased.vcf.gz --thread 8
whatshap phase \
--reference ${REF} \
--chromosome chr04 \
--output sample01_combined_phased.vcf.gz \
sample01_stat_phased.vcf.gz ${BAM} ## WhatsHap refines the SHAPEIT4 phase
## Step 6: Long-read phasing with HiFi (PacBio)
## No genetic map needed; reads are long enough to bridge most het sites
whatshap phase \
--output sample01_hifi_phased.vcf.gz \
--reference ${REF} \
--indels \
sample01.vcf.gz sample01_hifi.bam ## HiFi BAM replaces short-read BAM
| Technology | Read length | Phase block N50 | Pipeline |
|---|---|---|---|
| PacBio HiFi (CCS) | 10-25 kb | Chromosome-scale (>10 Mb typical) | HiFiasm, WhatsHap, DeepVariant+WhatsHap |
| Oxford Nanopore (ultra-long) | Up to 4 Mb | Chromosome-scale | PEPPER-Margin-DeepVariant, Clair3+WhatsHap |
| Hi-C proximity ligation | Short reads (150 bp) but captures long-range contacts | Chromosome-scale | HapCUT2+HiCPhase, 3D-DNA |
| 10x Genomics Linked Reads | Short reads from the same DNA molecule (60 kb) | 1-4 Mb | LongRanger, STEP |
## ================================================================
## PacBio HiFi chromosome-scale phasing
## HiFiasm produces a phased assembly directly from HiFi reads
## ================================================================
HIFI_READS="sample01_hifi.fastq.gz"
HIC_R1="sample01_hic_R1.fastq.gz"
HIC_R2="sample01_hic_R2.fastq.gz"
## Option A: HiFi-only assembly (haplotype-resolved if heterozygosity >0.5%)
hifiasm -o sample01 -t 32 ${HIFI_READS}
## Outputs: sample01.hic.hap1.p_ctg.gfa, sample01.hic.hap2.p_ctg.gfa
## Option B: Hi-C integrated phasing (chromosome-scale haplotypes)
hifiasm -o sample01_hic -t 32 \
--h1 ${HIC_R1} \
--h2 ${HIC_R2} \
${HIFI_READS}
## Convert GFA to FASTA
awk '/^S/{print ">"$2"\n"$3}' sample01_hic.hic.hap1.p_ctg.gfa > hap1.fasta
awk '/^S/{print ">"$2"\n"$3}' sample01_hic.hic.hap2.p_ctg.gfa > hap2.fasta
## Evaluate phasing quality: QUAST + BUSCO per haplotype
quast.py hap1.fasta hap2.fasta \
--reference reference.fasta \
--threads 8 \
--output-dir quast_haplotypes
busco -i hap1.fasta -o busco_hap1 -l embryophyta_odb10 -m genome -c 16
busco -i hap2.fasta -o busco_hap2 -l embryophyta_odb10 -m genome -c 16
## Map HiFi reads back to determine switch errors vs reference
## Use yak to calculate QV and switch error rate
yak count -b37 -t16 -o pat.yak pat_reads.fastq.gz ## paternal k-mers
yak count -b37 -t16 -o mat.yak mat_reads.fastq.gz ## maternal k-mers
yak trioeval pat.yak mat.yak hap1.fasta hap2.fasta
Haplotype blocks (also called LD blocks) are chromosomal regions where recombination is suppressed and alleles are inherited as a unit. Detecting them helps identify haplotype-tag SNPs, understand population history, and design marker panels.
## ================================================================
## Haplotype block detection using PLINK and LDBlockShow
## ================================================================
library(dplyr); library(ggplot2)
## Step 1: Compute pairwise LD with PLINK2
## plink2 --vcf phased.vcf.gz --r2-phased --ld-window-kb 500 \
## --ld-window-r2 0.05 --out chr04_ld_phased --allow-extra-chr --double-id
## Step 2: Define blocks using Gabriel 2002 method in PLINK1.9
## plink --vcf phased.vcf.gz --blocks no-pheno-req --blocks-max-kb 500 \
## --out chr04_blocks --allow-extra-chr --double-id
## Output: chr04_blocks.blocks.det -- block start, end, n_SNPs, kb
## Step 3: Visualise block structure in R
blocks <- read.table("chr04_blocks.blocks.det", header=TRUE)
blocks$size_kb <- (blocks$BP2 - blocks$BP1) / 1000
## Block size distribution
ggplot(blocks, aes(size_kb)) +
geom_histogram(bins=60, fill="#2171b5", colour="white", linewidth=0.2) +
scale_x_log10() +
labs(x="Block size (kb, log scale)", y="Count",
title=paste0("Haplotype block size distribution (n=", nrow(blocks), " blocks)")) +
theme_bw(base_size=12)
ggsave("block_size_distribution.png", width=7, height=5, dpi=200)
cat("Block summary:\n")
cat(" N blocks:", nrow(blocks), "\n")
cat(" Median size (kb):", round(median(blocks$size_kb),1), "\n")
cat(" Max size (kb):", round(max(blocks$size_kb),1), "\n")
cat(" Total covered (Mb):", round(sum(blocks$size_kb)/1000, 1), "\n")
## Step 4: LD heatmap for a candidate region (e.g. Fst peak)
## Use LDheatmap package for focused visualisation
## library(LDheatmap); library(genetics)
## genos <- read.table("candidate_region_genotypes.txt", header=TRUE)
## ldh <- LDheatmap(genos, distances=positions, color=heat.colors(20))
Phase switch errors occur when the inferred haplotype assignment switches between the two true haplotypes. Switch error rate (SER) is the key quality metric -- measured by comparing against a known-truth phase (trio, long reads, or a population reference).
| Evaluation method | Gold standard | Tool | Expected SER |
|---|---|---|---|
| Trio comparison | Trio-phased VCF (parent-offspring) | WhatsHap compare | SHAPEIT4 on N=1000: ~0.3-0.5% SER |
| Long-read comparison | HiFi-phased VCF same individual | WhatsHap compare | SHAPEIT4 on large cohort: ~0.1% SER |
| Cross-validation | Leave-one-out from panel | SHAPEIT4 --check | Population size matters most |
## Evaluate statistical phase against trio ground truth
whatshap compare \
--sample SAMPLE01 \
--names shapeit,trio \
--tsv-pairwise switch_errors.tsv \
shapeit4_phased.vcf.gz trio_phased_truth.vcf.gz
## Key output metrics:
## - switch_rate: fraction of consecutive het sites with wrong phase order
## - hamming_rate: fraction of all het sites on the wrong haplotype
## - block_n50: N50 of correct-phase runs
## Parse results
awk -F'\t' 'NR>1 {
print "Switch error rate:", $6
print "Hamming error rate:", $7
print "Phase block N50:", $8
}' switch_errors.tsv
| Application | Why phasing helps | Downstream tool |
|---|---|---|
| Selection scans (iHS, XP-EHH) | These statistics are computed from haplotype frequencies -- require phased input | selscan, rehh |
| Imputation | Imputation accuracy depends on phased reference panel quality | Beagle5, IMPUTE5, Minimac4 |
| Genomic selection | Haplotype-based prediction outperforms single-SNP models in some scenarios | BGLR (haplotype block effects) |
| Allele-specific expression | Assign RNA-seq reads to parental haplotypes; requires haplotagged BAM from WhatsHap | ASEReadCounter, WASP |
| Compound het detection | Distinguish trans (LOF) from cis (one functional copy) mutations in clinical genetics | PharmCAT, manual VCF inspection |
| Haplotype-based GWAS | Test haplotype alleles instead of individual SNPs -- more power in low-diversity regions | PLINK --hap, GEMMA |
| Problem | Symptom | Fix |
|---|---|---|
| Small cohort = high switch error | SHAPEIT4 SER > 2% with N < 50 | Add a larger reference panel (1000G, HRC) as a phasing scaffold. Use --reference flag in SHAPEIT4. Or use read-backed phasing (WhatsHap) which does not require a large cohort. |
| Missing genotypes fragment phase blocks | Many short phase blocks; low N50 | Pre-impute missing genotypes with Beagle --impute before phasing with SHAPEIT4. Or filter to <5% missingness per site. |
| Inbred samples confuse statistical phasing | All sites appear homozygous; near-zero heterozygosity | Statistical phasing is nearly irrelevant for highly inbred lines (RILs, inbred strains) where most sites are already homozygous. Focus on the few heterozygous regions using read-backed phasing. |
| Phasing across centromeres unreliable | Phase blocks never span centromere | This is expected and correct -- centromeres suppress recombination so phase connections are rare. Split analyses by chromosome arm if needed. |
| WhatsHap PQ = 0 for most sites | Very few reads bridge heterozygous pairs | Read depth too low (<10x) or insert size too short. Increase to 30x+ for short reads. For sparse variants, long reads are required. |