Detect deletions, duplications, and complex structural variants from WGS data. Covers read-depth analysis, split-read and paired-end methods, CNVnator, Manta, LUMPY, normalization formulas, and live depth-ratio calculators.

Copy Number Variations (CNVs) are genomic regions that are deleted or duplicated relative to a reference genome. They range from a few hundred base pairs to entire chromosome arms, and account for a larger fraction of genetic diversity between individuals than SNPs.
| CNV Type | Copy number change | Effect | Detection signal |
|---|---|---|---|
| Deletion (DEL) | Loss of one or both copies (CN 0 or 1 vs expected 2) | Haploinsufficiency, loss of function | Decreased read depth, paired-end discordance |
| Duplication (DUP) | Gain of one or more copies (CN 3+) | Gene dosage effect, neo-functionalization | Increased read depth, tandem split reads |
| Inversion (INV) | No copy change; orientation flipped | Gene disruption if breakpoint in gene; position effect | Paired-end orientation discordance |
| Translocation (TRA/BND) | Chromosomal segment moves to new location | Fusion genes, altered regulation | Inter-chromosomal paired-end discordance |
| Complex SV | Multiple rearrangements at same locus | Variable; often severe | Combination of all signals; difficult to resolve |
| Signal | Abbreviation | How it works | Best for |
|---|---|---|---|
| Read Depth (RD) | RD/RC | Count reads per bin; normalize; segment into copy number states. Heterozygous deletions show ~0.5x the normal coverage, homozygous deletions ~0x; duplications show 1.5x (het) or 2x (hom). | Large CNVs (>1 kb); tandem duplications; reliable for copy number estimation |
| Paired-End (PE) | PE/RP | Read pairs that map too far apart, too close, or in the wrong orientation signal a structural variant at their mapping positions. | Breakpoint detection; inversions; translocations |
| Split Read (SR) | SR/SA | Reads that span a breakpoint are split: one part maps to one location, the other maps elsewhere. Base-pair resolution breakpoints. | Precise breakpoint calling; small SVs; insertion detection |
| Local Assembly (LA) | AS | Assemble reads de novo in candidate regions; compare assembled contigs to reference to detect variants directly. | Complex SVs; mobile elements; most accurate method |
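The PE signal is usually extracted up front: the LUMPY/smoove documentation pulls discordant pairs with `samtools view -b -F 1294`. A quick sketch of what that exclude mask contains (bit values from the SAM flag specification):

```shell
# samtools view -b -F 1294 sample.bam > sample.discordants.bam
# keeps only discordant pairs by excluding every read whose flag matches
# any bit of the mask built below (SAM flag bits):
mask=0
for bit in 2 4 8 256 1024; do  # proper pair, unmapped, mate unmapped, secondary, duplicate
  mask=$((mask + bit))
done
echo "exclude mask: ${mask}"  # -> 1294
```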
Read-depth CNV detection normalises coverage per bin and then segments the normalised log-ratio to identify copy-number states. Understanding the normalisation formula helps you interpret and troubleshoot results.
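A minimal sketch of that normalisation on made-up bins (`bin_depths.txt` is a toy file; real per-bin depths would come from e.g. mosdepth): divide each bin's depth by the genome-wide mean and take log2, so a diploid bin sits near 0, a heterozygous deletion near -1, and a heterozygous duplication near +0.58.

```shell
# Toy bins: chrom, bin start, raw depth (~30x sample; bin 4 is a het
# deletion at ~15x, bin 5 a het duplication at ~45x)
cat > bin_depths.txt <<'EOF'
chr1 0 30
chr1 500 31
chr1 1000 29
chr1 1500 15
chr1 2000 45
EOF
awk '{ line[NR] = $0; depth[NR] = $3; sum += $3 }
     END {
       mean = sum / NR
       for (i = 1; i <= NR; i++) {
         r = depth[i] / mean            # ~1.0 diploid, ~0.5 het del, ~1.5 het dup
         printf "%s\t%.2f\t%.2f\n", line[i], r, log(r) / log(2)
       }
     }' bin_depths.txt
```

On real data, add a small pseudocount before the log so homozygous-deletion bins with depth 0 do not produce -Inf.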
## ================================================================
## CNVnator v0.4.1+ -- read-depth segmentation
## Works for germline CNVs in diploid organisms
## Required: sorted, duplicate-marked BAM; reference FASTA split by chr
## ================================================================
BAM="sample01_sorted_markdup.bam"
REF_DIR="reference_chromosomes/" ## one FASTA file per chromosome
SAMPLE="sample01"
BIN=500 ## bin size in bp: 100-500 bp for 30x WGS
## Step 1: Extract read depth histogram from BAM
cnvnator -root ${SAMPLE}.root \
-tree ${BAM} \
-chrom $(seq 1 12 | sed 's/^/chr/') ## chr1..chr12 (rice); adjust to your genome
## Step 2: Generate read depth histograms (one per bin size)
cnvnator -root ${SAMPLE}.root \
-his ${BIN} \
-d ${REF_DIR} ## reference chromosome files for GC correction
## Step 3: Compute statistics per bin (mean, variance)
cnvnator -root ${SAMPLE}.root \
-stat ${BIN}
## Step 4: RD signal partitioning (segmentation into CN states)
cnvnator -root ${SAMPLE}.root \
-partition ${BIN}
## Step 5: Call CNVs from segments
cnvnator -root ${SAMPLE}.root \
-call ${BIN} > ${SAMPLE}_cnvnator_${BIN}bp.txt
## Output format (9 columns):
## CNV_type coordinates size normalised_RD e-val1 e-val2 e-val3 e-val4 q0
## CNV_type: deletion/duplication
## normalised_RD: ~0.5 = het del, ~0 = hom del, ~1.5 = het dup
## e-val1: t-test; e-val2: Gaussian tail probability of the RD values;
## e-val3/e-val4: the same tests restricted to the middle of the call
## q0: fraction of reads with MAPQ=0 (high q0 = repetitive region)
## Filter: keep high-quality calls (q0 is column 9)
awk '$9 < 0.5' ${SAMPLE}_cnvnator_${BIN}bp.txt > ${SAMPLE}_cnvnator_highQ.txt
## q0 < 0.5 means fewer than 50% of reads have MAPQ=0 (non-repetitive)
## Convert to BED format for downstream intersect
awk '{
split($2, a, /[:-]/);
print a[1], a[2]-1, a[3], $1, $3, $4 ## BED is 0-based half-open; shift start
}' OFS="\t" ${SAMPLE}_cnvnator_highQ.txt > ${SAMPLE}_cnvnator.bed
echo "CNVnator calls: $(wc -l < ${SAMPLE}_cnvnator_highQ.txt)"
echo "Deletions: $(grep deletion ${SAMPLE}_cnvnator_highQ.txt | wc -l)"
echo "Duplications: $(grep duplication ${SAMPLE}_cnvnator_highQ.txt | wc -l)"
## ================================================================
## Manta v1.6+ -- split-read + paired-end SV calling
## Manta uses local assembly for precise breakpoints
## ================================================================
BAM="sample01_sorted_markdup.bam"
REF="IRGSP1.0_genome.fasta"
OUTDIR="manta_output"
## Step 1: Configure Manta
configManta.py \
--bam ${BAM} \
--referenceFasta ${REF} \
--runDir ${OUTDIR}
## Step 2: Run Manta (-j sets parallel jobs; -m local runs on this machine)
${OUTDIR}/runWorkflow.py -j 16 -m local
## Main outputs:
## candidateSV.vcf.gz: all candidate SVs (lower quality filter)
## diploidSV.vcf.gz: high-quality diploid SVs (use this)
## candidateSmallIndels.vcf.gz: pass to Strelka via --indelCandidates
## Step 3: Filter Manta output: PASS records only; SVs >= 50 bp plus all
## translocations (Manta reports deletion SVLEN as negative, hence ABS)
bcftools view \
-f PASS \
-i 'ABS(SVLEN) >= 50 || SVTYPE == "BND"' \
${OUTDIR}/results/variants/diploidSV.vcf.gz | \
bcftools sort | bgzip -c > manta_filtered.vcf.gz
tabix -p vcf manta_filtered.vcf.gz
## ================================================================
## SMOOVE (wrapper around LUMPY) -- population-level SV genotyping
## Best for cohort studies; genotypes SVs across all samples jointly
## ================================================================
## Step 1: Call SVs per sample
for BAM_FILE in *.bam; do
SAMPLE_NAME=$(basename ${BAM_FILE} _sorted_markdup.bam)
smoove call \
--outdir smoove_per_sample \
--name ${SAMPLE_NAME} \
--fasta ${REF} \
-p 8 \
--genotype \
${BAM_FILE}
done
## Step 2: Merge all sample VCFs into a unified site list
smoove merge \
--name rice3k \
--fasta ${REF} \
--outdir smoove_merged \
smoove_per_sample/*.genotyped.vcf.gz
## Step 3: Re-genotype all samples at merged sites
for VCF_FILE in smoove_per_sample/*.genotyped.vcf.gz; do
SAMPLE_NAME=$(basename ${VCF_FILE} .genotyped.vcf.gz)
## --duphold annotates each genotype with depth-based quality
smoove genotype \
--duphold \
--name ${SAMPLE_NAME} \
--outdir smoove_genotyped \
--fasta ${REF} \
--vcf smoove_merged/rice3k.sites.vcf.gz \
${SAMPLE_NAME}_sorted_markdup.bam
done
## Step 4: Paste into multi-sample VCF
smoove paste \
--name rice3k_cohort \
--outdir smoove_final \
smoove_genotyped/*.vcf.gz
## Output: rice3k_cohort.smoove.square.vcf.gz
## This is a multi-sample SV VCF with genotypes at all sites in all samples
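From the squared VCF, per-site carrier counts fall out of a `bcftools query` one-liner. The query is shown as a comment (it needs the real cohort VCF); the awk counting step runs here on a made-up three-sample extract:

```shell
# On the real file you would run:
#   bcftools query -f '%CHROM\t%POS\t%INFO/SVTYPE[\t%GT]\n' \
#       smoove_final/rice3k_cohort.smoove.square.vcf.gz
# Toy extract of that output (three samples):
cat > toy_gts.txt <<'EOF'
chr1 10523 DEL 0/1 0/0 1/1
chr1 88410 DUP 0/0 0/0 0/1
chr2 4021 DEL ./. 0/0 0/0
EOF
awk '{
  nonref = 0
  for (i = 4; i <= NF; i++)               # one column per sample genotype
    if ($i != "0/0" && $i != "./.") nonref++
  print $1, $2, $3, nonref                # carriers of each SV
}' OFS="\t" toy_gts.txt
```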
## ================================================================
## Merge CNVnator (RD) + Manta (SR+PE) calls
## Variants supported by both methods = high-confidence CNVs
## ================================================================
## Convert Manta VCF to BED (for bedtools intersect)
bcftools view -i 'SVTYPE=="DEL" || SVTYPE=="DUP"' manta_filtered.vcf.gz | \
bcftools query \
-f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE\t%INFO/SVLEN\n' | \
awk '{if($5<0) $5=-$5; print}' OFS="\t" > manta_deldups.bed
## CNVnator BED (sample01_cnvnator.bed) was created in the CNVnator section above
## Intersect: require 50% reciprocal overlap (-f 0.5 -r)
bedtools intersect \
-a sample01_cnvnator.bed \
-b manta_deldups.bed \
-f 0.5 -r \
-wa -wb \
> confirmed_cnvs.bed
echo "CNVnator calls: $(wc -l < sample01_cnvnator.bed)"
echo "Manta calls: $(wc -l < manta_deldups.bed)"
echo "Confirmed by both (50% overlap): $(wc -l < confirmed_cnvs.bed)"
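What `-f 0.5 -r` actually enforces: the overlap must span at least 50% of both intervals, not just one. A standalone awk check on two hypothetical interval pairs (columns: A start/end, B start/end):

```shell
cat > pairs.txt <<'EOF'
1000 5000 2000 6000
1000 5000 4500 20000
EOF
awk '{
  ov = ($2 < $4 ? $2 : $4) - ($1 > $3 ? $1 : $3)   # overlap length
  if (ov < 0) ov = 0
  fa = ov / ($2 - $1); fb = ov / ($4 - $3)         # fraction of each interval covered
  verdict = (fa >= 0.5 && fb >= 0.5) ? "PASS" : "FAIL"
  print $0, verdict
}' OFS="\t" pairs.txt
```

The second pair overlaps, but the overlap covers only 12.5% of interval A and ~3% of B, so it fails the reciprocal test; without `-r`, a tiny call nested inside a huge one would count as confirmation.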
## ================================================================
## R: CNV quality filter and annotation
## ================================================================
library(dplyr)
cnvs <- read.table("confirmed_cnvs.bed", sep="\t",
col.names=c("chr","start","end","type","size","norm_RD",
"chr2","start2","end2","type2","size2"))
cnvs$size_kb <- cnvs$size / 1000
## Quality filters (the q0/repeat filter was already applied in the CNVnator step):
## 1. Remove very small CNVs (< 1 kb) -- likely noise
## 2. Remove CNVs on unplaced scaffolds
## (Also consider masking centromeres/telomeres with a region BED.)
cnvs_filtered <- cnvs %>%
filter(size_kb >= 1) %>%
filter(!grepl("chrUn|random|scaffold", chr))
cat("High-confidence CNVs after filtering:", nrow(cnvs_filtered), "\n")
cat(" Deletions:", sum(cnvs_filtered$type == "deletion"), "\n")
cat(" Duplications:", sum(cnvs_filtered$type == "duplication"), "\n")
cat(" Size range:", round(min(cnvs_filtered$size_kb),1), "-",
round(max(cnvs_filtered$size_kb),1), "kb\n")
## ================================================================
## Genotype CNV sites across a population
## Two approaches: SVTyper (PE+SR) and duphold (depth validation)
## ================================================================
## Method A: SVTyper -- genotype SV sites in each sample
## confirmed_svs.vcf is the merged discovery VCF built from the calls above
for BAM in *.bam; do
SAMPLE=$(basename ${BAM} _sorted_markdup.bam)
svtyper \
-B ${BAM} \
-i confirmed_svs.vcf \
> ${SAMPLE}_svtyped.vcf
done
## Method B: duphold -- annotate depth fold-change at each SV
## duphold integrates read-depth evidence into any SV VCF
for BAM in *.bam; do
SAMPLE=$(basename ${BAM} _sorted_markdup.bam)
duphold \
--threads 8 \
--vcf confirmed_svs.vcf \
--bam ${BAM} \
--fasta reference.fasta \
--output ${SAMPLE}_duphold.bcf
bcftools convert -O z ${SAMPLE}_duphold.bcf > ${SAMPLE}_duphold.vcf.gz
tabix -p vcf ${SAMPLE}_duphold.vcf.gz
done
## duphold key annotations added to the FORMAT field:
## DHFC: depth fold-change vs the rest of the chromosome
## DHBFC: fold-change vs bins of similar GC content (works best for duplications)
## DHFFC: fold-change vs the immediate flanking regions (works best for deletions)
## Merge the per-sample duphold VCFs (sites share SV IDs, so merge by ID),
## then filter on the duphold annotations to remove false positives
bcftools merge -m id -O z -o merged_genotyped.vcf.gz *_duphold.vcf.gz
tabix -p vcf merged_genotyped.vcf.gz
bcftools filter \
-i '(SVTYPE=="DEL" && FMT/DHFFC < 0.7) || (SVTYPE=="DUP" && FMT/DHBFC > 1.3)' \
merged_genotyped.vcf.gz > depth_confirmed_svs.vcf
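A sanity check of the threshold logic on made-up fold-change values (the duphold README recommends DHFFC for deletions and DHBFC for duplications); `toy_duphold.txt` columns are SVTYPE, DHFFC, DHBFC:

```shell
cat > toy_duphold.txt <<'EOF'
DEL 0.48 0.51
DEL 0.95 1.01
DUP 1.52 1.49
DUP 1.05 1.02
EOF
# A true deletion should lose depth (DHFFC < 0.7); a true duplication
# should gain depth (DHBFC > 1.3). Rows 2 and 4 look like false positives.
awk '{
  keep = ($1 == "DEL" && $2 < 0.7) || ($1 == "DUP" && $3 > 1.3)
  print $0, (keep ? "depth-confirmed" : "dropped")
}' OFS="\t" toy_duphold.txt
```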
## ================================================================
## R: genome-wide copy number profile and CNV size distribution
## ================================================================
library(ggplot2); library(dplyr)
## Load per-bin read depth (e.g. from mosdepth --by 10000)
depth <- read.table("sample01.regions.bed.gz", header=FALSE,
col.names=c("chr","start","end","depth"))
depth$pos_mb <- (depth$start + depth$end) / 2 / 1e6
mean_depth <- mean(depth$depth)
depth$log2r <- log2(depth$depth / mean_depth + 0.001) ## pseudocount
## Genome-wide CNV profile (one panel per chromosome)
ggplot(depth %>% filter(grepl("^chr[0-9]", chr)), aes(pos_mb, log2r)) +
geom_point(size=0.2, alpha=0.3, colour="#2171b5") +
geom_hline(yintercept=0, colour="grey50", linewidth=0.4) +
geom_hline(yintercept=c(-1, 0.585, 1), linetype="dashed",
colour=c("red","orange","red"), linewidth=0.3) +
facet_wrap(~chr, ncol=4, scales="free_x") +
scale_y_continuous(limits=c(-3, 2.5)) +
labs(x="Position (Mb)", y="Log2 copy ratio",
title="Genome-wide copy number profile",
subtitle="Dashed lines: -1=het del, +0.585=het dup, +1=hom dup") +
theme_minimal(base_size=9)
ggsave("cnv_genome_profile.png", width=14, height=10, dpi=200)
## CNV size distribution
## Keep only the CNVnator half of the intersect output (first 5 columns)
cnvs <- read.table("confirmed_cnvs.bed", header=FALSE)[, 1:5]
colnames(cnvs) <- c("chr","start","end","type","size")
cnvs$size_kb <- cnvs$size / 1000
cnvs$type2 <- ifelse(grepl("deletion", cnvs$type, ignore.case=TRUE), "Deletion", "Duplication")
ggplot(cnvs, aes(size_kb, fill=type2)) +
geom_histogram(bins=50, colour="white", linewidth=0.2, position="identity", alpha=0.7) +
scale_x_log10(name="CNV size (kb, log scale)") +
scale_fill_manual(values=c("Deletion"="#e41a1c","Duplication"="#377eb8")) +
labs(y="Count", fill="CNV type", title="CNV size distribution") +
theme_bw(base_size=12)
ggsave("cnv_size_distribution.png", width=7, height=4, dpi=200)
| Problem | Symptom | Fix |
|---|---|---|
| Repetitive regions inflate CNV calls | Hundreds of CNVs in satellite DNA, centromeres, telomeres | Remove calls with q0 > 0.5 (i.e. keep q0 < 0.5) in CNVnator output. Mask repeat regions with bedtools subtract using a RepeatMasker BED. Check CNV coordinates against a RepeatMasker track in a genome browser. |
| GC bias creates false CNV pattern | Systematic deletions in high-GC promoters; duplications in AT-rich regions | CNVnator and Control-FREEC apply GC correction. For custom pipelines use LOESS regression: fit depth ~ GC%, then divide each bin depth by the predicted value. |
| Unequal sequencing depth between samples | Sample with lower depth has more false deletion calls | Always normalise to genome-wide mean depth per sample, not globally. Never compare raw read counts between samples. |
| PCR duplicates inflate duplication calls | Tandem duplications appear larger or at higher copy number | Always mark and exclude PCR duplicates (Picard MarkDuplicates) before any CNV analysis. For amplicon libraries, use UMI-aware deduplication. |
| Manta misses tandem duplications | Large tandem dups detected by CNVnator but not Manta | Tandem duplications are harder for split-read callers because both breakpoints are at similar sequences. Use CNVnator + Manta union (OR) for duplications, but AND (intersection) for deletions. |
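The LOESS idea in the GC-bias row can be approximated with a two-pass awk: bucket bins by GC%, treat each bucket's mean depth as the fitted value, and divide. A toy sketch (`gc_bins.txt` is invented; real input would pair mosdepth bins with per-bin GC fractions):

```shell
# Toy bins: chrom, start, GC fraction, raw depth. High-GC bins run deeper
# here, mimicking GC bias.
cat > gc_bins.txt <<'EOF'
chr1 0 0.35 28
chr1 500 0.36 30
chr1 1000 0.55 42
chr1 1500 0.55 44
chr1 2000 0.35 29
EOF
awk '
  NR == FNR { g = int($3 * 20); sum[g] += $4; n[g]++; next }  # pass 1: mean depth per 5% GC bucket
  {
    g = int($3 * 20)
    pred = sum[g] / n[g]                                      # fitted depth for this GC bucket
    printf "%s\t%s\t%.2f\n", $1, $2, $4 / pred                # corrected ratio, ~1.0 everywhere
  }
' gc_bins.txt gc_bins.txt
```

After correction every bin sits near 1.0 even though raw depths ranged 28-44; a real pipeline would fit LOESS over continuous GC rather than coarse buckets, but the divide-by-fitted-value step is the same.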