Copy Number Variation (CNV) Detection

Detect deletions, duplications, and complex structural variants from WGS data. Covers read-depth analysis, split-read and paired-end methods, CNVnator, Manta, LUMPY, normalization formulas, and live depth-ratio calculators.

~120 min · Intermediate-Advanced · Live spreadsheets · CNVnator / Manta / R

CNV Concepts

Copy Number Variations (CNVs) are genomic regions that are deleted or duplicated relative to a reference genome. They range from a few hundred base pairs to entire chromosome arms, and account for a larger fraction of genetic diversity between individuals than SNPs.

CNV Type | Copy number change | Effect | Detection signal
Deletion (DEL) | Loss of one or both copies (CN 0 or 1 vs expected 2) | Haploinsufficiency, loss of function | Decreased read depth, paired-end discordance
Duplication (DUP) | Gain of one or more copies (CN 3+) | Gene dosage effect, neo-functionalization | Increased read depth, tandem split reads
Inversion (INV) | No copy change; orientation flipped | Gene disruption if breakpoint in gene; position effect | Paired-end orientation discordance
Translocation (TRA/BND) | Chromosomal segment moves to new location | Fusion genes, altered regulation | Inter-chromosomal paired-end discordance
Complex SV | Multiple rearrangements at same locus | Variable; often severe | Combination of all signals; difficult to resolve

Four Detection Signals from WGS

Signal | Abbreviation | How it works | Best for
Read Depth (RD) | RD/RC | Count reads per bin; normalize; segment into copy number states. Deleted regions have half the normal coverage; duplicated regions have 1.5x or 2x. | Large CNVs (>1 kb); tandem duplications; reliable for copy number estimation
Paired-End (PE) | PE/DP | Read pairs that map too far apart, too close, or in the wrong orientation signal a structural variant at their mapping positions. | Breakpoint detection; inversions; translocations
Split Read (SR) | SR/SA | Reads that span a breakpoint are split: one part maps to one location, the other maps elsewhere. Base-pair resolution breakpoints. | Precise breakpoint calling; small SVs; insertion detection
Local Assembly (LA) | AS | Assemble reads de novo in candidate regions; compare assembled contigs to the reference to detect variants directly. | Complex SVs; mobile elements; most accurate method
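The PE and SR signals can be inspected directly from a BAM before running any caller. A minimal sketch (the BAM filename is an assumption; the flag mask is the standard discordant-read extraction used in the LUMPY documentation):

```shell
BAM="sample01_sorted_markdup.bam"   ## assumed filename

## Discordant pairs: exclude proper pairs (0x2), unmapped reads (0x4),
## mate unmapped (0x8), secondary alignments (0x100), duplicates (0x400)
## 0x2 + 0x4 + 0x8 + 0x100 + 0x400 = 1294
samtools view -b -F 1294 "${BAM}" > discordants.bam

## Split reads: alignments carrying an SA (supplementary alignment) tag
samtools view -h "${BAM}" | awk '/^@/ || /SA:Z:/' | \
  samtools view -b - > splitters.bam
```

A pile-up of discordant pairs or split reads at one locus is exactly the evidence that Manta and LUMPY aggregate into SV calls.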
Byte
Always use at least two independent signals. Single-signal CNV callers have false positive rates of 30-50%. The standard best practice is to intersect calls from a read-depth caller (CNVnator) with a split-read/paired-end caller (Manta or DELLY). Variants supported by both methods have false positive rates below 5%. For clinical-grade calls, add local assembly (Manta's assembled contigs) as a third line of evidence.

Step 1 — Read-Depth Analysis: The Math

Read-depth CNV detection normalises coverage per bin and then segments the normalised log-ratio to identify copy-number states. Understanding the normalisation formula helps you interpret and troubleshoot results.

Read-Depth CNV Formulas
Normalised depth = (bin_depth / mean_depth) * 2
Log2 ratio = log2(normalised_depth / 2) = log2(bin_depth / mean_depth)
Expected copy number = 2 * (bin_depth / mean_depth)
Log2 ratio = 0      → Normal diploid (copy number = 2)
Log2 ratio = -1     → Heterozygous deletion (one copy deleted, CN = 1)
Log2 ratio = -Inf   → Homozygous deletion (both copies deleted, CN = 0)
Log2 ratio = +0.585 → Heterozygous duplication (one extra copy, CN = 3)
Log2 ratio = +1.0   → Homozygous duplication (CN = 4)
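The formulas can be verified with a quick awk calculation. A minimal sketch, assuming a genome-wide mean depth of 30x; the bin depths (30, 15, 45, 60) are invented so that each copy-number state appears once:

```shell
MEAN=30   ## assumed genome-wide mean depth
for DEPTH in 30 15 45 60; do
  awk -v d="${DEPTH}" -v m="${MEAN}" 'BEGIN {
    cn = 2 * d / m                ## expected copy number
    lr = log(d / m) / log(2)     ## log2 ratio
    printf "depth=%d  CN=%.1f  log2ratio=%.3f\n", d, cn, lr
  }'
done
## depth=30 -> CN=2.0, log2ratio=0.000  (normal diploid)
## depth=15 -> CN=1.0, log2ratio=-1.000 (het deletion)
## depth=45 -> CN=3.0, log2ratio=0.585  (het duplication)
## depth=60 -> CN=4.0, log2ratio=1.000  (hom duplication)
```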

Step 2 — CNVnator Read-Depth Pipeline

Bash -- CNVnator read-depth CNV calling pipeline
## ================================================================
## CNVnator v0.4.1+ -- read-depth segmentation
## Works for germline CNVs in diploid organisms
## Required: sorted, duplicate-marked BAM; reference FASTA split by chr
## ================================================================

BAM="sample01_sorted_markdup.bam"
REF_DIR="reference_chromosomes/"   ## one FASTA file per chromosome
SAMPLE="sample01"
BIN=500    ## bin size in bp: 100-500 bp for 30x WGS

## Step 1: Extract read depth histogram from BAM
cnvnator -root ${SAMPLE}.root \
  -tree ${BAM} \
  -chrom $(seq 1 12 | sed 's/^/chr/')   ## specify chromosomes (chr1..chr12)

## Step 2: Generate read depth histograms (one per bin size)
cnvnator -root ${SAMPLE}.root \
  -his ${BIN} \
  -d ${REF_DIR}    ## reference chromosome files for GC correction

## Step 3: Compute statistics per bin (mean, variance)
cnvnator -root ${SAMPLE}.root \
  -stat ${BIN}

## Step 4: RD signal partitioning (segmentation into CN states)
cnvnator -root ${SAMPLE}.root \
  -partition ${BIN}

## Step 5: Call CNVs from segments
cnvnator -root ${SAMPLE}.root \
  -call ${BIN} > ${SAMPLE}_cnvnator_${BIN}bp.txt

## Output format (9 columns):
## CNV_type chr:start-end size normalised_RD e-val1 e-val2 e-val3 e-val4 q0
## CNV_type: deletion/duplication
## normalised_RD: diploid = ~1.0; ~0.5 = het del, ~0 = hom del, ~1.5 = het dup
## e-val1: t-test vs genome-wide RD; e-val2: probability under a Gaussian model
## q0: fraction of reads with MAPQ=0 (high q0 = repetitive region)

## Filter: keep high-quality calls (column 9 = q0)
awk '$9 < 0.5' ${SAMPLE}_cnvnator_${BIN}bp.txt > ${SAMPLE}_cnvnator_highQ.txt
## q0 < 0.5 means fewer than 50% of reads have MAPQ=0 (non-repetitive)

## Convert to BED format for downstream intersect
awk '{
  split($2, a, /[:-]/);
  print a[1], a[2], a[3], $1, $3, $4
}' OFS="\t" ${SAMPLE}_cnvnator_highQ.txt > ${SAMPLE}_cnvnator.bed

echo "CNVnator calls: $(wc -l < ${SAMPLE}_cnvnator_highQ.txt)"
echo "Deletions: $(grep deletion ${SAMPLE}_cnvnator_highQ.txt | wc -l)"
echo "Duplications: $(grep duplication ${SAMPLE}_cnvnator_highQ.txt | wc -l)"

Step 3 — SV Callers: Manta and LUMPY

Bash -- Manta (split-read + PE) and SMOOVE/LUMPY structural variant calling
## ================================================================
## Manta v1.6+ -- split-read + paired-end SV calling
## Manta uses local assembly for precise breakpoints
## ================================================================

BAM="sample01_sorted_markdup.bam"
REF="IRGSP1.0_genome.fasta"
OUTDIR="manta_output"

## Step 1: Configure Manta
configManta.py \
  --bam ${BAM} \
  --referenceFasta ${REF} \
  --runDir ${OUTDIR}

## Step 2: Run Manta (uses all available cores automatically)
${OUTDIR}/runWorkflow.py -j 16 -m local

## Main outputs:
## candidateSV.vcf.gz: all candidate SVs (lower quality filter)
## diploidSV.vcf.gz: high-quality diploid SVs (use this)
## candidateSmallIndels.vcf.gz: for GATK HaplotypeCaller as indel candidates

## Step 3: Filter Manta output
## keep only PASS variants; SVs >= 50 bp, plus all translocations (BND)
bcftools view \
  -f PASS \
  -i 'ABS(INFO/SVLEN) >= 50 || INFO/SVTYPE == "BND"' \
  ${OUTDIR}/results/variants/diploidSV.vcf.gz | \
  bcftools sort | bgzip -c > manta_filtered.vcf.gz
tabix -p vcf manta_filtered.vcf.gz

## ================================================================
## SMOOVE (wrapper around LUMPY) -- population-level SV genotyping
## Best for cohort studies; genotypes SVs across all samples jointly
## ================================================================

## Step 1: Call SVs per sample
for BAM_FILE in *.bam; do
  SAMPLE_NAME=$(basename ${BAM_FILE} _sorted_markdup.bam)
  smoove call \
    --outdir smoove_per_sample \
    --name ${SAMPLE_NAME} \
    --fasta ${REF} \
    -p 8 \
    --genotype \
    ${BAM_FILE}
done

## Step 2: Merge all sample VCFs into a unified site list
smoove merge \
  --name rice3k \
  --fasta ${REF} \
  --outdir smoove_merged \
  smoove_per_sample/*.genotyped.vcf.gz

## Step 3: Re-genotype all samples at merged sites
for VCF_FILE in smoove_per_sample/*.genotyped.vcf.gz; do
  SAMPLE_NAME=$(basename ${VCF_FILE} .genotyped.vcf.gz)
  ## --duphold annotates each call with depth-based evidence
  smoove genotype \
    --duphold \
    --name ${SAMPLE_NAME} \
    --outdir smoove_genotyped \
    --fasta ${REF} \
    --vcf smoove_merged/rice3k.sites.vcf.gz \
    ${SAMPLE_NAME}_sorted_markdup.bam
done

## Step 4: Paste into multi-sample VCF
smoove paste \
  --name rice3k_cohort \
  --outdir smoove_final \
  smoove_genotyped/*.vcf.gz

## Output: rice3k_cohort.smoove.square.vcf.gz
## This is a multi-sample SV VCF with genotypes at all sites in all samples
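A quick sanity check on the squared VCF is to count non-reference genotypes per sample. A minimal sketch in plain awk on a toy two-sample VCF (the records are invented; on real data you would stream `bcftools view rice3k_cohort.smoove.square.vcf.gz` into the same awk):

```shell
## Toy two-sample SV VCF (records invented for illustration)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n'   > toy_sv.vcf
printf 'chr1\t1000\tDEL1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=5000\tGT\t0/1\t0/0\n'  >> toy_sv.vcf
printf 'chr1\t9000\tDUP1\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;END=15000\tGT\t1/1\t0/1\n' >> toy_sv.vcf

## Count non-reference, non-missing genotypes in each sample column (10+)
awk -F'\t' '
  /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
  /^#/      { next }
  { for (i = 10; i <= NF; i++)
      if ($i != "0/0" && $i != "./.") nonref[i]++ }
  END { for (i = 10; i in name; i++) print name[i], nonref[i] + 0 }
' toy_sv.vcf
## s1 2
## s2 1
```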

Step 4 — Merging and Filtering Callers

Bash + R -- Intersect CNVnator and Manta calls; calculate false positive rates
## ================================================================
## Merge CNVnator (RD) + Manta (SR+PE) calls
## Variants supported by both methods = high-confidence CNVs
## ================================================================

## Convert Manta VCF to BED (for bedtools intersect)
bcftools view -i 'SVTYPE=="DEL" || SVTYPE=="DUP"' manta_filtered.vcf.gz | \
  bcftools query \
    -f '%CHROM\t%POS\t%END\t%SVTYPE\t%SVLEN\n' | \
  awk '{if($5<0) $5=-$5; print}' OFS="\t" > manta_deldups.bed

## CNVnator BED already created in step 2
## Intersect: require 50% reciprocal overlap (-f 0.5 -r)
bedtools intersect \
  -a sample01_cnvnator.bed \
  -b manta_deldups.bed \
  -f 0.5 -r \
  -wa -wb \
  > confirmed_cnvs.bed

echo "CNVnator calls: $(wc -l < sample01_cnvnator.bed)"
echo "Manta calls: $(wc -l < manta_deldups.bed)"
echo "Confirmed by both (50% overlap): $(wc -l < confirmed_cnvs.bed)"

## ================================================================
## R: CNV quality filter and annotation
## ================================================================
library(dplyr)
cnvs <- read.table("confirmed_cnvs.bed", sep="\t",
  col.names=c("chr","start","end","type","size","norm_RD",
              "chr2","start2","end2","type2","size2"))
cnvs$size_kb <- cnvs$size / 1000

## Quality filters:
## 1. Remove CNVs in repetitive regions (high q0)
## 2. Remove very small CNVs (< 1 kb) -- likely noise
## 3. Remove CNVs in centromeres/telomeres
cnvs_filtered <- cnvs %>%
  filter(size_kb >= 1) %>%
  filter(!grepl("chrUn|random|scaffold", chr))

cat("High-confidence CNVs after filtering:", nrow(cnvs_filtered), "\n")
cat("  Deletions:", sum(cnvs_filtered$type == "deletion"), "\n")
cat("  Duplications:", sum(cnvs_filtered$type == "duplication"), "\n")
cat("  Size range:", round(min(cnvs_filtered$size_kb),1), "-",
    round(max(cnvs_filtered$size_kb),1), "kb\n")
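The -f 0.5 -r criterion can be checked by hand for a single interval pair. A toy calculation (the coordinates are invented):

```shell
## Two candidate intervals (coordinates invented)
A_START=10000; A_END=20000    ## CNVnator call
B_START=12000; B_END=22000    ## Manta call

## Overlap length, clamped at zero
OV_START=$(( A_START > B_START ? A_START : B_START ))
OV_END=$(( A_END < B_END ? A_END : B_END ))
OV=$(( OV_END > OV_START ? OV_END - OV_START : 0 ))

## Reciprocal overlap: the shared span must cover >= 50% of BOTH intervals
awk -v ov="${OV}" -v la=$(( A_END - A_START )) -v lb=$(( B_END - B_START )) '
  BEGIN {
    printf "overlap=%d  fracA=%.2f  fracB=%.2f  pass=%s\n",
      ov, ov/la, ov/lb, (ov/la >= 0.5 && ov/lb >= 0.5) ? "yes" : "no"
  }'
## overlap=8000  fracA=0.80  fracB=0.80  pass=yes
```

Note that the fraction is computed against each interval's own length, which is why a small call nested inside a much larger one fails the reciprocal test.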

Step 5 — CNV Genotyping Across a Population

Bash -- Genotype CNV sites in all samples using SVTyper and duphold
## ================================================================
## Genotype CNV sites across a population
## Two approaches: SVTyper (PE+SR) and duphold (depth validation)
## ================================================================

## Method A: SVTyper -- genotype SV sites in each sample
## First create a merged site VCF from discovery set
for BAM in *.bam; do
  SAMPLE=$(basename ${BAM} _sorted_markdup.bam)
  svtyper \
    --bam ${BAM} \
    --vcf confirmed_svs.vcf \
    --output ${SAMPLE}_svtyped.vcf
done

## Method B: duphold -- annotate depth fold-change at each SV
## duphold integrates read-depth evidence into any SV VCF
for BAM in *.bam; do
  SAMPLE=$(basename ${BAM} _sorted_markdup.bam)
  duphold \
    --threads 8 \
    --vcf confirmed_svs.vcf \
    --bam ${BAM} \
    --fasta reference.fasta \
    --output ${SAMPLE}_duphold.bcf
  bcftools convert -O z ${SAMPLE}_duphold.bcf > ${SAMPLE}_duphold.vcf.gz
  tabix -p vcf ${SAMPLE}_duphold.vcf.gz
done

## duphold key annotations added to the FORMAT field:
## DHFC:  depth fold-change vs the rest of the chromosome
## DHFFC: depth fold-change vs the immediate flanking regions
## DHBFC: depth fold-change vs bins of similar GC content
## Suggested cutoffs: deletions DHFFC < 0.7; duplications DHBFC > 1.3

## Filter using duphold annotations (removes false positives efficiently)
bcftools filter \
  -i '(SVTYPE=="DEL" && FMT/DHFFC < 0.7) || (SVTYPE=="DUP" && FMT/DHBFC > 1.3)' \
  merged_genotyped.vcf.gz > depth_confirmed_svs.vcf
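The fold-change thresholds map directly onto the expected depth ratios: a heterozygous deletion halves depth (~0.5x of the flanks), a heterozygous duplication raises it to ~1.5x. A toy check with invented depths:

```shell
## Invented depths: mean coverage inside the candidate SV vs its flanks
INSIDE=14
FLANK=29

awk -v i="${INSIDE}" -v f="${FLANK}" 'BEGIN {
  fc = i / f                           ## flank-relative fold-change
  if (fc < 0.7)      verdict = "depth supports DEL"
  else if (fc > 1.3) verdict = "depth supports DUP"
  else               verdict = "no depth support"
  printf "fold-change=%.2f -> %s\n", fc, verdict
}'
## fold-change=0.48 -> depth supports DEL
```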

Step 6 — CNV Visualisation in R

R -- CNV genome-wide profile, size distribution, and region zoom
library(ggplot2); library(dplyr)

## Load per-bin read depth (mosdepth with --by <window> produces regions.bed.gz)
depth <- read.table("sample01.regions.bed.gz", header=FALSE,
                    col.names=c("chr","start","end","depth"))
depth$pos_mb <- (depth$start + depth$end) / 2 / 1e6
mean_depth <- mean(depth$depth)
depth$log2r <- log2(depth$depth / mean_depth + 0.001)  ## pseudocount

## Genome-wide CNV profile (one panel per chromosome)
ggplot(depth %>% filter(grepl("^chr[0-9]", chr)), aes(pos_mb, log2r)) +
  geom_point(size=0.2, alpha=0.3, colour="#2171b5") +
  geom_hline(yintercept=0, colour="grey50", linewidth=0.4) +
  geom_hline(yintercept=c(-1, 0.585, 1), linetype="dashed",
             colour=c("red","orange","red"), linewidth=0.3) +
  facet_wrap(~chr, ncol=4, scales="free_x") +
  scale_y_continuous(limits=c(-3, 2.5)) +
  labs(x="Position (Mb)", y="Log2 copy ratio",
       title="Genome-wide copy number profile",
       subtitle="Dashed lines: -1=het del, +0.585=het dup, +1=hom dup") +
  theme_minimal(base_size=9)
ggsave("cnv_genome_profile.png", width=14, height=10, dpi=200)

## CNV size distribution
## confirmed_cnvs.bed has 11 columns (6 CNVnator + 5 Manta); name the first five
cnvs <- read.table("confirmed_cnvs.bed", header=FALSE)
colnames(cnvs)[1:5] <- c("chr","start","end","type","size")
cnvs$size_kb <- cnvs$size / 1000
cnvs$type2 <- ifelse(grepl("deletion", cnvs$type, ignore.case=TRUE), "Deletion", "Duplication")
ggplot(cnvs, aes(size_kb, fill=type2)) +
  geom_histogram(bins=50, colour="white", linewidth=0.2, position="identity", alpha=0.7) +
  scale_x_log10(name="CNV size (kb, log scale)") +
  scale_fill_manual(values=c("Deletion"="#e41a1c","Duplication"="#377eb8")) +
  labs(y="Count", fill="CNV type", title="CNV size distribution") +
  theme_bw(base_size=12)
ggsave("cnv_size_distribution.png", width=7, height=4, dpi=200)

Common Pitfalls

Problem | Symptom | Fix
Repetitive regions inflate CNV calls | Hundreds of CNVs in satellite DNA, centromeres, telomeres | Remove calls with q0 > 0.5 from CNVnator output. Mask repeat regions with bedtools subtract using a RepeatMasker BED. Check CNV coordinates against the UCSC RepeatMasker track.
GC bias creates false CNV patterns | Systematic deletions in high-GC promoters; duplications in AT-rich regions | CNVnator and Control-FREEC apply GC correction. For custom pipelines use LOESS regression: fit depth ~ GC%, then divide each bin depth by the predicted value.
Unequal sequencing depth between samples | The sample with lower depth has more false deletion calls | Always normalise to genome-wide mean depth per sample, not globally. Never compare raw read counts between samples.
PCR duplicates inflate duplication calls | Tandem duplications appear larger or at higher copy number | Always mark and exclude PCR duplicates (Picard MarkDuplicates) before any CNV analysis. For amplicon libraries, use UMI-aware deduplication.
Manta misses tandem duplications | Large tandem dups detected by CNVnator but not Manta | Tandem duplications are harder for split-read callers because both breakpoints lie in similar sequence. Take the union (OR) of CNVnator and Manta for duplications, but the intersection (AND) for deletions.
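The LOESS idea in the GC-bias row can be roughly approximated in plain awk by normalising each bin to the mean depth of its GC class — a crude stand-in for the fitted curve, shown here on invented bins (columns: bin ID, depth, GC%):

```shell
## Invented per-bin depths with their GC percentages
printf 'bin1\t30\t40\nbin2\t28\t40\nbin3\t20\t60\nbin4\t22\t60\n' > bins.txt

## Two passes over the same file: first accumulate mean depth per GC class,
## then divide each bin depth by its class mean
awk -F'\t' '
  NR == FNR { sum[$3] += $2; n[$3]++; next }
  { printf "%s\t%.3f\n", $1, $2 / (sum[$3] / n[$3]) }
' bins.txt bins.txt
## corrected ratios: bin1 1.034, bin2 0.966, bin3 0.952, bin4 1.048
```

After this correction a bin's ratio reflects copy number rather than GC content; a real pipeline would fit a smooth depth ~ GC curve instead of class means.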