Detect deletions, duplications, and complex structural variants from WGS data. Covers read-depth analysis, split-read and paired-end methods, CNVnator, Manta, LUMPY, normalization formulas, and live depth-ratio calculators.

Copy Number Variations (CNVs) are genomic regions that are deleted or duplicated relative to a reference genome. They range from a few hundred base pairs to entire chromosome arms, and account for a larger fraction of genetic diversity between individuals than SNPs.
| CNV Type | Copy number change | Effect | Detection signal |
|---|---|---|---|
| Deletion (DEL) | Loss of one or both copies (CN 0 or 1 vs expected 2) | Haploinsufficiency, loss of function | Decreased read depth, paired-end discordance |
| Duplication (DUP) | Gain of one or more copies (CN 3+) | Gene dosage effect, neo-functionalization | Increased read depth, tandem split reads |
| Inversion (INV) | No copy change; orientation flipped | Gene disruption if breakpoint in gene; position effect | Paired-end orientation discordance |
| Translocation (TRA/BND) | Chromosomal segment moves to new location | Fusion genes, altered regulation | Inter-chromosomal paired-end discordance |
| Complex SV | Multiple rearrangements at same locus | Variable; often severe | Combination of all signals; difficult to resolve |
| Signal | Abbreviation | How it works | Best for |
|---|---|---|---|
| Read Depth (RD) | RD/RC | Count reads per bin; normalize; segment into copy number states. Heterozygous deletions show ~0.5x the normal coverage, homozygous deletions ~0x; duplications show 1.5x (het) or 2x (hom). | Large CNVs (>1 kb); tandem duplications; reliable for copy number estimation |
| Paired-End (PE) | PE/RP | Read pairs that map too far apart, too close, or in the wrong orientation signal a structural variant at their mapping positions. | Breakpoint detection; inversions; translocations |
| Split Read (SR) | SR/SA | Reads that span a breakpoint are split: one part maps to one location, the other maps elsewhere. Base-pair resolution breakpoints. | Precise breakpoint calling; small SVs; insertion detection |
| Local Assembly (LA) | AS | Assemble reads de novo in candidate regions; compare assembled contigs to reference to detect variants directly. | Complex SVs; mobile elements; most accurate method |
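The PE signal is usually extracted up front: the LUMPY/smoove documentation pulls discordant pairs with `samtools view -b -F 1294`. A quick sketch of what that exclude mask contains (bit values from the SAM flag specification):

```shell
# samtools view -b -F 1294 sample.bam > sample.discordants.bam
# keeps only discordant pairs by excluding every read whose flag matches
# any bit of the mask built below (SAM flag bits):
mask=0
for bit in 2 4 8 256 1024; do  # proper pair, unmapped, mate unmapped, secondary, duplicate
  mask=$((mask + bit))
done
echo "exclude mask: ${mask}"  # -> 1294
```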
Read-depth CNV detection normalises coverage per bin and then segments the normalised log-ratio to identify copy-number states. Understanding the normalisation formula helps you interpret and troubleshoot results.
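A minimal sketch of that normalisation on made-up bins (`bin_depths.txt` is a toy file; real per-bin depths would come from e.g. mosdepth): divide each bin's depth by the genome-wide mean and take log2, so a diploid bin sits near 0, a heterozygous deletion near -1, and a heterozygous duplication near +0.58.

```shell
# Toy bins: chrom, bin start, raw depth (~30x sample; bin 4 is a het
# deletion at ~15x, bin 5 a het duplication at ~45x)
cat > bin_depths.txt <<'EOF'
chr1 0 30
chr1 500 31
chr1 1000 29
chr1 1500 15
chr1 2000 45
EOF
awk '{ line[NR] = $0; depth[NR] = $3; sum += $3 }
     END {
       mean = sum / NR
       for (i = 1; i <= NR; i++) {
         r = depth[i] / mean            # ~1.0 diploid, ~0.5 het del, ~1.5 het dup
         printf "%s\t%.2f\t%.2f\n", line[i], r, log(r) / log(2)
       }
     }' bin_depths.txt
```

On real data, add a small pseudocount before the log so homozygous-deletion bins with depth 0 do not produce -Inf.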
## ================================================================
## CNVnator v0.4.1+ -- read-depth segmentation
## Works for germline CNVs in diploid organisms
## Required: sorted, duplicate-marked BAM; reference FASTA split by chr
## ================================================================
BAM="sample01_sorted_markdup.bam"
REF_DIR="reference_chromosomes/" ## one FASTA file per chromosome
SAMPLE="sample01"
BIN=500 ## bin size in bp: 100-500 bp for 30x WGS
## Step 1: Extract read depth histogram from BAM
cnvnator -root ${SAMPLE}.root \
-tree ${BAM} \
-chrom $(seq 1 12 | sed 's/^/chr/') ## chr1..chr12 (rice); adjust to your genome
## Step 2: Generate read depth histograms (one per bin size)
cnvnator -root ${SAMPLE}.root \
-his ${BIN} \
-d ${REF_DIR} ## reference chromosome files for GC correction
## Step 3: Compute statistics per bin (mean, variance)
cnvnator -root ${SAMPLE}.root \
-stat ${BIN}
## Step 4: RD signal partitioning (segmentation into CN states)
cnvnator -root ${SAMPLE}.root \
-partition ${BIN}
## Step 5: Call CNVs from segments
cnvnator -root ${SAMPLE}.root \
-call ${BIN} > ${SAMPLE}_cnvnator_${BIN}bp.txt
## Output format (9 columns):
## CNV_type coordinates size normalised_RD e-val1 e-val2 e-val3 e-val4 q0
## CNV_type: deletion/duplication
## normalised_RD: ~0.5 = het del, ~0 = hom del, ~1.5 = het dup
## e-val1: t-test; e-val2: Gaussian tail probability of the RD values;
## e-val3/e-val4: the same tests restricted to the middle of the call
## q0: fraction of reads with MAPQ=0 (high q0 = repetitive region)
## Filter: keep high-quality calls (q0 is column 9)
awk '$9 < 0.5' ${SAMPLE}_cnvnator_${BIN}bp.txt > ${SAMPLE}_cnvnator_highQ.txt
## q0 < 0.5 means fewer than 50% of reads have MAPQ=0 (non-repetitive)
## Convert to BED format for downstream intersect
awk '{
split($2, a, /[:-]/);
print a[1], a[2]-1, a[3], $1, $3, $4 ## BED is 0-based half-open; shift start
}' OFS="\t" ${SAMPLE}_cnvnator_highQ.txt > ${SAMPLE}_cnvnator.bed
echo "CNVnator calls: $(wc -l < ${SAMPLE}_cnvnator_highQ.txt)"
echo "Deletions: $(grep deletion ${SAMPLE}_cnvnator_highQ.txt | wc -l)"
echo "Duplications: $(grep duplication ${SAMPLE}_cnvnator_highQ.txt | wc -l)"
## ================================================================
## Manta v1.6+ -- split-read + paired-end SV calling
## Manta uses local assembly for precise breakpoints
## ================================================================
BAM="sample01_sorted_markdup.bam"
REF="IRGSP1.0_genome.fasta"
OUTDIR="manta_output"
## Step 1: Configure Manta
configManta.py \
--bam ${BAM} \
--referenceFasta ${REF} \
--runDir ${OUTDIR}
## Step 2: Run Manta (-j sets parallel jobs; -m local runs on this machine)
${OUTDIR}/runWorkflow.py -j 16 -m local
## Main outputs:
## candidateSV.vcf.gz: all candidate SVs (lower quality filter)
## diploidSV.vcf.gz: high-quality diploid SVs (use this)
## candidateSmallIndels.vcf.gz: pass to Strelka via --indelCandidates
## Step 3: Filter Manta output: PASS records only; SVs >= 50 bp plus all
## translocations (Manta reports deletion SVLEN as negative, hence ABS)
bcftools view \
-f PASS \
-i 'ABS(SVLEN) >= 50 || SVTYPE == "BND"' \
${OUTDIR}/results/variants/diploidSV.vcf.gz | \
bcftools sort | bgzip -c > manta_filtered.vcf.gz
tabix -p vcf manta_filtered.vcf.gz
## ================================================================
## SMOOVE (wrapper around LUMPY) -- population-level SV genotyping
## Best for cohort studies; genotypes SVs across all samples jointly
## ================================================================
## Step 1: Call SVs per sample
for BAM_FILE in *.bam; do
SAMPLE_NAME=$(basename ${BAM_FILE} _sorted_markdup.bam)
smoove call \
--outdir smoove_per_sample \
--name ${SAMPLE_NAME} \
--fasta ${REF} \
-p 8 \
--genotype \
${BAM_FILE}
done
## Step 2: Merge all sample VCFs into a unified site list
smoove merge \
--name rice3k \
--fasta ${REF} \
--outdir smoove_merged \
smoove_per_sample/*.genotyped.vcf.gz
## Step 3: Re-genotype all samples at merged sites
for VCF_FILE in smoove_per_sample/*.genotyped.vcf.gz; do
SAMPLE_NAME=$(basename ${VCF_FILE} .genotyped.vcf.gz)
## --duphold annotates each genotype with depth-based quality
smoove genotype \
--duphold \
--name ${SAMPLE_NAME} \
--outdir smoove_genotyped \
--fasta ${REF} \
--vcf smoove_merged/rice3k.sites.vcf.gz \
${SAMPLE_NAME}_sorted_markdup.bam
done
## Step 4: Paste into multi-sample VCF
smoove paste \
--name rice3k_cohort \
--outdir smoove_final \
smoove_genotyped/*.vcf.gz
## Output: rice3k_cohort.smoove.square.vcf.gz
## This is a multi-sample SV VCF with genotypes at all sites in all samples
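From the squared VCF, per-site carrier counts fall out of a `bcftools query` one-liner. The query is shown as a comment (it needs the real cohort VCF); the awk counting step runs here on a made-up three-sample extract:

```shell
# On the real file you would run:
#   bcftools query -f '%CHROM\t%POS\t%INFO/SVTYPE[\t%GT]\n' \
#       smoove_final/rice3k_cohort.smoove.square.vcf.gz
# Toy extract of that output (three samples):
cat > toy_gts.txt <<'EOF'
chr1 10523 DEL 0/1 0/0 1/1
chr1 88410 DUP 0/0 0/0 0/1
chr2 4021 DEL ./. 0/0 0/0
EOF
awk '{
  nonref = 0
  for (i = 4; i <= NF; i++)               # one column per sample genotype
    if ($i != "0/0" && $i != "./.") nonref++
  print $1, $2, $3, nonref                # carriers of each SV
}' OFS="\t" toy_gts.txt
```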
## ================================================================
## Merge CNVnator (RD) + Manta (SR+PE) calls
## Variants supported by both methods = high-confidence CNVs
## ================================================================
## Convert Manta VCF to BED (for bedtools intersect)
bcftools view -i 'SVTYPE=="DEL" || SVTYPE=="DUP"' manta_filtered.vcf.gz | \
bcftools query \
-f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE\t%INFO/SVLEN\n' | \
awk '{if($5<0) $5=-$5; print}' OFS="\t" > manta_deldups.bed
## CNVnator BED (sample01_cnvnator.bed) was created in the CNVnator section above
## Intersect: require 50% reciprocal overlap (-f 0.5 -r)
bedtools intersect \
-a sample01_cnvnator.bed \
-b manta_deldups.bed \
-f 0.5 -r \
-wa -wb \
> confirmed_cnvs.bed
echo "CNVnator calls: $(wc -l < sample01_cnvnator.bed)"
echo "Manta calls: $(wc -l < manta_deldups.bed)"
echo "Confirmed by both (50% overlap): $(wc -l < confirmed_cnvs.bed)"
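What `-f 0.5 -r` actually enforces: the overlap must span at least 50% of both intervals, not just one. A standalone awk check on two hypothetical interval pairs (columns: A start/end, B start/end):

```shell
cat > pairs.txt <<'EOF'
1000 5000 2000 6000
1000 5000 4500 20000
EOF
awk '{
  ov = ($2 < $4 ? $2 : $4) - ($1 > $3 ? $1 : $3)   # overlap length
  if (ov < 0) ov = 0
  fa = ov / ($2 - $1); fb = ov / ($4 - $3)         # fraction of each interval covered
  verdict = (fa >= 0.5 && fb >= 0.5) ? "PASS" : "FAIL"
  print $0, verdict
}' OFS="\t" pairs.txt
```

The second pair overlaps, but the overlap covers only 12.5% of interval A and ~3% of B, so it fails the reciprocal test; without `-r`, a tiny call nested inside a huge one would count as confirmation.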
## ================================================================
## R: CNV quality filter and annotation
## ================================================================
library(dplyr)
cnvs <- read.table("confirmed_cnvs.bed", sep="\t",
col.names=c("chr","start","end","type","size","norm_RD",
"chr2","start2","end2","type2","size2"))
cnvs$size_kb <- cnvs$size / 1000
## Quality filters (the q0/repeat filter was already applied in the CNVnator step):
## 1. Remove very small CNVs (< 1 kb) -- likely noise
## 2. Remove CNVs on unplaced scaffolds
## (Also consider masking centromeres/telomeres with a region BED.)
cnvs_filtered <- cnvs %>%
filter(size_kb >= 1) %>%
filter(!grepl("chrUn|random|scaffold", chr))
cat("High-confidence CNVs after filtering:", nrow(cnvs_filtered), "\n")
cat(" Deletions:", sum(cnvs_filtered$type == "deletion"), "\n")
cat(" Duplications:", sum(cnvs_filtered$type == "duplication"), "\n")
cat(" Size range:", round(min(cnvs_filtered$size_kb),1), "-",
round(max(cnvs_filtered$size_kb),1), "kb\n")
## ================================================================
## Genotype CNV sites across a population
## Two approaches: SVTyper (PE+SR) and duphold (depth validation)
## ================================================================
## Method A: SVTyper -- genotype SV sites in each sample
## confirmed_svs.vcf is the merged discovery VCF built from the calls above
for BAM in *.bam; do
SAMPLE=$(basename ${BAM} _sorted_markdup.bam)
svtyper \
-B ${BAM} \
-i confirmed_svs.vcf \
> ${SAMPLE}_svtyped.vcf
done
## Method B: duphold -- annotate depth fold-change at each SV
## duphold integrates read-depth evidence into any SV VCF
for BAM in *.bam; do
SAMPLE=$(basename ${BAM} _sorted_markdup.bam)
duphold \
--threads 8 \
--vcf confirmed_svs.vcf \
--bam ${BAM} \
--fasta reference.fasta \
--output ${SAMPLE}_duphold.bcf
bcftools convert -O z ${SAMPLE}_duphold.bcf > ${SAMPLE}_duphold.vcf.gz
tabix -p vcf ${SAMPLE}_duphold.vcf.gz
done
## duphold key annotations added to the FORMAT field:
## DHFC: depth fold-change vs the rest of the chromosome
## DHBFC: fold-change vs bins of similar GC content (works best for duplications)
## DHFFC: fold-change vs the immediate flanking regions (works best for deletions)
## Merge the per-sample duphold VCFs (sites share SV IDs, so merge by ID),
## then filter on the duphold annotations to remove false positives
bcftools merge -m id -O z -o merged_genotyped.vcf.gz *_duphold.vcf.gz
tabix -p vcf merged_genotyped.vcf.gz
bcftools filter \
-i '(SVTYPE=="DEL" && FMT/DHFFC < 0.7) || (SVTYPE=="DUP" && FMT/DHBFC > 1.3)' \
merged_genotyped.vcf.gz > depth_confirmed_svs.vcf
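A sanity check of the threshold logic on made-up fold-change values (the duphold README recommends DHFFC for deletions and DHBFC for duplications); `toy_duphold.txt` columns are SVTYPE, DHFFC, DHBFC:

```shell
cat > toy_duphold.txt <<'EOF'
DEL 0.48 0.51
DEL 0.95 1.01
DUP 1.52 1.49
DUP 1.05 1.02
EOF
# A true deletion should lose depth (DHFFC < 0.7); a true duplication
# should gain depth (DHBFC > 1.3). Rows 2 and 4 look like false positives.
awk '{
  keep = ($1 == "DEL" && $2 < 0.7) || ($1 == "DUP" && $3 > 1.3)
  print $0, (keep ? "depth-confirmed" : "dropped")
}' OFS="\t" toy_duphold.txt
```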
## ================================================================
## R: genome-wide copy number profile and CNV size distribution
## ================================================================
library(ggplot2); library(dplyr)
## Load per-bin read depth (e.g. from mosdepth --by 10000)
depth <- read.table("sample01.regions.bed.gz", header=FALSE,
col.names=c("chr","start","end","depth"))
depth$pos_mb <- (depth$start + depth$end) / 2 / 1e6
mean_depth <- mean(depth$depth)
depth$log2r <- log2(depth$depth / mean_depth + 0.001) ## pseudocount
## Genome-wide CNV profile (one panel per chromosome)
ggplot(depth %>% filter(grepl("^chr[0-9]", chr)), aes(pos_mb, log2r)) +
geom_point(size=0.2, alpha=0.3, colour="#2171b5") +
geom_hline(yintercept=0, colour="grey50", linewidth=0.4) +
geom_hline(yintercept=c(-1, 0.585, 1), linetype="dashed",
colour=c("red","orange","red"), linewidth=0.3) +
facet_wrap(~chr, ncol=4, scales="free_x") +
scale_y_continuous(limits=c(-3, 2.5)) +
labs(x="Position (Mb)", y="Log2 copy ratio",
title="Genome-wide copy number profile",
subtitle="Dashed lines: -1=het del, +0.585=het dup, +1=hom dup") +
theme_minimal(base_size=9)
ggsave("cnv_genome_profile.png", width=14, height=10, dpi=200)
## CNV size distribution
## Keep only the CNVnator half of the intersect output (first 5 columns)
cnvs <- read.table("confirmed_cnvs.bed", header=FALSE)[, 1:5]
colnames(cnvs) <- c("chr","start","end","type","size")
cnvs$size_kb <- cnvs$size / 1000
cnvs$type2 <- ifelse(grepl("deletion", cnvs$type, ignore.case=TRUE), "Deletion", "Duplication")
ggplot(cnvs, aes(size_kb, fill=type2)) +
geom_histogram(bins=50, colour="white", linewidth=0.2, position="identity", alpha=0.7) +
scale_x_log10(name="CNV size (kb, log scale)") +
scale_fill_manual(values=c("Deletion"="#e41a1c","Duplication"="#377eb8")) +
labs(y="Count", fill="CNV type", title="CNV size distribution") +
theme_bw(base_size=12)
ggsave("cnv_size_distribution.png", width=7, height=4, dpi=200)
| Problem | Symptom | Fix |
|---|---|---|
| Repetitive regions inflate CNV calls | Hundreds of CNVs in satellite DNA, centromeres, telomeres | Remove calls with q0 > 0.5 (i.e. keep q0 < 0.5) in CNVnator output. Mask repeat regions with bedtools subtract using a RepeatMasker BED. Check CNV coordinates against a RepeatMasker track in a genome browser. |
| GC bias creates false CNV pattern | Systematic deletions in high-GC promoters; duplications in AT-rich regions | CNVnator and Control-FREEC apply GC correction. For custom pipelines use LOESS regression: fit depth ~ GC%, then divide each bin depth by the predicted value. |
| Unequal sequencing depth between samples | Sample with lower depth has more false deletion calls | Always normalise to genome-wide mean depth per sample, not globally. Never compare raw read counts between samples. |
| PCR duplicates inflate duplication calls | Tandem duplications appear larger or at higher copy number | Always mark and exclude PCR duplicates (Picard MarkDuplicates) before any CNV analysis. For amplicon libraries, use UMI-aware deduplication. |
| Manta misses tandem duplications | Large tandem dups detected by CNVnator but not Manta | Tandem duplications are harder for split-read callers because both breakpoints are at similar sequences. Use CNVnator + Manta union (OR) for duplications, but AND (intersection) for deletions. |
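The LOESS idea in the GC-bias row can be approximated with a two-pass awk: bucket bins by GC%, treat each bucket's mean depth as the fitted value, and divide. A toy sketch (`gc_bins.txt` is invented; real input would pair mosdepth bins with per-bin GC fractions):

```shell
# Toy bins: chrom, start, GC fraction, raw depth. High-GC bins run deeper
# here, mimicking GC bias.
cat > gc_bins.txt <<'EOF'
chr1 0 0.35 28
chr1 500 0.36 30
chr1 1000 0.55 42
chr1 1500 0.55 44
chr1 2000 0.35 29
EOF
awk '
  NR == FNR { g = int($3 * 20); sum[g] += $4; n[g]++; next }  # pass 1: mean depth per 5% GC bucket
  {
    g = int($3 * 20)
    pred = sum[g] / n[g]                                      # fitted depth for this GC bucket
    printf "%s\t%s\t%.2f\n", $1, $2, $4 / pred                # corrected ratio, ~1.0 everywhere
  }
' gc_bins.txt gc_bins.txt
```

After correction every bin sits near 1.0 even though raw depths ranged 28-44; a real pipeline would fit LOESS over continuous GC rather than coarse buckets, but the divide-by-fitted-value step is the same.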