End-to-end ATAC-seq and ChIP-seq analysis: QC, alignment, peak calling (MACS3), IDR reproducibility, motif analysis (HOMER/MEME), differential accessibility (DESeq2), and bigWig normalisation. Live FRiP calculator and peak signal spreadsheets included.

| Feature | ATAC-seq | ChIP-seq |
|---|---|---|
| What it measures | Chromatin accessibility (open vs closed chromatin) | Binding of a specific protein (transcription factor, histone mark) to DNA |
| How it works | Tn5 transposase cuts and tags open chromatin; closed chromatin is inaccessible | Immunoprecipitation of cross-linked protein-DNA; sequencing of enriched fragments |
| Starting material | 50,000-100,000 fresh nuclei; works from fresh/frozen tissue | Millions of fixed cells; requires antibody of interest |
| Controls needed | No IP control needed; mitochondrial DNA % is a QC proxy | Input DNA control (matched, same chromatin without IP) REQUIRED |
| Peak types | Two types: nucleosome-free regions (NFR, <200 bp) and nucleosomal (200-1000 bp) | TF binding: narrow peaks (~100-300 bp); histone marks: broad peaks (1-50 kb) |
| Key QC metric | FRiP (fraction of reads in peaks) >20%; TSS enrichment >7 | FRiP >1% (TF) or >5% (histone); input-normalised peak enrichment |
## ================================================================
## ATAC-seq / ChIP-seq alignment pipeline
## Tools: Trim Galore, Bowtie2, samtools, Picard, deepTools
## ================================================================
SAMPLE="sample_ATAC_rep1"
R1="${SAMPLE}_R1.fastq.gz"
R2="${SAMPLE}_R2.fastq.gz"
REF_INDEX="bowtie2_index/genome" ## pre-built bowtie2 index
CHROMSIZES="genome.chrom.sizes"
## Step 1: Adapter trimming
## ATAC-seq uses Nextera adapters; ChIP-seq typically uses TruSeq
trim_galore \
--paired \
--nextera \ ## Nextera adapters for ATAC-seq
--cores 8 \
--fastqc \
-o trimmed/ \
${R1} ${R2}
## For ChIP-seq: remove --nextera (TruSeq adapters auto-detected)
## Step 2: Alignment with Bowtie2
## --very-sensitive: more accurate; -X 2000: allow up to 2 kb insert (nucleosomal)
bowtie2 \
-p 16 \
--very-sensitive \
-X 2000 \
-x ${REF_INDEX} \
-1 trimmed/${SAMPLE}_R1_val_1.fq.gz \
-2 trimmed/${SAMPLE}_R2_val_2.fq.gz \
2> ${SAMPLE}_bowtie2.log | \
samtools view -bS -q 30 | \ ## MAPQ >= 30 (unique mappers only)
samtools sort -@ 8 \
-o ${SAMPLE}_sorted.bam
samtools index ${SAMPLE}_sorted.bam
## Alignment rate check
grep "overall alignment rate" ${SAMPLE}_bowtie2.log
## Expect: ATAC-seq >85%, ChIP-seq >80%
## Step 3: Remove PCR duplicates
picard MarkDuplicates \
INPUT=${SAMPLE}_sorted.bam \
OUTPUT=${SAMPLE}_dedup.bam \
METRICS_FILE=${SAMPLE}_dup_metrics.txt \
REMOVE_DUPLICATES=true \ ## set false to only mark; true to remove
VALIDATION_STRINGENCY=LENIENT
samtools index ${SAMPLE}_dedup.bam
## Duplication rate: expect ATAC-seq 20-40%; ChIP-seq 10-50%
grep "Unknown Library" ${SAMPLE}_dup_metrics.txt | head -2
## Step 4: ATAC-seq ONLY -- remove mitochondrial reads
## Mitochondrial DNA is highly accessible and inflates peak numbers
samtools view -h ${SAMPLE}_dedup.bam | \
grep -v "chrM\|chrMT\|MT" | \
samtools sort -O bam \
-o ${SAMPLE}_noMT.bam
samtools index ${SAMPLE}_noMT.bam
MT_reads=$(samtools view -c -f 1 ${SAMPLE}_dedup.bam chrM 2>/dev/null)
total_reads=$(samtools view -c -F 4 ${SAMPLE}_dedup.bam)
echo "MT reads fraction: $(echo "scale=3; ${MT_reads}/${total_reads}" | bc)"
## Expect: MT fraction < 20% for ATAC-seq; > 30% suggests poor nuclei prep
## Step 5: ATAC-seq ONLY -- shift reads by +4 / -5 bp (Tn5 cut correction)
## The Tn5 transposase cuts and inserts sequencing adapters at both ends of
## an accessible region. The actual cut site is offset by 4 bp (+ strand)
## and 5 bp (- strand) from the read start. deepTools alignmentSieve corrects this.
alignmentSieve \
--ATACshift \ ## apply +4/-5 Tn5 shift
--bam ${SAMPLE}_noMT.bam \
--outFile ${SAMPLE}_shifted.bam \
--numberOfProcessors 8
samtools sort -O bam -@ 8 -o ${SAMPLE}_shifted_sorted.bam ${SAMPLE}_shifted.bam
samtools index ${SAMPLE}_shifted_sorted.bam
## Step 6: Generate normalised bigWig for visualisation (IGV, UCSC)
bamCoverage \
--bam ${SAMPLE}_shifted_sorted.bam \
--outFileName ${SAMPLE}.bw \
--normalizeUsing RPGC \ ## reads per genomic content (depth-normalised)
--effectiveGenomeSize 2300000000 \ ## rice genome effective size
--extendReads \ ## extend PE reads to fragment length
--binSize 10 \
--numberOfProcessors 16
ATAC-seq has two unique QC metrics that have no equivalent in ChIP-seq: the fragment size distribution and the TSS enrichment score.
| QC Metric | What it shows | Good value | Bad value |
|---|---|---|---|
| Fragment size distribution | Nucleosomal ladder: peaks at ~200 bp (mono), ~400 bp (di), ~600 bp (tri-nucleosomal). NFR <150 bp should be the dominant peak. | Clear ladder with NFR peak dominant (>50% of reads) | No ladder (bad transposition) or only large fragments (over-digestion or poor nuclei) |
| TSS enrichment | Ratio of read density at TSSs vs flanking regions. Active genes have accessible chromatin around their TSSs. | >7 for human/mouse; >4 for most organisms | <3: poor signal-to-noise; nuclei were too fragmented or sample had low viability |
| FRiP score | Fraction of reads in peaks: reads_in_peaks / total_mapped_reads | >20% for ATAC; >1% for TF ChIP | <5%: sample failed or wrong antibody (ChIP) |
| Mitochondrial fraction | % of reads mapping to chrM | <20% | >40%: nuclei preparation was poor; cytoplasmic contamination |
## ================================================================
## ATAC-seq QC: fragment size distribution and TSS enrichment
## Using ATACseqQC package (Bioconductor)
## ================================================================
## BiocManager::install("ATACseqQC")
library(ATACseqQC); library(ggplot2)
BAM <- "sample_ATAC_rep1_shifted_sorted.bam"
GENOME <- "BSgenome.Osativa.IRGSP1" ## or hg38, mm10, etc.
## Fragment size distribution
fragSize <- fragSizeDist(BAM, bamFile.labels="ATAC-rep1")
## A well-prepared ATAC-seq sample shows a characteristic nucleosomal ladder:
## ~100 bp (NFR), ~200 bp (mono-nucleosomal), ~400 bp (di-nucleosomal), etc.
## TSS enrichment score with ATACseqQC
library(TxDb.Hsapiens.UCSC.hg38.knownGene) ## or equivalent for your organism
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
seqlev <- paste0("chr", c(1:22, "X", "Y"))
## Get TSS positions
txs <- transcripts(txdb, filter=list(tx_chrom=seqlev))
pt <- SSreg(BAM, txs=txs, genome=GENOME, outPath="TSS_enrichment",
seqlev=seqlev)
## Output: TSS_enrichment.pdf with enrichment profile
## Quick manual TSS enrichment with deepTools
## computeMatrix reference-point \
## --referencePoint TSS \
## --beforeRegionStartLength 2000 \
## --afterRegionStartLength 2000 \
## --binSize 10 \
## --skipZeros \
## -S sample_ATAC.bw \
## -R genes.bed \
## -o tss_matrix.gz
## plotHeatmap -m tss_matrix.gz -out tss_heatmap.png \
## --colorMap RdBu_r --sortUsing sum
MACS3 (Model-based Analysis of ChIP-seq) identifies genomic regions with significantly enriched read coverage. The key concept is the fold enrichment: reads at a peak vs the local background and genome-wide background.
## ================================================================ ## MACS3 peak calling ## ATAC-seq: no control, --nomodel, --shift -100 --extsize 200 ## ChIP-seq TF: with input control, narrow peaks ## ChIP-seq histone: broad peaks with --broad flag ## ================================================================ ## --- ATAC-seq peak calling --- macs3 callpeak \ -t sample_ATAC_rep1_shifted_sorted.bam \ -f BAMPE \ ## paired-end BAM with fragment coordinates --nomodel \ ## no model building (Tn5 shift already applied) --nolambda \ ## disable local lambda correction for ATAC --keep-dup all \ ## duplicates already removed in alignment step -g 3.8e8 \ ## effective genome size (rice: 3.8e8; human: 2.7e9) --outdir macs3_atac/ \ -n sample_ATAC_rep1 \ -B \ ## output bedGraph for visualisation -q 0.05 \ ## q-value (FDR) threshold for peak calling 2> macs3_atac/sample_ATAC_rep1_macs3.log ## Peak output files: ## _peaks.narrowPeak: peak coordinates + summit + -log10(q-value) + fold enrichment ## _summits.bed: single-base summit positions (best for motif analysis) ## _control_lambda.bdg, _treat_pileup.bdg: bedGraph tracks echo "ATAC peaks called: $(wc -l < macs3_atac/sample_ATAC_rep1_peaks.narrowPeak)" ## --- ChIP-seq TF narrow peaks (with input control) --- macs3 callpeak \ -t chip_TF_rep1.bam \ -c chip_input_rep1.bam \ ## matched input control (REQUIRED for ChIP) -f BAMPE \ -g 2.7e9 \ --outdir macs3_chip_TF/ \ -n TF_rep1 \ -q 0.01 \ ## stricter threshold for ChIP-seq --call-summits ## output summit positions for TF binding ## --- ChIP-seq histone broad peaks (H3K27me3, H3K36me3 etc.) --- macs3 callpeak \ -t chip_H3K27me3_rep1.bam \ -c chip_input_rep1.bam \ -f BAMPE \ --broad \ ## broad peak calling mode --broad-cutoff 0.1 \ ## q-value for broad region -g 2.7e9 \ --outdir macs3_chip_broad/ \ -n H3K27me3_rep1 ## Step: filter blacklisted regions (known ENCODE artefact regions) ## Download from: https://github.com/Boyle-Lab/Blacklist bedtools intersect \ -v \ ## -v = keep entries NOT overlapping blacklist -a macs3_atac/sample_ATAC_rep1_peaks.narrowPeak \ -b hg38_blacklist_v2.bed > sample_ATAC_rep1_peaks_filtered.narrowPeak echo "Peaks after blacklist filter: $(wc -l < sample_ATAC_rep1_peaks_filtered.narrowPeak)"
IDR (Li et al. 2011) is the ENCODE standard for measuring reproducibility between replicates. It ranks peaks by signal in both replicates and tests whether the top-ranked peaks are consistently high in both -- a truly reproducible peak should be highly ranked in both replicates.
## ================================================================
## IDR (Irreproducible Discovery Rate) analysis
## Requires: idr package (pip install idr)
## Input: narrowPeak files from two replicates, sorted by -log10(p-value)
## ================================================================
## Sort peaks by p-value (column 8 of narrowPeak = -log10 q-value)
sort -k8,8rn macs3_atac/sample_ATAC_rep1_peaks.narrowPeak \
> rep1_sorted.narrowPeak
sort -k8,8rn macs3_atac/sample_ATAC_rep2_peaks.narrowPeak \
> rep2_sorted.narrowPeak
## Run IDR
idr \
--samples rep1_sorted.narrowPeak rep2_sorted.narrowPeak \
--input-file-type narrowPeak \
--rank p.value \ ## rank by p-value (most common for ATAC-seq)
--output-file idr_output.txt \
--output-file-type narrowPeak \
--idr-threshold 0.05 \ ## IDR < 0.05 = reproducible peak
--plot \ ## generate IDR diagnostic plot
--log-output-file idr.log
## Count IDR-passing peaks
IDR_PEAKS=$(wc -l < idr_output.txt)
echo "IDR-reproducible peaks (IDR < 0.05): ${IDR_PEAKS}"
## Expect: for a good ATAC-seq experiment, 60-80% of peaks in each replicate
## should pass IDR
## IDR output column 12 = IDR score; filter at IDR < 0.05 (local IDR)
awk '$12 < 0.05' idr_output.txt > idr_reproducible_peaks.bed
echo "High-reproducibility peaks: $(wc -l < idr_reproducible_peaks.bed)"
## Create consensus peak set (union of IDR peaks from both replicates)
## Standard: extend peaks to 500 bp centred on summit for consistency
awk 'BEGIN{OFS="\t"} {
summit = $2 + $10; ## $10 = summit offset from peak start
start = summit - 250;
end = summit + 250;
if (start < 0) start = 0;
print $1, start, end, $4, $5
}' idr_reproducible_peaks.bed | \
sort -k1,1 -k2,2n | \
bedtools merge -i - > consensus_peaks_500bp.bed
echo "Consensus peak set: $(wc -l < consensus_peaks_500bp.bed) peaks"
## ================================================================
## Motif analysis: HOMER de novo + MEME suite
## Input: peak summit BED file (100-200 bp centred on summit)
## ================================================================
PEAKS="consensus_peaks_500bp.bed"
GENOME_FASTA="genome.fasta"
GENOME_DIR="homer_genome/" ## pre-prepared by HOMER: prepareGenome.pl
## Step 1: HOMER de novo motif finding
## HOMER searches for enriched 8-12 bp sequences in peaks vs background
findMotifsGenome.pl \
${PEAKS} \
${GENOME_DIR} \ ## genome prepared with HOMER prepareGenome.pl
motif_results/ \
-size 200 \ ## sequence window around peak summit
-mask \ ## mask repeat regions
-p 16 \ ## threads
-len 8,10,12 ## motif lengths to search
## Output: motif_results/homerResults.html -- interactive motif report
## Top de novo motifs with E-values, % targets, % background, known matches
## motif_results/knownResults.txt -- enrichment of known TF motifs from JASPAR
## Step 2: MEME-ChIP for comprehensive motif analysis
## First extract sequences from peaks
bedtools getfasta \
-fi ${GENOME_FASTA} \
-bed ${PEAKS} \
-fo peak_sequences.fasta
## Run MEME-ChIP
meme-chip \
-oc meme_results/ \
-db /path/to/JASPAR2022_CORE_plants_non-redundant.meme \ ## motif database
-meme-p 16 \
-meme-nmotifs 15 \
-meme-minw 6 -meme-maxw 20 \
peak_sequences.fasta
## Step 3: Motif scanning (find all instances of a known motif in peaks)
## Scan for all JASPAR TF binding sites in your peak set
fimo \
--oc fimo_results/ \
--thresh 1e-4 \
JASPAR2022_CORE_plants.meme \
peak_sequences.fasta
## Output: fimo_results/fimo.tsv -- all significant motif hits
## ================================================================
## Differential accessibility (ATAC-seq) using DESeq2
## Same approach works for differential ChIP-seq binding
## ================================================================
## BiocManager::install(c("DESeq2", "chromVAR", "GenomicRanges"))
library(DESeq2); library(dplyr); library(ggplot2)
## Step 1: Count reads in consensus peaks for each sample
## Use featureCounts (Subread package) or bedtools multicov
## featureCounts -a consensus_peaks_500bp.saf -F SAF -p -T 8 \
## -o peak_counts.txt rep1.bam rep2.bam treatment1.bam treatment2.bam
## Load count matrix
counts <- read.table("peak_counts.txt", header=TRUE, skip=1, row.names=1)
count_matrix <- as.matrix(counts[, 6:ncol(counts)]) ## columns 6+ = sample counts
colnames(count_matrix) <- gsub(".*/(.*)\\.bam", "\\1", colnames(count_matrix))
## Sample metadata
meta <- data.frame(
sample = colnames(count_matrix),
condition = c("control","control","treatment","treatment"),
row.names = colnames(count_matrix)
)
## Step 2: Run DESeq2
dds <- DESeqDataSetFromMatrix(
countData = count_matrix,
colData = meta,
design = ~condition
)
## Filter low-count peaks
dds <- dds[rowSums(counts(dds) >= 10) >= 2, ]
dds <- DESeq(dds)
## Step 3: Extract results
res <- results(dds,
contrast = c("condition","treatment","control"),
alpha = 0.05)
res_df <- as.data.frame(res) %>%
tibble::rownames_to_column("peak_id") %>%
filter(!is.na(padj)) %>%
arrange(padj)
cat("Gained peaks (treatment > control, FC>2, FDR<5%):",
sum(res_df$padj < 0.05 & res_df$log2FoldChange > 1, na.rm=TRUE), "\n")
cat("Lost peaks (control > treatment, FC<0.5, FDR<5%):",
sum(res_df$padj < 0.05 & res_df$log2FoldChange < -1, na.rm=TRUE), "\n")
## Step 4: MA plot
ggplot(res_df, aes(baseMean, log2FoldChange,
colour=padj < 0.05 & abs(log2FoldChange) > 1)) +
geom_point(size=0.5, alpha=0.6) +
scale_colour_manual(values=c("FALSE"="#adb5bd","TRUE"="#e63946"),
labels=c("Not sig","Significant"), name=NULL) +
scale_x_log10() +
geom_hline(yintercept=c(-1,0,1), linetype=c("dashed","solid","dashed"),
colour=c("blue","grey50","red"), linewidth=0.5) +
labs(x="Mean normalised count", y="Log2 fold change (treatment / control)",
title="Differential chromatin accessibility (DESeq2)",
subtitle=paste0(sum(res_df$padj < 0.05 & res_df$log2FoldChange > 1, na.rm=TRUE),
" gained peaks; ",
sum(res_df$padj < 0.05 & res_df$log2FoldChange < -1, na.rm=TRUE),
" lost peaks")) +
theme_bw(base_size=12)
ggsave("differential_accessibility_MA.png", width=8, height=6, dpi=200)
## ================================================================
## Heatmap and average profile plots using deepTools
## ================================================================
## computeMatrix: compute coverage in windows around each peak
computeMatrix reference-point \
--referencePoint center \
--beforeRegionStartLength 2000 \
--afterRegionStartLength 2000 \
--binSize 25 \
--skipZeros \
--missingDataAsZero \
-S sample_ATAC_control_rep1.bw \
sample_ATAC_control_rep2.bw \
sample_ATAC_treatment_rep1.bw \
sample_ATAC_treatment_rep2.bw \
-R consensus_peaks_500bp.bed \
-o atac_heatmap_matrix.gz \
-p 16
## plotHeatmap: sorted by signal, with colour scale
plotHeatmap \
-m atac_heatmap_matrix.gz \
--out atac_peak_heatmap.png \
--colorMap RdYlBu_r \
--zMax 5 --zMin -1 \
--sortUsing sum \
--samplesLabel Control_R1 Control_R2 Treatment_R1 Treatment_R2 \
--regionsLabel "Consensus peaks" \
--dpi 200
## plotProfile: average signal across all peaks
plotProfile \
-m atac_heatmap_matrix.gz \
--out atac_avg_profile.png \
--plotType lines \
--perGroup \
--colors "#2166ac" "#4393c3" "#d6604d" "#b2182b" \
--samplesLabel Control_R1 Control_R2 Treatment_R1 Treatment_R2
## ================================================================
## R: volcano plot for differential accessibility
## ================================================================
library(ggplot2); library(ggrepel)
res_df$label <- ifelse(res_df$padj < 0.001 & abs(res_df$log2FoldChange) > 2,
res_df$peak_id, "")
ggplot(res_df, aes(log2FoldChange, -log10(padj),
colour=padj < 0.05 & abs(log2FoldChange) > 1)) +
geom_point(size=0.8, alpha=0.7) +
geom_text_repel(aes(label=label), size=2.5, max.overlaps=20) +
scale_colour_manual(values=c("FALSE"="#adb5bd","TRUE"="#e63946"),
labels=c("Not sig","FDR<5%, FC>2"), name=NULL) +
geom_vline(xintercept=c(-1,1), linetype="dashed", colour="grey40") +
geom_hline(yintercept=-log10(0.05), linetype="dashed", colour="grey40") +
labs(x="Log2 fold change", y="-log10(adjusted p-value)",
title="Differential ATAC-seq peaks: Volcano plot") +
theme_bw(base_size=12)
ggsave("differential_accessibility_volcano.png", width=8, height=6, dpi=200)
| Problem | Symptom | Fix |
|---|---|---|
| Low FRiP in ATAC-seq | FRiP < 5%; too many background reads | Most likely cause: poor nuclei isolation (cytoplasmic contamination). Repeat nuclei prep. Alternatively: insufficient sequencing depth or wrong cell type being assayed. |
| Mitochondrial reads dominate | MT fraction > 40% | Mitochondria are highly accessible. Improve nuclei isolation protocol: use IGEPAL-CA630 (0.1%) or NP-40 to lyse cells more cleanly. Add 0.1% digitonin. Always remove MT reads before peak calling. |
| No nucleosomal ladder | Fragment size distribution is flat; no peaks at 200, 400 bp | Over-transposition (too much Tn5) or degraded DNA. Optimise Tn5 concentration (typically 2.5 uL per 50K cells). Check DNA integrity on Bioanalyzer before library prep. |
| ChIP-seq peaks without input control | Thousands of peaks in gene-poor regions; peaks in blacklist regions | ALWAYS use a matched input control for ChIP-seq. Input-less peak calling (--nolambda) is only appropriate for ATAC-seq. |
| IDR below 50% passing | Less than half of peaks in each replicate are IDR-reproducible | Check individual replicate quality (FRiP, TSS enrichment). Replicates may have batch effects (different cell passages, different operators). Consider whether the biological condition is inherently variable. |
| Differential analysis confounded by read depth | Samples with more reads always have more "gained" peaks | Always normalise: DESeq2 uses size factors computed from the count matrix. Use quantile normalisation (chromVAR) for large-scale comparisons. Never compare raw peak counts between samples. |