ATAC-seq Chromatin Accessibility Analysis

Overview

ATAC-seq (Assay for Transposase-Accessible Chromatin by sequencing) uses a hyperactive Tn5 transposase to tag and fragment open chromatin regions. Sequencing these fragments reveals genome-wide chromatin accessibility — a proxy for active regulatory elements.

Why remove organellar reads? Mitochondrial and chloroplast DNA in plants is highly accessible (no nucleosome packaging) and can dominate 20–80% of ATAC-seq reads. These reads represent contamination of the signal — removing them is essential before peak calling to avoid false peaks and to correctly normalize genomic accessibility.

Step	Tool	Purpose
QC	FastQC	Check read quality; adapters often pre-removed
Index	bowtie2-build	Build genome index for alignment
Align	bowtie2	Map reads to genome (ATAC uses generic aligner, not splice-aware)
Filter	samtools idxstats + view	Remove Mt and Pt (organellar) reads
Peak call	MACS2	Identify open chromatin regions
Visualize	bedGraphToBigWig + IGV	Generate browser tracks

Ideal Directory Structure

Organize your ATAC-seq project like this. The numbered folder convention keeps every step's outputs separate and easy to re-run.

atacseq_brm_project/

atacseq_brm_project/ ├── 0_genome/ # genome FASTA + Bowtie2 index (.bt2 files) ├── 1_data/ # raw FASTQ (read-only! never modify) │ ├── SRR4733912_1.fastq.gz │ └── SRR4733912_2.fastq.gz ├── 2_fastqc/ # FastQC reports ├── 3_bam/ # all BAM files │ ├── SRR4733912.bam # raw alignment │ ├── SRR4733912.nuclear.bam # organelles removed │ └── SRR4733912.nuclear.bam.bai ├── 4_peaks/ # MACS2 peak files │ ├── WT_peaks.broadPeak │ └── WT_treat_pileup.bdg ├── 5_bigwig/ # normalized BigWig tracks for IGV │ └── WT.bw └── scripts/ # your .sh scripts

🌱

Plant ATAC-seq gotcha: The 0_genome/ folder needs both the nuclear chromosomes AND the organellar chromosomes (Mt, Pt) in the same FASTA for alignment — you need to align everything first, then filter organellar reads after. If you omit Mt/Pt from the genome, those reads will mis-map to nuclear chromosomes and create ghost peaks!

Dataset

This dataset comes from Jégu et al. (2017), studying how the BRM chromatin remodeler controls Arabidopsis seedling morphogenesis by modulating chromatin accessibility. The ATAC-seq libraries are paired-end and the adapters were removed prior to deposition.

Sample	ENA ID	Description
ATAC WT seedling	SRR4733912	Paired-end, Nextera adapters pre-removed

BioProject: PRJNA351855

Step 1 — Download Data

Bash — Download ATAC-seq reads and Arabidopsis genome

mkdir -p 1_data 0_genome

# Download ATAC-seq paired-end reads
cd 1_data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR473/002/SRR4733912/SRR4733912_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR473/002/SRR4733912/SRR4733912_2.fastq.gz
cd ..

# Download Arabidopsis TAIR10 genome from EnsemblPlants
cd 0_genome
wget http://ftp.ensemblgenomes.org/pub/plants/release-54/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gzip -d Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
ln -s Arabidopsis_thaliana.TAIR10.dna.toplevel.fa At_Genome
cd ..

Download:

What does each command do?

➕cat genome.fna Mt.fna Pt.fna	Concatenates the nuclear genome with the mitochondrial (Mt) and chloroplast (Pt) genomes into one single FASTA. Bowtie2 needs one combined index so reads from organelles are captured and can be removed later — rather than mismapping to the nuclear genome.
🗺️bowtie2-build	Builds an FM-index of the combined genome. Unlike HISAT2, Bowtie2 does not handle splicing — perfect for ATAC-seq which comes from genomic DNA, not mRNA.

�

What is ln -s? Creates a symbolic link (shortcut) named At_Genome pointing to the long genome filename. Tools like bowtie2-build can then use the short name. If you ever regenerate the genome file, just update the symlink — all scripts keep working.

Step 2 — Quality Check

Bash — FastQC

module load fastqc
mkdir -p 2_fastqc
fastqc -o 2_fastqc 1_data/*.fastq.gz

# For this dataset: adapters were already removed by the authors
# Expect clean per-base quality and minimal adapter content

Download:

Step 3 — Bowtie2 Alignment

Bash — Bowtie2 Index + Alignment

module load bowtie2
module load samtools

# Build genome index (run once)
mkdir -p 3_bwt_index
bowtie2-build 0_genome/At_Genome 3_bwt_index/At.TAIR10
echo "Index complete. Files:"
ls 3_bwt_index/

# Align ATAC-seq reads
mkdir -p 4_bam
bowtie2 --threads 8 \
    -x 3_bwt_index/At.TAIR10 \
    -q \
    -1 1_data/SRR4733912_1.fastq.gz \
    -2 1_data/SRR4733912_2.fastq.gz \
    -S 4_bam/SRR4733912.sam \
    2> 4_bam/SRR4733912_alignment.log

cat 4_bam/SRR4733912_alignment.log  # Check alignment rate

# Convert to sorted BAM
samtools view --threads 7 -bS 4_bam/SRR4733912.sam \
    | samtools sort --threads 7 - > 4_bam/SRR4733912.sorted.bam

samtools index 4_bam/SRR4733912.sorted.bam
rm 4_bam/SRR4733912.sam   # save disk space

# View BAM header (verify chromosome names)
samtools view -H 4_bam/SRR4733912.sorted.bam | head -20

Download:

Bowtie2 flags decoded

🧵--threads 8	Use 8 CPU threads to speed up alignment. Bowtie2 parallelizes very well — 8 threads is ~7× faster than single-threaded.
🗺️-x 3_bwt_index/At.TAIR10	The genome index prefix. Bowtie2 looks for files named `At.TAIR10.1.bt2`, `At.TAIR10.2.bt2` etc.
📖-q	Input is in FASTQ format (the default, but explicit is clearer). Alternatively `-f` for FASTA.
👫-1 / -2	Forward (R1) and reverse (R2) read files. Bowtie2 uses the paired information to improve alignment accuracy and correctly estimate insert sizes.
💾-S sample.sam	Write output in SAM format. Like RNA-seq, we immediately convert to BAM and delete the SAM.
🔍samtools view -H	Prints the BAM Header, which contains chromosome names and lengths. Always check this — MACS2 must get chromosome names from the BAM, and they must match your annotation file.

Step 4 — Remove Organellar Reads (Critical for Plants)

Bash — Remove Mt and Pt reads

module load samtools

# First, see the read counts per chromosome
# This shows how many reads align to Mt and Pt vs nuclear chromosomes
samtools idxstats 4_bam/SRR4733912.sorted.bam

# EXAMPLE OUTPUT (from actual data):
# 1       30427671   2018268  43655
# 2       19698289   2796545  47708
# 3       23459830   1738071  36100
# 4       18585056   1290746  26515
# 5       26975502   1795689  40246
# Mt      366924     2469591  35167     <-- ~2.4M reads = 20% of data!
# Pt      154478     44476436 524929    <-- ~44M reads = DOMINANT contamination

# Keep only nuclear chromosomes (exclude Mt and Pt)
samtools idxstats 4_bam/SRR4733912.sorted.bam \
    | cut -f1 \
    | grep -v Mt \
    | grep -v Pt \
    | xargs samtools view --threads 7 -b \
        4_bam/SRR4733912.sorted.bam \
    > 4_bam/SRR4733912.sorted.noorg.bam

samtools index 4_bam/SRR4733912.sorted.noorg.bam

# Verify: Mt and Pt should now show 0 reads
samtools idxstats 4_bam/SRR4733912.sorted.noorg.bam

Download:

Organelle removal — command-by-command

📊samtools idxstats	Prints a table: chromosome name, length, mapped reads, unmapped reads. The fastest way to see if your organellar reads dominate the dataset (they will!).
✂️cut -f1	Cuts just the first field (column 1 = chromosome name) from the idxstats output. This gives a list of all chromosome names in the BAM.
🚫grep -v Mt / grep -v Pt	The `-v` flag inverts the match — keep lines that do NOT contain "Mt" or "Pt". This filters the chromosome list to nuclear chromosomes only.
🔧xargs samtools view -b	`xargs` takes the list of chromosome names from the pipe and passes them as arguments to `samtools view`. Samtools then extracts only reads mapping to those chromosomes. The `-b` flag outputs BAM format directly.

🌿

The pipe (|) trick: The whole organelle-removal command is one pipeline — idxstats | cut | grep | grep | xargs samtools view. Each | passes the output of one command directly as input to the next, in memory, without writing any intermediate files. Pipelines are a superpower in bioinformatics!

Organelle removal is non-negotiable for plant ATAC-seq. In this dataset, chloroplast reads account for ~44M / ~51M total reads — over 85% of all alignments! Without removal, MACS2 would set its lambda background model using these organellar reads, making peak calling on nuclear chromosomes unreliable.

Step 5 — MACS2 Peak Calling

Bash — MACS2 Broad Peak Calling

module load macs2  # or: pip install macs2

mkdir -p 5_peaks

macs2 callpeak \
    -t 4_bam/SRR4733912.sorted.noorg.bam \
    -q 0.05 \
    --broad \
    -f BAMPE \
    -n SRR4733912 \
    -B \
    --trackline \
    --outdir 5_peaks \
    2> 5_peaks/SRR4733912.peak.log &

# Key parameters explained:
# -q 0.05    : FDR cutoff for peak calling
# --broad    : call broad peaks (ATAC open chromatin regions can be large)
# -f BAMPE   : BAM paired-end format (uses actual fragment size, not estimate)
# -B         : generate bedGraph pileup files for visualization
# --trackline: add UCSC browser track header

# Output files:
ls 5_peaks/
# SRR4733912_control_lambda.bdg  -- local lambda model
# SRR4733912_peaks.broadPeak     -- BED format peak calls
# SRR4733912_peaks.gappedPeak    -- gapped peak format
# SRR4733912_peaks.xls           -- peak statistics
# SRR4733912_treat_pileup.bdg    -- read pileup bedGraph

# Count peaks
wc -l 5_peaks/SRR4733912_peaks.broadPeak

# Show top peaks by score
sort -k7,7nr 5_peaks/SRR4733912_peaks.broadPeak | head -20

Download:

MACS2 callpeak flags decoded

🎯-t sample.bam	The treatment BAM file (your ATAC-seq data). MACS2 calls peaks here.
📉-q 0.05	FDR (false discovery rate) q-value cutoff. Peaks with q > 0.05 are discarded. Stricter (0.01) = fewer but higher confidence peaks.
🌊--broad	Call broad (merged) peaks instead of sharp narrow peaks. ATAC-seq open chromatin regions can span hundreds to thousands of base pairs, so broad mode fits better than the narrow mode used for TF ChIP-seq.
👫-f BAMPE	Input format: BAM Paired-End. MACS2 uses the actual fragment sizes inferred from read pairs rather than guessing from single reads. More accurate peak boundaries!
🏷️-n SRR4733912	Output file prefix name. All output files will start with this name.
📈-B	Output a BedGraph pileup file — needed to generate BigWig tracks for IGV.

MACS2 summary for this dataset (from the actual log):

MACS2 Parameters (actual output)

# Program:macs2 callpeak
# name            = SRR4733912
# format          = BAMPE
# ChIP-seq file   = SRR4733912.sorted.noorg.bam
# qvalue cutoff   = 5.00e-02
# Broad peak mode is on
# Paired-End mode is on
# mean fragment size = 139 bp from treatment
# fragments after filtering in treatment: 4002590

Step 6 — BigWig for IGV Visualization

Bash — bedGraph to BigWig conversion

module load bedtools
module load ucsc-tools  # or download bedClip + bedGraphToBigWig from UCSC

mkdir -p 6_bigwig

# a) Generate chromosome sizes file from the genome
bioawk -c fastx '{print $name, length($seq)}' 0_genome/At_Genome \
    > 6_bigwig/At_chr.sizes

# b) Clip bedGraph to correct chromosome boundaries
bedtools slop \
    -i 5_peaks/SRR4733912_treat_pileup.bdg \
    -g 6_bigwig/At_chr.sizes \
    -b 0 \
    | bedClip stdin 6_bigwig/At_chr.sizes \
      6_bigwig/SRR4733912_treat_pileup.clipped.bdg

# c) Sort the clipped bedGraph
sort -k1,1 -k2,2n 6_bigwig/SRR4733912_treat_pileup.clipped.bdg \
    > 6_bigwig/SRR4733912_treat_pileup.clipped.sorted.bdg

# d) Convert to BigWig
bedGraphToBigWig \
    6_bigwig/SRR4733912_treat_pileup.clipped.sorted.bdg \
    6_bigwig/At_chr.sizes \
    6_bigwig/SRR4733912_treat_pileup.bw

echo "BigWig file ready: 6_bigwig/SRR4733912_treat_pileup.bw"
echo "Load this file in IGV with TAIR10 genome to view ATAC peaks"

Download:

BigWig pipeline decoded

📏bioawk -c fastx	bioawk reads FASTA files (`-c fastx`) and lets you use `$name` and `$seq` variables. Here we print each chromosome name and its length to make a "chrom sizes" file required by UCSC tools.
✂️bedClip	Clips bedGraph entries to not exceed chromosome boundaries. MACS2 sometimes outputs coordinates slightly past the chromosome end due to smoothing — bedClip silently fixes this, preventing bedGraphToBigWig from crashing.
📊bedGraphToBigWig	Converts the text bedGraph format to the binary BigWig format. BigWig is ~10× smaller and loads instantly in IGV/UCSC browser compared to bedGraph. The chromosome sizes file is required to set the coordinate boundaries.
📑sort -k1,1 -k2,2n	Sorts the bedGraph by chromosome name (`-k1,1` = column 1, alphabetically) then by start position (`-k2,2n` = column 2, numerically). bedGraphToBigWig requires sorted input — unsorted input causes an immediate crash.

Viewing in IGV: Open IGV, select Arabidopsis thaliana (TAIR10) as the genome, then drag-and-drop the .bw file onto the tracks panel. Navigate to a known regulatory region (e.g., a gene promoter like FLC chr5:3,173,983) to see the pileup of open chromatin reads. Accessible regions appear as sharp peaks. Load the .broadPeak file alongside to see called peak boundaries.

Expected Results

File	Description	Expected Value
`SRR4733912.sorted.bam`	All aligned reads	~51M read pairs
`SRR4733912.sorted.noorg.bam`	Nuclear reads only	~4M read pairs (after Pt/Mt removal)
`SRR4733912_peaks.broadPeak`	Called open chromatin peaks	~15,000–25,000 peaks
`SRR4733912_treat_pileup.bw`	BigWig track for IGV	Peaks visible at gene promoters/enhancers

The bigwig file shows distinct peaks at promoters and enhancers. Comparing WT vs BRM mutant peaks reveals regions of altered chromatin accessibility regulated by the BRM remodeler.

Troubleshooting

MACS2: "No enough paired-end reads"

After organelle removal, too few fragments remain. Check: did you align to the correct genome? Are chromosome names consistent (TAIR10 uses "1","2" not "Chr1","Chr2")? Run samtools flagstat to verify paired-end reads are present.

bedGraphToBigWig: "out of bounds" or coordinate errors

The chromosome sizes file must exactly match the genome used for alignment. Generate it from the same FASTA used by bowtie2-build with bioawk -c fastx '{print $name, length($seq)}' genome.fa. Do not use a chromosome sizes file from UCSC if you aligned to Ensembl (different chromosome naming).

Very few peaks (<1000) after MACS2

Check organelle read removal — if you forgot this step, MACS2's background model is overwhelmed. Also verify -f BAMPE is set (not BAM) for paired-end data, and that the BAM contains paired reads (samtools flagstat should show properly paired reads).

ATAC-seq: Chromatin Accessibility Analysis

On this page

Overview

Ideal Directory Structure

Dataset

Step 1 — Download Data

Step 2 — Quality Check

Step 3 — Bowtie2 Alignment

Step 4 — Remove Organellar Reads (Critical for Plants)

Step 5 — MACS2 Peak Calling

Step 6 — BigWig for IGV Visualization

Expected Results

Troubleshooting

MACS2: "No enough paired-end reads"

bedGraphToBigWig: "out of bounds" or coordinate errors

Very few peaks (<1000) after MACS2