ATAC-seq: Chromatin Accessibility Analysis

From raw FASTQ to chromatin accessibility peaks using the Arabidopsis BRM chromatin remodeling dataset — bowtie2, MACS2, and IGV visualization.

~90 min PRJNA351855 Intermediate Bash
Byte

Overview

ATAC-seq (Assay for Transposase-Accessible Chromatin by sequencing) uses a hyperactive Tn5 transposase to tag and fragment open chromatin regions. Sequencing these fragments reveals genome-wide chromatin accessibility — a proxy for active regulatory elements.

Byte
Why remove organellar reads? Mitochondrial and chloroplast DNA in plants is highly accessible (no nucleosome packaging) and can dominate 20–80% of ATAC-seq reads. These reads represent contamination of the signal — removing them is essential before peak calling to avoid false peaks and to correctly normalize genomic accessibility.
StepToolPurpose
QCFastQCCheck read quality; adapters often pre-removed
Indexbowtie2-buildBuild genome index for alignment
Alignbowtie2Map reads to genome (ATAC uses generic aligner, not splice-aware)
Filtersamtools idxstats + viewRemove Mt and Pt (organellar) reads
Peak callMACS2Identify open chromatin regions
VisualizebedGraphToBigWig + IGVGenerate browser tracks

Ideal Directory Structure

Organize your ATAC-seq project like this. The numbered folder convention keeps every step's outputs separate and easy to re-run.

atacseq_brm_project/
atacseq_brm_project/ ├── 0_genome/ # genome FASTA + Bowtie2 index (.bt2 files) ├── 1_data/ # raw FASTQ (read-only! never modify) │ ├── SRR4733912_1.fastq.gz │ └── SRR4733912_2.fastq.gz ├── 2_fastqc/ # FastQC reports ├── 3_bam/ # all BAM files │ ├── SRR4733912.bam # raw alignment │ ├── SRR4733912.nuclear.bam # organelles removed │ └── SRR4733912.nuclear.bam.bai ├── 4_peaks/ # MACS2 peak files │ ├── WT_peaks.broadPeak │ └── WT_treat_pileup.bdg ├── 5_bigwig/ # normalized BigWig tracks for IGV │ └── WT.bw └── scripts/ # your .sh scripts
🌱
Plant ATAC-seq gotcha: The 0_genome/ folder needs both the nuclear chromosomes AND the organellar chromosomes (Mt, Pt) in the same FASTA for alignment — you need to align everything first, then filter organellar reads after. If you omit Mt/Pt from the genome, those reads will mis-map to nuclear chromosomes and create ghost peaks!

Dataset

This dataset comes from Jégu et al. (2017), studying how the BRM chromatin remodeler controls Arabidopsis seedling morphogenesis by modulating chromatin accessibility. The ATAC-seq libraries are paired-end and the adapters were removed prior to deposition.

SampleENA IDDescription
ATAC WT seedlingSRR4733912Paired-end, Nextera adapters pre-removed

BioProject: PRJNA351855

Step 1 — Download Data

Bash — Download ATAC-seq reads and Arabidopsis genome
mkdir -p 1_data 0_genome

# Download ATAC-seq paired-end reads
cd 1_data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR473/002/SRR4733912/SRR4733912_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR473/002/SRR4733912/SRR4733912_2.fastq.gz
cd ..

# Download Arabidopsis TAIR10 genome from EnsemblPlants
cd 0_genome
wget http://ftp.ensemblgenomes.org/pub/plants/release-54/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gzip -d Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
ln -s Arabidopsis_thaliana.TAIR10.dna.toplevel.fa At_Genome
cd ..
Download:
What does each command do?
cat genome.fna Mt.fna Pt.fnaConcatenates the nuclear genome with the mitochondrial (Mt) and chloroplast (Pt) genomes into one single FASTA. Bowtie2 needs one combined index so reads from organelles are captured and can be removed later — rather than mismapping to the nuclear genome.
🗺️bowtie2-buildBuilds an FM-index of the combined genome. Unlike HISAT2, Bowtie2 does not handle splicing — perfect for ATAC-seq which comes from genomic DNA, not mRNA.
What is ln -s? Creates a symbolic link (shortcut) named At_Genome pointing to the long genome filename. Tools like bowtie2-build can then use the short name. If you ever regenerate the genome file, just update the symlink — all scripts keep working.

Step 2 — Quality Check

Bash — FastQC
module load fastqc
mkdir -p 2_fastqc
fastqc -o 2_fastqc 1_data/*.fastq.gz

# For this dataset: adapters were already removed by the authors
# Expect clean per-base quality and minimal adapter content
Download:

Step 3 — Bowtie2 Alignment

Bash — Bowtie2 Index + Alignment
module load bowtie2
module load samtools

# Build genome index (run once)
mkdir -p 3_bwt_index
bowtie2-build 0_genome/At_Genome 3_bwt_index/At.TAIR10
echo "Index complete. Files:"
ls 3_bwt_index/

# Align ATAC-seq reads
mkdir -p 4_bam
bowtie2 --threads 8 \
    -x 3_bwt_index/At.TAIR10 \
    -q \
    -1 1_data/SRR4733912_1.fastq.gz \
    -2 1_data/SRR4733912_2.fastq.gz \
    -S 4_bam/SRR4733912.sam \
    2> 4_bam/SRR4733912_alignment.log

cat 4_bam/SRR4733912_alignment.log  # Check alignment rate

# Convert to sorted BAM
samtools view --threads 7 -bS 4_bam/SRR4733912.sam \
    | samtools sort --threads 7 - > 4_bam/SRR4733912.sorted.bam

samtools index 4_bam/SRR4733912.sorted.bam
rm 4_bam/SRR4733912.sam   # save disk space

# View BAM header (verify chromosome names)
samtools view -H 4_bam/SRR4733912.sorted.bam | head -20
Download:
Bowtie2 flags decoded
🧵--threads 8Use 8 CPU threads to speed up alignment. Bowtie2 parallelizes very well — 8 threads is ~7× faster than single-threaded.
🗺️-x 3_bwt_index/At.TAIR10The genome index prefix. Bowtie2 looks for files named At.TAIR10.1.bt2, At.TAIR10.2.bt2 etc.
📖-qInput is in FASTQ format (the default, but explicit is clearer). Alternatively -f for FASTA.
👫-1 / -2Forward (R1) and reverse (R2) read files. Bowtie2 uses the paired information to improve alignment accuracy and correctly estimate insert sizes.
💾-S sample.samWrite output in SAM format. Like RNA-seq, we immediately convert to BAM and delete the SAM.
🔍samtools view -HPrints the BAM Header, which contains chromosome names and lengths. Always check this — MACS2 must get chromosome names from the BAM, and they must match your annotation file.

Step 4 — Remove Organellar Reads (Critical for Plants)

Bash — Remove Mt and Pt reads
module load samtools

# First, see the read counts per chromosome
# This shows how many reads align to Mt and Pt vs nuclear chromosomes
samtools idxstats 4_bam/SRR4733912.sorted.bam

# EXAMPLE OUTPUT (from actual data):
# 1       30427671   2018268  43655
# 2       19698289   2796545  47708
# 3       23459830   1738071  36100
# 4       18585056   1290746  26515
# 5       26975502   1795689  40246
# Mt      366924     2469591  35167     <-- ~2.4M reads = 20% of data!
# Pt      154478     44476436 524929    <-- ~44M reads = DOMINANT contamination

# Keep only nuclear chromosomes (exclude Mt and Pt)
samtools idxstats 4_bam/SRR4733912.sorted.bam \
    | cut -f1 \
    | grep -v Mt \
    | grep -v Pt \
    | xargs samtools view --threads 7 -b \
        4_bam/SRR4733912.sorted.bam \
    > 4_bam/SRR4733912.sorted.noorg.bam

samtools index 4_bam/SRR4733912.sorted.noorg.bam

# Verify: Mt and Pt should now show 0 reads
samtools idxstats 4_bam/SRR4733912.sorted.noorg.bam
Download:
Organelle removal — command-by-command
📊samtools idxstatsPrints a table: chromosome name, length, mapped reads, unmapped reads. The fastest way to see if your organellar reads dominate the dataset (they will!).
✂️cut -f1Cuts just the first field (column 1 = chromosome name) from the idxstats output. This gives a list of all chromosome names in the BAM.
🚫grep -v Mt / grep -v PtThe -v flag inverts the match — keep lines that do NOT contain "Mt" or "Pt". This filters the chromosome list to nuclear chromosomes only.
🔧xargs samtools view -bxargs takes the list of chromosome names from the pipe and passes them as arguments to samtools view. Samtools then extracts only reads mapping to those chromosomes. The -b flag outputs BAM format directly.
🌿
The pipe (|) trick: The whole organelle-removal command is one pipeline — idxstats | cut | grep | grep | xargs samtools view. Each | passes the output of one command directly as input to the next, in memory, without writing any intermediate files. Pipelines are a superpower in bioinformatics!
Organelle removal is non-negotiable for plant ATAC-seq. In this dataset, chloroplast reads account for ~44M / ~51M total reads — over 85% of all alignments! Without removal, MACS2 would set its lambda background model using these organellar reads, making peak calling on nuclear chromosomes unreliable.

Step 5 — MACS2 Peak Calling

Bash — MACS2 Broad Peak Calling
module load macs2  # or: pip install macs2

mkdir -p 5_peaks

macs2 callpeak \
    -t 4_bam/SRR4733912.sorted.noorg.bam \
    -q 0.05 \
    --broad \
    -f BAMPE \
    -n SRR4733912 \
    -B \
    --trackline \
    --outdir 5_peaks \
    2> 5_peaks/SRR4733912.peak.log &

# Key parameters explained:
# -q 0.05    : FDR cutoff for peak calling
# --broad    : call broad peaks (ATAC open chromatin regions can be large)
# -f BAMPE   : BAM paired-end format (uses actual fragment size, not estimate)
# -B         : generate bedGraph pileup files for visualization
# --trackline: add UCSC browser track header

# Output files:
ls 5_peaks/
# SRR4733912_control_lambda.bdg  -- local lambda model
# SRR4733912_peaks.broadPeak     -- BED format peak calls
# SRR4733912_peaks.gappedPeak    -- gapped peak format
# SRR4733912_peaks.xls           -- peak statistics
# SRR4733912_treat_pileup.bdg    -- read pileup bedGraph

# Count peaks
wc -l 5_peaks/SRR4733912_peaks.broadPeak

# Show top peaks by score
sort -k7,7nr 5_peaks/SRR4733912_peaks.broadPeak | head -20
Download:
MACS2 callpeak flags decoded
🎯-t sample.bamThe treatment BAM file (your ATAC-seq data). MACS2 calls peaks here.
📉-q 0.05FDR (false discovery rate) q-value cutoff. Peaks with q > 0.05 are discarded. Stricter (0.01) = fewer but higher confidence peaks.
🌊--broadCall broad (merged) peaks instead of sharp narrow peaks. ATAC-seq open chromatin regions can span hundreds to thousands of base pairs, so broad mode fits better than the narrow mode used for TF ChIP-seq.
👫-f BAMPEInput format: BAM Paired-End. MACS2 uses the actual fragment sizes inferred from read pairs rather than guessing from single reads. More accurate peak boundaries!
🏷️-n SRR4733912Output file prefix name. All output files will start with this name.
📈-BOutput a BedGraph pileup file — needed to generate BigWig tracks for IGV.

MACS2 summary for this dataset (from the actual log):

MACS2 Parameters (actual output)
# Program:macs2 callpeak
# name            = SRR4733912
# format          = BAMPE
# ChIP-seq file   = SRR4733912.sorted.noorg.bam
# qvalue cutoff   = 5.00e-02
# Broad peak mode is on
# Paired-End mode is on
# mean fragment size = 139 bp from treatment
# fragments after filtering in treatment: 4002590

Step 6 — BigWig for IGV Visualization

Bash — bedGraph to BigWig conversion
module load bedtools
module load ucsc-tools  # or download bedClip + bedGraphToBigWig from UCSC

mkdir -p 6_bigwig

# a) Generate chromosome sizes file from the genome
bioawk -c fastx '{print $name, length($seq)}' 0_genome/At_Genome \
    > 6_bigwig/At_chr.sizes

# b) Clip bedGraph to correct chromosome boundaries
bedtools slop \
    -i 5_peaks/SRR4733912_treat_pileup.bdg \
    -g 6_bigwig/At_chr.sizes \
    -b 0 \
    | bedClip stdin 6_bigwig/At_chr.sizes \
      6_bigwig/SRR4733912_treat_pileup.clipped.bdg

# c) Sort the clipped bedGraph
sort -k1,1 -k2,2n 6_bigwig/SRR4733912_treat_pileup.clipped.bdg \
    > 6_bigwig/SRR4733912_treat_pileup.clipped.sorted.bdg

# d) Convert to BigWig
bedGraphToBigWig \
    6_bigwig/SRR4733912_treat_pileup.clipped.sorted.bdg \
    6_bigwig/At_chr.sizes \
    6_bigwig/SRR4733912_treat_pileup.bw

echo "BigWig file ready: 6_bigwig/SRR4733912_treat_pileup.bw"
echo "Load this file in IGV with TAIR10 genome to view ATAC peaks"
Download:
BigWig pipeline decoded
📏bioawk -c fastxbioawk reads FASTA files (-c fastx) and lets you use $name and $seq variables. Here we print each chromosome name and its length to make a "chrom sizes" file required by UCSC tools.
✂️bedClipClips bedGraph entries to not exceed chromosome boundaries. MACS2 sometimes outputs coordinates slightly past the chromosome end due to smoothing — bedClip silently fixes this, preventing bedGraphToBigWig from crashing.
📊bedGraphToBigWigConverts the text bedGraph format to the binary BigWig format. BigWig is ~10× smaller and loads instantly in IGV/UCSC browser compared to bedGraph. The chromosome sizes file is required to set the coordinate boundaries.
📑sort -k1,1 -k2,2nSorts the bedGraph by chromosome name (-k1,1 = column 1, alphabetically) then by start position (-k2,2n = column 2, numerically). bedGraphToBigWig requires sorted input — unsorted input causes an immediate crash.
Byte
Viewing in IGV: Open IGV, select Arabidopsis thaliana (TAIR10) as the genome, then drag-and-drop the .bw file onto the tracks panel. Navigate to a known regulatory region (e.g., a gene promoter like FLC chr5:3,173,983) to see the pileup of open chromatin reads. Accessible regions appear as sharp peaks. Load the .broadPeak file alongside to see called peak boundaries.

Expected Results

FileDescriptionExpected Value
SRR4733912.sorted.bamAll aligned reads~51M read pairs
SRR4733912.sorted.noorg.bamNuclear reads only~4M read pairs (after Pt/Mt removal)
SRR4733912_peaks.broadPeakCalled open chromatin peaks~15,000–25,000 peaks
SRR4733912_treat_pileup.bwBigWig track for IGVPeaks visible at gene promoters/enhancers

The bigwig file shows distinct peaks at promoters and enhancers. Comparing WT vs BRM mutant peaks reveals regions of altered chromatin accessibility regulated by the BRM remodeler.

Troubleshooting

MACS2: "No enough paired-end reads"

After organelle removal, too few fragments remain. Check: did you align to the correct genome? Are chromosome names consistent (TAIR10 uses "1","2" not "Chr1","Chr2")? Run samtools flagstat to verify paired-end reads are present.

bedGraphToBigWig: "out of bounds" or coordinate errors

The chromosome sizes file must exactly match the genome used for alignment. Generate it from the same FASTA used by bowtie2-build with bioawk -c fastx '{print $name, length($seq)}' genome.fa. Do not use a chromosome sizes file from UCSC if you aligned to Ensembl (different chromosome naming).

Very few peaks (<1000) after MACS2

Check organelle read removal — if you forgot this step, MACS2's background model is overwhelmed. Also verify -f BAMPE is set (not BAM) for paired-end data, and that the BAM contains paired reads (samtools flagstat should show properly paired reads).