From raw FASTQ to chromatin accessibility peaks using the Arabidopsis BRM chromatin remodeling dataset — bowtie2, MACS2, and IGV visualization.
ATAC-seq (Assay for Transposase-Accessible Chromatin by sequencing) uses a hyperactive Tn5 transposase to tag and fragment open chromatin regions. Sequencing these fragments reveals genome-wide chromatin accessibility — a proxy for active regulatory elements.
| Step | Tool | Purpose |
|---|---|---|
| QC | FastQC | Check read quality; adapters often pre-removed |
| Index | bowtie2-build | Build genome index for alignment |
| Align | bowtie2 | Map reads to genome (ATAC uses generic aligner, not splice-aware) |
| Filter | samtools idxstats + view | Remove Mt and Pt (organellar) reads |
| Peak call | MACS2 | Identify open chromatin regions |
| Visualize | bedGraphToBigWig + IGV | Generate browser tracks |
Organize your ATAC-seq project like this. The numbered folder convention keeps every step's outputs separate and easy to re-run.
0_genome/ folder needs both the nuclear chromosomes AND the organellar chromosomes (Mt, Pt) in the same FASTA for alignment — you need to align everything first, then filter organellar reads after. If you omit Mt/Pt from the genome, those reads will mis-map to nuclear chromosomes and create ghost peaks!This dataset comes from Jégu et al. (2017), studying how the BRM chromatin remodeler controls Arabidopsis seedling morphogenesis by modulating chromatin accessibility. The ATAC-seq libraries are paired-end and the adapters were removed prior to deposition.
| Sample | ENA ID | Description |
|---|---|---|
| ATAC WT seedling | SRR4733912 | Paired-end, Nextera adapters pre-removed |
BioProject: PRJNA351855
mkdir -p 1_data 0_genome # Download ATAC-seq paired-end reads cd 1_data wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR473/002/SRR4733912/SRR4733912_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR473/002/SRR4733912/SRR4733912_2.fastq.gz cd .. # Download Arabidopsis TAIR10 genome from EnsemblPlants cd 0_genome wget http://ftp.ensemblgenomes.org/pub/plants/release-54/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz gzip -d Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz ln -s Arabidopsis_thaliana.TAIR10.dna.toplevel.fa At_Genome cd ..
| ➕cat genome.fna Mt.fna Pt.fna | Concatenates the nuclear genome with the mitochondrial (Mt) and chloroplast (Pt) genomes into one single FASTA. Bowtie2 needs one combined index so reads from organelles are captured and can be removed later — rather than mismapping to the nuclear genome. |
| 🗺️bowtie2-build | Builds an FM-index of the combined genome. Unlike HISAT2, Bowtie2 does not handle splicing — perfect for ATAC-seq which comes from genomic DNA, not mRNA. |
ln -s? Creates a symbolic link (shortcut) named At_Genome pointing to the long genome filename. Tools like bowtie2-build can then use the short name. If you ever regenerate the genome file, just update the symlink — all scripts keep working.module load fastqc mkdir -p 2_fastqc fastqc -o 2_fastqc 1_data/*.fastq.gz # For this dataset: adapters were already removed by the authors # Expect clean per-base quality and minimal adapter content
module load bowtie2
module load samtools
# Build genome index (run once)
mkdir -p 3_bwt_index
bowtie2-build 0_genome/At_Genome 3_bwt_index/At.TAIR10
echo "Index complete. Files:"
ls 3_bwt_index/
# Align ATAC-seq reads
mkdir -p 4_bam
bowtie2 --threads 8 \
-x 3_bwt_index/At.TAIR10 \
-q \
-1 1_data/SRR4733912_1.fastq.gz \
-2 1_data/SRR4733912_2.fastq.gz \
-S 4_bam/SRR4733912.sam \
2> 4_bam/SRR4733912_alignment.log
cat 4_bam/SRR4733912_alignment.log # Check alignment rate
# Convert to sorted BAM
samtools view --threads 7 -bS 4_bam/SRR4733912.sam \
| samtools sort --threads 7 - > 4_bam/SRR4733912.sorted.bam
samtools index 4_bam/SRR4733912.sorted.bam
rm 4_bam/SRR4733912.sam # save disk space
# View BAM header (verify chromosome names)
samtools view -H 4_bam/SRR4733912.sorted.bam | head -20
| 🧵--threads 8 | Use 8 CPU threads to speed up alignment. Bowtie2 parallelizes very well — 8 threads is ~7× faster than single-threaded. |
| 🗺️-x 3_bwt_index/At.TAIR10 | The genome index prefix. Bowtie2 looks for files named At.TAIR10.1.bt2, At.TAIR10.2.bt2 etc. |
| 📖-q | Input is in FASTQ format (the default, but explicit is clearer). Alternatively -f for FASTA. |
| 👫-1 / -2 | Forward (R1) and reverse (R2) read files. Bowtie2 uses the paired information to improve alignment accuracy and correctly estimate insert sizes. |
| 💾-S sample.sam | Write output in SAM format. Like RNA-seq, we immediately convert to BAM and delete the SAM. |
| 🔍samtools view -H | Prints the BAM Header, which contains chromosome names and lengths. Always check this — MACS2 must get chromosome names from the BAM, and they must match your annotation file. |
module load samtools
# First, see the read counts per chromosome
# This shows how many reads align to Mt and Pt vs nuclear chromosomes
samtools idxstats 4_bam/SRR4733912.sorted.bam
# EXAMPLE OUTPUT (from actual data):
# 1 30427671 2018268 43655
# 2 19698289 2796545 47708
# 3 23459830 1738071 36100
# 4 18585056 1290746 26515
# 5 26975502 1795689 40246
# Mt 366924 2469591 35167 <-- ~2.4M reads = 20% of data!
# Pt 154478 44476436 524929 <-- ~44M reads = DOMINANT contamination
# Keep only nuclear chromosomes (exclude Mt and Pt)
samtools idxstats 4_bam/SRR4733912.sorted.bam \
| cut -f1 \
| grep -v Mt \
| grep -v Pt \
| xargs samtools view --threads 7 -b \
4_bam/SRR4733912.sorted.bam \
> 4_bam/SRR4733912.sorted.noorg.bam
samtools index 4_bam/SRR4733912.sorted.noorg.bam
# Verify: Mt and Pt should now show 0 reads
samtools idxstats 4_bam/SRR4733912.sorted.noorg.bam
| 📊samtools idxstats | Prints a table: chromosome name, length, mapped reads, unmapped reads. The fastest way to see if your organellar reads dominate the dataset (they will!). |
| ✂️cut -f1 | Cuts just the first field (column 1 = chromosome name) from the idxstats output. This gives a list of all chromosome names in the BAM. |
| 🚫grep -v Mt / grep -v Pt | The -v flag inverts the match — keep lines that do NOT contain "Mt" or "Pt". This filters the chromosome list to nuclear chromosomes only. |
| 🔧xargs samtools view -b | xargs takes the list of chromosome names from the pipe and passes them as arguments to samtools view. Samtools then extracts only reads mapping to those chromosomes. The -b flag outputs BAM format directly. |
|) trick: The whole organelle-removal command is one pipeline — idxstats | cut | grep | grep | xargs samtools view. Each | passes the output of one command directly as input to the next, in memory, without writing any intermediate files. Pipelines are a superpower in bioinformatics!module load macs2 # or: pip install macs2
mkdir -p 5_peaks
macs2 callpeak \
-t 4_bam/SRR4733912.sorted.noorg.bam \
-q 0.05 \
--broad \
-f BAMPE \
-n SRR4733912 \
-B \
--trackline \
--outdir 5_peaks \
2> 5_peaks/SRR4733912.peak.log &
# Key parameters explained:
# -q 0.05 : FDR cutoff for peak calling
# --broad : call broad peaks (ATAC open chromatin regions can be large)
# -f BAMPE : BAM paired-end format (uses actual fragment size, not estimate)
# -B : generate bedGraph pileup files for visualization
# --trackline: add UCSC browser track header
# Output files:
ls 5_peaks/
# SRR4733912_control_lambda.bdg -- local lambda model
# SRR4733912_peaks.broadPeak -- BED format peak calls
# SRR4733912_peaks.gappedPeak -- gapped peak format
# SRR4733912_peaks.xls -- peak statistics
# SRR4733912_treat_pileup.bdg -- read pileup bedGraph
# Count peaks
wc -l 5_peaks/SRR4733912_peaks.broadPeak
# Show top peaks by score
sort -k7,7nr 5_peaks/SRR4733912_peaks.broadPeak | head -20
| 🎯-t sample.bam | The treatment BAM file (your ATAC-seq data). MACS2 calls peaks here. |
| 📉-q 0.05 | FDR (false discovery rate) q-value cutoff. Peaks with q > 0.05 are discarded. Stricter (0.01) = fewer but higher confidence peaks. |
| 🌊--broad | Call broad (merged) peaks instead of sharp narrow peaks. ATAC-seq open chromatin regions can span hundreds to thousands of base pairs, so broad mode fits better than the narrow mode used for TF ChIP-seq. |
| 👫-f BAMPE | Input format: BAM Paired-End. MACS2 uses the actual fragment sizes inferred from read pairs rather than guessing from single reads. More accurate peak boundaries! |
| 🏷️-n SRR4733912 | Output file prefix name. All output files will start with this name. |
| 📈-B | Output a BedGraph pileup file — needed to generate BigWig tracks for IGV. |
MACS2 summary for this dataset (from the actual log):
# Program:macs2 callpeak # name = SRR4733912 # format = BAMPE # ChIP-seq file = SRR4733912.sorted.noorg.bam # qvalue cutoff = 5.00e-02 # Broad peak mode is on # Paired-End mode is on # mean fragment size = 139 bp from treatment # fragments after filtering in treatment: 4002590
module load bedtools
module load ucsc-tools # or download bedClip + bedGraphToBigWig from UCSC
mkdir -p 6_bigwig
# a) Generate chromosome sizes file from the genome
bioawk -c fastx '{print $name, length($seq)}' 0_genome/At_Genome \
> 6_bigwig/At_chr.sizes
# b) Clip bedGraph to correct chromosome boundaries
bedtools slop \
-i 5_peaks/SRR4733912_treat_pileup.bdg \
-g 6_bigwig/At_chr.sizes \
-b 0 \
| bedClip stdin 6_bigwig/At_chr.sizes \
6_bigwig/SRR4733912_treat_pileup.clipped.bdg
# c) Sort the clipped bedGraph
sort -k1,1 -k2,2n 6_bigwig/SRR4733912_treat_pileup.clipped.bdg \
> 6_bigwig/SRR4733912_treat_pileup.clipped.sorted.bdg
# d) Convert to BigWig
bedGraphToBigWig \
6_bigwig/SRR4733912_treat_pileup.clipped.sorted.bdg \
6_bigwig/At_chr.sizes \
6_bigwig/SRR4733912_treat_pileup.bw
echo "BigWig file ready: 6_bigwig/SRR4733912_treat_pileup.bw"
echo "Load this file in IGV with TAIR10 genome to view ATAC peaks"
| 📏bioawk -c fastx | bioawk reads FASTA files (-c fastx) and lets you use $name and $seq variables. Here we print each chromosome name and its length to make a "chrom sizes" file required by UCSC tools. |
| ✂️bedClip | Clips bedGraph entries to not exceed chromosome boundaries. MACS2 sometimes outputs coordinates slightly past the chromosome end due to smoothing — bedClip silently fixes this, preventing bedGraphToBigWig from crashing. |
| 📊bedGraphToBigWig | Converts the text bedGraph format to the binary BigWig format. BigWig is ~10× smaller and loads instantly in IGV/UCSC browser compared to bedGraph. The chromosome sizes file is required to set the coordinate boundaries. |
| 📑sort -k1,1 -k2,2n | Sorts the bedGraph by chromosome name (-k1,1 = column 1, alphabetically) then by start position (-k2,2n = column 2, numerically). bedGraphToBigWig requires sorted input — unsorted input causes an immediate crash. |
.bw file onto the tracks panel. Navigate to a known regulatory region (e.g., a gene promoter like FLC chr5:3,173,983) to see the pileup of open chromatin reads. Accessible regions appear as sharp peaks. Load the .broadPeak file alongside to see called peak boundaries.| File | Description | Expected Value |
|---|---|---|
SRR4733912.sorted.bam | All aligned reads | ~51M read pairs |
SRR4733912.sorted.noorg.bam | Nuclear reads only | ~4M read pairs (after Pt/Mt removal) |
SRR4733912_peaks.broadPeak | Called open chromatin peaks | ~15,000–25,000 peaks |
SRR4733912_treat_pileup.bw | BigWig track for IGV | Peaks visible at gene promoters/enhancers |
The bigwig file shows distinct peaks at promoters and enhancers. Comparing WT vs BRM mutant peaks reveals regions of altered chromatin accessibility regulated by the BRM remodeler.
After organelle removal, too few fragments remain. Check: did you align to the correct genome? Are chromosome names consistent (TAIR10 uses "1","2" not "Chr1","Chr2")? Run samtools flagstat to verify paired-end reads are present.
The chromosome sizes file must exactly match the genome used for alignment. Generate it from the same FASTA used by bowtie2-build with bioawk -c fastx '{print $name, length($seq)}' genome.fa. Do not use a chromosome sizes file from UCSC if you aligned to Ensembl (different chromosome naming).
Check organelle read removal — if you forgot this step, MACS2's background model is overwhelmed. Also verify -f BAMPE is set (not BAM) for paired-end data, and that the BAM contains paired reads (samtools flagstat should show properly paired reads).