Genome Assembly Pipeline

From raw long & short reads to a polished, scaffolded genome — hifiasm, Flye, SPAdes, Pilon, QUAST, BUSCO, and GenomeScope.

~75 min · Intermediate · Bash / CLI

1 Overview

Genome assembly reconstructs contiguous sequences (contigs) from millions of short or long reads. The choice of assembler depends on your sequencing technology:

| Technology | Read Length / Quality | Best Assembler | Typical N50 |
|---|---|---|---|
| PacBio HiFi | 10–25 kb, Q30+ | hifiasm | Chromosome-level |
| ONT Nanopore | 10–100+ kb, Q10–20 | Flye | Multi-Mb |
| Illumina | 150 bp PE | SPAdes | 10–100 kb |
| Hybrid | Long + Short | MaSuRCA / hybridSPAdes | Depends on coverage |

Thumb Rule: If you have HiFi reads at ≥30× coverage, hifiasm alone can give you near-complete chromosomes. For ONT, always polish with short reads afterward.
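
The 30× rule is only useful if you know your coverage, which is total sequenced bases divided by the expected genome size. A minimal sketch, assuming a standard 4-line-per-record FASTQ (plain or gzipped); `estimate_coverage` is a hypothetical helper name, not part of any tool in this tutorial:

```bash
# estimate_coverage FASTQ GENOME_SIZE_BP   (hypothetical helper)
# Sums read lengths and divides by the expected genome size in bp.
estimate_coverage() {
  # gzip -cdf decompresses .gz input and passes plain text through unchanged
  gzip -cdf "$1" |
    awk -v g="$2" '
      NR % 4 == 2 { bases += length($0) }   # the sequence is line 2 of each record
      END { printf "%.1f\n", bases / g }'
}
```

For example, `estimate_coverage reads.hifi.fastq.gz 500000000` prints the approximate fold coverage for a 500 Mb genome. The genome size itself is an estimate, so treat the result as a rough guide.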

2 Prerequisites & Environment

Bash
# Create a dedicated conda environment
conda create -n assembly -c bioconda -c conda-forge \
  hifiasm flye spades quast busco pilon \
  samtools bwa-mem2 minimap2 jellyfish genomescope2 multiqc

conda activate assembly

# Verify installations
hifiasm --version
flye --version
spades.py --version

3 Genome Profiling with GenomeScope

Before assembly, estimate genome size, heterozygosity, and repeat content from k-mer frequencies.

Bash
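# Count k-mers with jellyfish (k=21 is standard)
# jellyfish cannot read gzipped files directly, so stream the reads via process substitution
jellyfish count -C -m 21 -s 1G -t 16 -o kmer_counts.jf <(zcat reads.fastq.gz)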

# Generate histogram
jellyfish histo kmer_counts.jf > kmer_histo.txt

# Run GenomeScope2
genomescope2 -i kmer_histo.txt -o genomescope_output -k 21
GenomeScope2 Output (typical plant genome)

GenomeScope version 2.0
k = 21

property                 min            max
Heterozygosity           0.85%          0.92%
Genome Haploid Length    487,234,102    489,012,445
Genome Repeat Length     198,456,000    201,234,000
Genome Unique Length     286,778,102    289,778,445
Model Fit                97.82%         98.14%
Read Error Rate          0.32%          0.35%

Mnemonic

"Know How Repeats Scale" — KHRS

K-mer count → Heterozygosity → Repeat fraction → Size estimate. Always run GenomeScope before assembly.
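
To see where the size estimate comes from, here is a rough back-of-the-envelope version of the same idea: skip the low-frequency error k-mers, find the main coverage peak, and divide the total k-mer count by the peak depth. This is only a sketch under simplifying assumptions (a two-column jellyfish histogram, one dominant peak); `kmer_genome_size` is a hypothetical helper. In a highly heterozygous genome the tallest peak can be the half-coverage peak, which would roughly double the estimate, so GenomeScope's model fit remains the number to trust.

```bash
# kmer_genome_size HISTO   (hypothetical helper; HISTO = output of "jellyfish histo")
kmer_genome_size() {
  awk '
    { freq[NR] = $1; cnt[NR] = $2 }
    END {
      if (NR == 0) { print "empty histogram" > "/dev/stderr"; exit 1 }
      # 1) walk down the initial error-k-mer slope to the first valley
      v = 1
      while (v < NR && cnt[v + 1] < cnt[v]) v++
      # 2) take the tallest bin at or after the valley as the main peak
      p = v
      for (i = v; i <= NR; i++) if (cnt[i] > cnt[p]) p = i
      # 3) genome size ~ (k-mers past the valley) / (peak depth)
      for (i = v; i <= NR; i++) total += freq[i] * cnt[i]
      printf "peak_depth=%d est_size=%.0f\n", freq[p], total / freq[p]
    }' "$1"
}
```

Run it as `kmer_genome_size kmer_histo.txt` and compare the result with the GenomeScope haploid length; a large disagreement is a hint to inspect the histogram plot.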

4 hifiasm — PacBio HiFi Assembly

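hifiasm is the gold standard for HiFi reads. With HiFi data alone it outputs a primary assembly plus two partially phased haplotype assemblies by default; fully haplotype-resolved assemblies need Hi-C or trio data.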

Bash
# Basic hifiasm assembly
hifiasm -o my_assembly -t 32 reads.hifi.fastq.gz

# Convert GFA to FASTA (primary assembly)
awk '/^S/{print ">"$2; print $3}' my_assembly.bp.p_ctg.gfa > primary_contigs.fa

# Hi-C integrated phasing (haplotype-resolved contigs; chromosome-level scaffolding is a separate step):
hifiasm -o my_assembly -t 32 \
  --h1 hic_R1.fastq.gz --h2 hic_R2.fastq.gz \
  reads.hifi.fastq.gz
| Parameter | Default | When to Change |
|---|---|---|
| -t | 1 | Set to available CPU cores (16–64) |
| -l | 3 | Purge level: 0 = none, 3 = aggressive. Use 0 for inbred lines. |
| --h1/--h2 | none | Add Hi-C reads for chromosome-scale phasing |
| -s | 0.55 | Similarity threshold for purging. Lower for higher heterozygosity. |

5 Flye — ONT / CLR Assembly

Bash
# ONT reads
flye --nano-hq ont_reads.fastq.gz \
  --out-dir flye_output \
  --threads 32 \
  --genome-size 500m

# PacBio CLR reads
flye --pacbio-raw clr_reads.fastq.gz \
  --out-dir flye_clr \
  --threads 32 \
  --genome-size 500m
Thumb Rule: Use --nano-hq for Guppy/Dorado SUP basecalled reads (Q20+). Use --nano-raw for fast basecalling. Wrong flag = much worse assembly.

6 SPAdes — Illumina Short-Read Assembly

Bash
# Standard paired-end assembly
spades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
  -o spades_output -t 16 -m 128 --careful

# For metagenomics
metaspades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
  -o metaspades_output -t 16

7 Assembly Polishing

Long-read assemblies (especially ONT) benefit from short-read polishing to fix base-level errors.

Bash
# Step 1: Align short reads to assembly
bwa-mem2 index assembly.fasta
bwa-mem2 mem -t 16 assembly.fasta reads_R1.fq.gz reads_R2.fq.gz \
  | samtools sort -@ 8 -o aligned.bam
samtools index aligned.bam

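# Step 2: Run Pilon (the conda package installs a `pilon` wrapper; use it in place of
# `java -jar pilon.jar` if you have no local jar file)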
java -Xmx64G -jar pilon.jar \
  --genome assembly.fasta \
  --frags aligned.bam \
  --output polished \
  --changes --threads 16

# Step 3: Iterate 2-3 rounds (diminishing returns after that)
# Repeat alignment + pilon on polished.fasta
Mnemonic

"Assemble → Polish → Quality-check" — APQ

Always follow this order. Never skip polishing for ONT assemblies. HiFi assemblies may skip polishing (already Q30+).
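
The "iterate 2-3 rounds" step can be sketched as a loop that feeds each round's output into the next and stops early once Pilon reports no edits. This is an illustration, not a tested production script: `run` and `polish_rounds` are hypothetical helpers, the commands mirror the ones shown in this section, and file names are assumed to contain no spaces. Setting `DRY_RUN=1` prints the commands instead of executing them.

```bash
# run CMD: execute a command string, or only print it when DRY_RUN=1
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$1"; else eval "$1"; fi
}

# polish_rounds ASSEMBLY R1 R2 [MAX_ROUNDS]   (hypothetical helper)
# The body runs in a subshell, so its variables do not leak into the caller.
polish_rounds() (
  asm=$1; r1=$2; r2=$3; max=${4:-3}; i=1
  while [ "$i" -le "$max" ]; do
    run "bwa-mem2 index $asm" || exit 1
    run "bwa-mem2 mem -t 16 $asm $r1 $r2 | samtools sort -@ 8 -o round$i.bam" || exit 1
    run "samtools index round$i.bam" || exit 1
    run "java -Xmx64G -jar pilon.jar --genome $asm --frags round$i.bam --output round$i --changes" || exit 1
    # An empty .changes file means Pilon made no edits, so stop iterating
    if [ "${DRY_RUN:-0}" != 1 ] && [ ! -s "round$i.changes" ]; then break; fi
    asm=round$i.fasta   # the next round polishes this round's output
    i=$((i + 1))
  done
  echo "final assembly: $asm"
)
```

Try `DRY_RUN=1 polish_rounds assembly.fasta reads_R1.fq.gz reads_R2.fq.gz 2` first to review the commands before committing compute time.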

8 QUAST — Assembly Quality Metrics

Bash
# Compare multiple assemblies
quast -o quast_results \
  -r reference.fasta \
  -g genes.gff \
  -t 8 \
  hifiasm_contigs.fa flye_contigs.fa spades_scaffolds.fa

# Without reference
quast -o quast_results -t 8 assembly.fasta
QUAST Output (example)

| Assembly | hifiasm | Flye | SPAdes |
|---|---|---|---|
| # contigs | 45 | 312 | 4,521 |
| Total length | 498,234,102 | 501,456,789 | 485,123,456 |
| Largest contig | 28,456,789 | 8,234,567 | 234,567 |
| N50 | 18,234,567 | 2,345,678 | 45,678 |
| N75 | 12,456,789 | 1,234,567 | 23,456 |
| L50 | 11 | 56 | 2,345 |
| GC (%) | 36.42 | 36.45 | 36.38 |
| # misassemblies | 3 | 28 | 156 |
| Genome fraction (%) | 98.7 | 97.2 | 94.1 |

| Metric | What It Means | Good Values |
|---|---|---|
| N50 | 50% of assembly in contigs ≥ this size | Higher = better. Chromosome-level: ≥10 Mb |
| L50 | Fewest contigs covering 50% of assembly | Lower = better. Ideal: about half the chromosome count or fewer |
| Genome fraction | % of reference covered | ≥95% for good assembly |
| Misassemblies | Structural errors vs reference | Lower = better. <10 for HiFi |
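
To make the N50 and L50 definitions concrete, the sketch below computes them directly from a FASTA file: collect contig lengths, sort them largest first, and walk down until half the total length is covered. `assembly_n50` is a hypothetical helper for illustration; QUAST remains the tool to use for real reports.

```bash
# assembly_n50 FASTA   (hypothetical helper)
assembly_n50() {
  awk '
    /^>/ { if (len) print len; len = 0; next }   # header: flush the previous contig length
    { len += length($0) }                        # sequence may be wrapped over many lines
    END { if (len) print len }' "$1" |
  sort -rn |
  awk '
    { l[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        run += l[i]
        if (run >= total / 2) {   # this contig crosses the 50% mark
          printf "contigs=%.0f total=%.0f N50=%.0f L50=%.0f\n", NR, total, l[i], i
          exit
        }
      }
    }'
}
```

With contigs of 10, 8, 6, 4, and 2 bp (total 30), the running sum reaches 15 at the second contig, so N50 = 8 and L50 = 2.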

9 BUSCO — Gene Completeness

Bash
# Run BUSCO with appropriate lineage
busco -i assembly.fasta \
  -o busco_results \
  -m genome \
  -l embryophyta_odb10 \
  -c 16

# Common lineages:
# embryophyta_odb10  (plants)
# mammalia_odb10     (mammals)
# vertebrata_odb10   (vertebrates)
# insecta_odb10      (insects)
# fungi_odb10        (fungi)
# bacteria_odb10     (bacteria)
BUSCO Results (good plant assembly)

C:97.2%[S:94.8%,D:2.4%],F:1.3%,M:1.5%,n:1614
1569    Complete BUSCOs (C)
1530    Complete and single-copy BUSCOs (S)
39      Complete and duplicated BUSCOs (D)
21      Fragmented BUSCOs (F)
24      Missing BUSCOs (M)
1614    Total BUSCO groups searched

Thumb Rule: C ≥ 95% = excellent. C 90–95% = good. C < 85% = investigate missing regions. High D (duplicated) may indicate un-purged haplotigs.
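
When comparing many assemblies, it can help to apply this thumb rule automatically to the one-line BUSCO summary. The sketch below is a hypothetical helper (`busco_grade`), assuming the summary string has the `C:..%[S:..%,D:..%]` layout shown above; the D > 10% warning threshold is taken from the troubleshooting section of this tutorial, and the 85–90% range is reported as unclassified because the rule does not cover it.

```bash
# busco_grade "C:97.2%[S:94.8%,D:2.4%],F:1.3%,M:1.5%,n:1614"   (hypothetical helper)
busco_grade() {
  printf '%s\n' "$1" |
    # extract the Complete and Duplicated percentages
    sed -n 's/.*C:\([0-9.]*\)%\[S:[0-9.]*%,D:\([0-9.]*\)%].*/\1 \2/p' |
    awk '{
      c = $1; d = $2
      if (c >= 95)      grade = "excellent"
      else if (c >= 90) grade = "good"
      else if (c < 85)  grade = "investigate"
      else              grade = "unclassified"   # 85-90% is not covered by the rule
      note = (d > 10) ? " (high duplication: check for haplotigs)" : ""
      print "C=" c "% D=" d "% -> " grade note
    }'
}
```

The summary line can be pulled from a BUSCO short summary file with something like `grep 'C:' short_summary*.txt`, then passed to the function as a single quoted argument.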

10 Scaffolding Strategies

| Tool | Approach | Data |
|---|---|---|
| YaHS | Hi-C scaffolding (recommended, fast) | Hi-C |
| 3D-DNA | Aiden lab Hi-C scaffolder + Juicebox for manual curation | Hi-C |
| SALSA2 | Hi-C scaffolding with mis-join correction | Hi-C |
| RagTag | Reference-guided scaffolding | Reference |
| ntLink | Long-read scaffolding without alignment | Long-Read |
| pyScaf | Synteny-based scaffolding | Synteny |

11 AI Prompt Guide

Adapt Assembly Script to Your Data

Copy and fill in the brackets:

I have [TECHNOLOGY: PacBio HiFi / ONT / Illumina] reads for [ORGANISM].
Estimated genome size: [SIZE, e.g., 500 Mb].
Estimated heterozygosity: [LOW (<0.5%) / MEDIUM (0.5-2%) / HIGH (>2%)].
Read coverage: [COVERAGE]x.
Ploidy: [DIPLOID / POLYPLOID / HAPLOID].

Please write a complete SLURM script that:
1. Runs the appropriate assembler with optimal parameters
2. Polishes if needed
3. Runs QUAST and BUSCO for quality assessment
4. Uses [LINEAGE, e.g., embryophyta_odb10] for BUSCO
5. Includes proper SLURM headers for [CORES] CPUs and [RAM] GB RAM

12 Common Errors & Troubleshooting

hifiasm: "Killed" or OOM

hifiasm loads all reads into memory. For a 500 Mb genome at 30×, expect ~100 GB RAM. Request enough memory in SLURM: #SBATCH --mem=120G.

Flye: Very fragmented assembly

Usually caused by low coverage (<20×). Also double-check the value passed to --genome-size, since a wrong estimate can mislead the assembler. For noisy reads, try --min-overlap 3000.

SPAdes: "Not enough memory"

Use -m to set memory limit. For large genomes (>1 Gb), SPAdes is not recommended — switch to MEGAHIT or a long-read assembler.

BUSCO: High duplicated % (D > 10%)

Likely un-purged haplotigs. Run purge_dups or use hifiasm -l 3. For polyploids, high D is expected and normal.

13 Organism-Specific Notes

Plants
  • Highly repetitive genomes (>70% repeats) — long reads essential
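  • Polyploidy is common: use hifiasm --primary for a primary/alternate assembly and set --n-hap to the expected number of haplotypes. Flye has no ploidy setting, but --keep-haplotypes stops it from collapsing alternative haplotypes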
  • Chloroplast/mitochondria contamination — filter organellar reads first or assemble separately
  • BUSCO lineage: embryophyta_odb10 or eudicots_odb10
Animals
  • Sex chromosomes may assemble poorly — consider separate assembly
  • Heterozygosity varies widely (inbred lab strains vs. wild populations)
  • BUSCO: mammalia_odb10, insecta_odb10, vertebrata_odb10
Microbes
  • Small genomes: SPAdes or Flye work well (add Flye's --meta for metagenomes or very uneven coverage)
  • Circular chromosomes — Flye handles circularization automatically
  • BUSCO: bacteria_odb10 or fungi_odb10

14 Complete Assembler Comparison

| Assembler | Notes | Data |
|---|---|---|
| hifiasm | Best for PacBio HiFi. Partially phased by default. | HiFi |
| Flye | Best for ONT. Handles high error rate. | ONT/CLR |
| SPAdes | Best short-read assembler for small genomes. | Illumina |
| Verkko | Telomere-to-telomere from HiFi+ONT. | T2T |
| MaSuRCA | Hybrid assembler (long+short reads). | Hybrid |
| Canu | Original long-read assembler. Slower than Flye. | Long-Read |
| wtdbg2 (Redbean) | Ultra-fast but less accurate. | Long-Read |
| MEGAHIT | Memory-efficient short-read assembler. | Illumina |
| Shasta | Fast ONT assembler (single machine). | ONT |
| NextDenovo | Fast long-read assembler + NextPolish. | Long-Read |

Mnemonics & Thumb Rules

Mnemonic

"Nifty Fifty" — N50

Sort contigs largest→smallest. Walk down until you've covered 50% of total length. That contig's size = N50. Think of it as a length-weighted median contig size.

Thumb Rule

Coverage: "30-30-30"

30× HiFi = excellent assembly. 30× ONT + 30× Illumina = good hybrid. Below 20× of any technology = expect fragmentation.

Mnemonic

"Better Assemblies Need Quality" — BANQ

BUSCO → Assembly stats → N50/L50 → QUAST. Run all four checks on every assembly.

Thumb Rule — Assembler Choice: HiFi? → hifiasm. ONT? → Flye. Illumina only? → SPAdes (small) or MEGAHIT (large). Both long+short? → MaSuRCA or assemble long then polish with short.
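
In a pipeline wrapper, this decision rule can be written down as a small lookup. The sketch below is only an encoding of the thumb rule above; `suggest_assembler` and its argument values (`hifi`, `ont`, `illumina`, `hybrid`, and `small`/`large` for genome size) are hypothetical names chosen for illustration.

```bash
# suggest_assembler TECHNOLOGY [small|large]   (hypothetical helper)
# The body runs in a subshell, so "exit 1" only ends the function call.
suggest_assembler() (
  tech=$1; size_class=${2:-small}
  case "$tech" in
    hifi)     echo "hifiasm" ;;
    ont)      echo "flye" ;;
    illumina) if [ "$size_class" = large ]; then echo "megahit"; else echo "spades"; fi ;;
    hybrid)   echo "masurca (or long-read assembly + short-read polishing)" ;;
    *)        echo "unknown technology: $tech" >&2; exit 1 ;;
  esac
)
```

For instance, `suggest_assembler illumina large` prints `megahit`, matching the rule for large short-read-only genomes.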

Summary

You can now:
  1. Profile genomes with GenomeScope before assembly
  2. Run hifiasm, Flye, or SPAdes for your technology
  3. Polish assemblies with Pilon
  4. Evaluate with QUAST metrics and BUSCO completeness
  5. Scaffold to chromosome-level with Hi-C