Genome Assembly — hifiasm, Flye, SPAdes, QUAST & BUSCO

1 Overview

Genome assembly reconstructs contiguous sequences (contigs) from millions of short or long reads. The choice of assembler depends on your sequencing technology:

Technology	Read Length	Best Assembler	Typical N50
PacBio HiFi	10–25 kb, Q30+	hifiasm	Chromosome-level
ONT Nanopore	10–100+ kb, Q10–20	Flye	Multi-Mb
Illumina	150 bp PE	SPAdes	10–100 kb
Hybrid	Long + Short	MaSuRCA / hybridSPAdes	Depends on coverage

Thumb Rule: With HiFi reads at ≥30× coverage, hifiasm alone can often produce near-chromosome-scale contigs. ONT assemblies should be polished afterward, typically with short reads.
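The 30× figure is easy to check before committing to an assembler: depth is total sequenced bases divided by haploid genome size. A minimal sketch, assuming standard 4-line FASTQ records on stdin; the function name `estimate_coverage` is illustrative and not part of any tool.

```bash
# Estimate sequencing depth: total bases in a FASTQ stream / genome size.
# Assumes 4-line FASTQ records (sequence on line 2 of each record, not wrapped).
# $1 = haploid genome size in bp (your own estimate, e.g. from GenomeScope).
estimate_coverage() {
  awk -v g="$1" '
    NR % 4 == 2 { bases += length($0) }   # sequence line of each record
    END {
      if (g <= 0) { print "genome size must be > 0" > "/dev/stderr"; exit 1 }
      printf "%.1f\n", bases / g
    }
  '
}
```

For example, `zcat reads.hifi.fastq.gz | estimate_coverage 500000000` prints the approximate depth for a 500 Mb genome.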

2 Prerequisites & Environment

Bash

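# Create a dedicated conda environment
# (conda-forge is listed first, as Bioconda recommends; the k-mer counter is packaged as kmer-jellyfish)
conda create -n assembly -c conda-forge -c bioconda \
  hifiasm flye spades quast busco pilon \
  samtools bwa-mem2 minimap2 kmer-jellyfish genomescope2 multiqc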

conda activate assembly

# Verify installations
hifiasm --version
flye --version
spades.py --version

3 Genome Profiling with GenomeScope

Before assembly, estimate genome size, heterozygosity, and repeat content from k-mer frequencies.

Bash

# Count k-mers with jellyfish (k=21 is standard)
# jellyfish cannot read gzipped files directly, so decompress on the fly
jellyfish count -C -m 21 -s 1G -t 16 -o kmer_counts.jf <(zcat reads.fastq.gz)

# Generate histogram
jellyfish histo kmer_counts.jf > kmer_histo.txt

# Run GenomeScope2
genomescope2 -i kmer_histo.txt -o genomescope_output -k 21

GenomeScope2 Output (typical plant genome)

GenomeScope version 2.0
k = 21

property	min	max
Heterozygosity	0.85%	0.92%
Genome Haploid Length	487,234,102	489,012,445
Genome Repeat Length	198,456,000	201,234,000
Genome Unique Length	286,778,102	289,778,445
Model Fit	97.82%	98.14%
Read Error Rate	0.32%	0.35%

Mnemonic

"Know How Repeats Scale" — KHRS

K-mer count → Heterozygosity → Repeat fraction → Size estimate. Always run GenomeScope before assembly.
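To see what the size estimate rests on, the core arithmetic can be approximated by hand: total k-mers divided by the depth of the main peak. The sketch below is a rough cross-check only, not GenomeScope's model fit. The name `kmer_genome_size` and its `min_depth` cutoff (default 5, used to drop low-depth error k-mers) are assumptions made for illustration. In a highly heterozygous sample the tallest peak may be the heterozygous one, which would roughly double the estimate.

```bash
# Rough haploid genome size from a k-mer histogram ("depth count" per line,
# as written by `jellyfish histo`).
# size ~ (k-mer instances above the error cutoff) / (depth of the tallest peak)
# $1 = histogram file, $2 = hypothetical minimum depth to count (default 5)
kmer_genome_size() {
  awk -v min="${2:-5}" '
    $1 >= min {
      total += $1 * $2                                   # k-mer instances at this depth
      if ($2 > peak_count) { peak_count = $2; peak_depth = $1 }
    }
    END {
      if (peak_depth == 0) { print "no peak found" > "/dev/stderr"; exit 1 }
      printf "peak_depth=%d genome_size=%.0f\n", peak_depth, total / peak_depth
    }
  ' "$1"
}
```

Run it as `kmer_genome_size kmer_histo.txt` and compare the result with the Genome Haploid Length that GenomeScope reports; a large disagreement is a cue to inspect the histogram plot.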

4 hifiasm — PacBio HiFi Assembly

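hifiasm is the de facto standard assembler for HiFi reads. With HiFi reads alone it outputs a primary assembly plus two partially phased haplotype assemblies; adding Hi-C or trio data yields fully haplotype-resolved assemblies.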

Bash

# Basic hifiasm assembly
hifiasm -o my_assembly -t 32 reads.hifi.fastq.gz

# Convert GFA to FASTA (primary assembly)
awk '/^S/{print ">"$2; print $3}' my_assembly.bp.p_ctg.gfa > primary_contigs.fa

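# With Hi-C reads: haplotype-resolved contigs (outputs are named my_assembly.hic.*;
# scaffolding to chromosomes is a separate step, see Section 10)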
hifiasm -o my_assembly -t 32 \
  --h1 hic_R1.fastq.gz --h2 hic_R2.fastq.gz \
  reads.hifi.fastq.gz


Parameter	Default	When to Change
`-t`	1	Set to available CPU cores (16–64)
`-l`	3	Purge level: 0=none, 3=aggressive. Use 0 for inbred lines.
`--h1/--h2`	none	Add paired Hi-C reads for a haplotype-resolved assembly (contigs still need scaffolding)
`-s`	0.55	Similarity threshold for purging. Lower for higher heterozygosity.
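A quick sanity check after assembly is to compare the primary contig total against the GenomeScope haploid length. A minimal sketch, assuming the GFA stores sequences inline in column 3 (the same assumption the `awk` conversion above makes); `gfa_summary` is a hypothetical helper name.

```bash
# Summarise the S (segment) lines of a GFA file: contig count and total bases.
# $1 = GFA file with inline sequences, e.g. my_assembly.bp.p_ctg.gfa
gfa_summary() {
  awk -F'\t' '
    $1 == "S" { n++; bp += length($3) }   # column 3 holds the contig sequence
    END { printf "contigs=%d total_bp=%.0f\n", n, bp }
  ' "$1"
}
```

A total far above the haploid estimate may point to retained haplotigs, which is where `-l` and `-s` come in.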

5 Flye — ONT / CLR Assembly

Bash

# ONT reads
flye --nano-hq ont_reads.fastq.gz \
  --out-dir flye_output \
  --threads 32 \
  --genome-size 500m

# PacBio CLR reads
flye --pacbio-raw clr_reads.fastq.gz \
  --out-dir flye_clr \
  --threads 32 \
  --genome-size 500m


Thumb Rule: Use --nano-hq for reads basecalled with Guppy 5+ or Dorado in SUP mode (roughly Q20, <5% error). Use --nano-raw for older or lower-accuracy basecalls. The wrong flag can noticeably degrade the assembly.

6 SPAdes — Illumina Short-Read Assembly

Bash

# Standard paired-end assembly
spades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
  -o spades_output -t 16 -m 128 --careful

# For metagenomics
metaspades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
  -o metaspades_output -t 16


7 Assembly Polishing

Long-read assemblies (especially ONT) benefit from short-read polishing to fix base-level errors.

Bash

# Step 1: Align short reads to assembly
bwa-mem2 index assembly.fasta
bwa-mem2 mem -t 16 assembly.fasta reads_R1.fq.gz reads_R2.fq.gz \
  | samtools sort -@ 8 -o aligned.bam
samtools index aligned.bam

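# Step 2: Run Pilon
# (a conda install provides a `pilon` wrapper; use it in place of `java -jar pilon.jar` if you have no local jar)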
java -Xmx64G -jar pilon.jar \
  --genome assembly.fasta \
  --frags aligned.bam \
  --output polished \
  --changes --threads 16

# Step 3: Iterate 2-3 rounds (diminishing returns after that)
# Repeat alignment + pilon on polished.fasta
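
One way to judge diminishing returns is to count the edits Pilon reports each round: with `--changes` and `--output polished`, Pilon writes one line per edit to `polished.changes`. The helpers below are a sketch; the function names and the default threshold of 100 are hypothetical and should be tuned to your genome size.

```bash
# Count edits recorded by one Pilon round (one non-empty line per change).
# $1 = <output prefix>.changes, e.g. polished.changes
count_changes() {
  awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Succeed (exit 0) if the round still made more edits than the cutoff.
# $1 = changes file, $2 = hypothetical cutoff (default 100)
needs_another_round() {
  [ "$(count_changes "$1")" -gt "${2:-100}" ]
}
```

In a polishing loop this could read `needs_another_round polished.changes || break`, so iteration stops once a round changes almost nothing.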


Mnemonic

"Assemble → Polish → Quality-check" — APQ

Always follow this order. Never skip polishing for ONT assemblies. HiFi assemblies may skip polishing (already Q30+).

8 QUAST — Assembly Quality Metrics

Bash

# Compare multiple assemblies
quast -o quast_results \
  -r reference.fasta \
  -g genes.gff \
  -t 8 \
  hifiasm_contigs.fa flye_contigs.fa spades_scaffolds.fa

# Without reference
quast -o quast_results -t 8 assembly.fasta

QUAST Output (example)

Assembly	hifiasm	Flye	SPAdes
# contigs	45	312	4,521
Total length	498,234,102	501,456,789	485,123,456
Largest contig	28,456,789	8,234,567	234,567
N50	18,234,567	2,345,678	45,678
N75	12,456,789	1,234,567	23,456
L50	11	56	2,345
GC (%)	36.42	36.45	36.38
# misassemblies	3	28	156
Genome fraction (%)	98.7	97.2	94.1

Metric	What It Means	Good Values
N50	50% of assembly in contigs ≥ this size	Higher = better. Chromosome-level: ≥10 Mb
L50	Fewest contigs covering 50% of assembly	Lower = better. Chromosome-level: about half the chromosome count or fewer
Genome fraction	% of reference covered	≥95% for good assembly
Misassemblies	Structural errors vs reference	Lower = better. <10 for HiFi
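
The N50 and L50 definitions in the table can be reproduced with standard tools, which is handy for a quick check without running QUAST. A minimal sketch (the function name `n50_l50` is illustrative). QUAST ignores contigs shorter than 500 bp by default, so its numbers may differ slightly on fragmented assemblies.

```bash
# Compute N50 and L50 from a FASTA file (wrapped sequence lines are handled).
# $1 = assembly FASTA
n50_l50() {
  awk '
    /^>/ { if (len > 0) print len; len = 0; next }   # header: emit previous contig length
         { len += length($0) }                        # accumulate sequence lines
    END  { if (len > 0) print len }
  ' "$1" | sort -rn | awk '
    { lens[NR] = $1; total += $1 }                    # lengths arrive largest first
    END {
      half = total / 2
      for (i = 1; i <= NR; i++) {
        cum += lens[i]
        # first contig at which the running sum reaches half the assembly
        if (cum >= half) { printf "N50=%d L50=%d\n", lens[i], i; exit }
      }
    }
  '
}
```

Usage: `n50_l50 primary_contigs.fa`. The first `awk` pass emits one length per contig, `sort -rn` orders them largest first, and the second pass walks down the list until half the total length is covered.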

9 BUSCO — Gene Completeness

Bash

# Run BUSCO with appropriate lineage
busco -i assembly.fasta \
  -o busco_results \
  -m genome \
  -l embryophyta_odb10 \
  -c 16

# Common lineages:
# embryophyta_odb10  (plants)
# mammalia_odb10     (mammals)
# vertebrata_odb10   (vertebrates)
# insecta_odb10      (insects)
# fungi_odb10        (fungi)
# bacteria_odb10     (bacteria)

BUSCO Results (good plant assembly)

C:97.2%[S:94.8%,D:2.4%],F:1.3%,M:1.5%,n:1614
1569	Complete BUSCOs (C)
1530	Complete and single-copy BUSCOs (S)
39	Complete and duplicated BUSCOs (D)
21	Fragmented BUSCOs (F)
24	Missing BUSCOs (M)
1614	Total BUSCO groups searched

Thumb Rule: C ≥ 95% = excellent. C 90–95% = good. C < 85% = investigate missing regions. High D (duplicated) may indicate un-purged haplotigs.
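
The thresholds above can be applied to the one-line BUSCO summary automatically. A sketch, assuming the summary string has the `C:..%[S:..%,D:..%]` layout that BUSCO prints; `busco_verdict` is a hypothetical helper, the label `borderline` for the 85–90% gap is an assumption, and the D > 10% flag follows the troubleshooting section.

```bash
# Classify a BUSCO one-line summary using the thumb-rule thresholds.
# $1 = summary string, e.g. 'C:97.2%[S:94.8%,D:2.4%],F:1.3%,M:1.5%,n:1614'
busco_verdict() {
  printf '%s\n' "$1" | awk '
    {
      c = $0; sub(/^.*C:/, "", c); sub(/%.*$/, "", c)   # complete %
      d = $0; sub(/^.*D:/, "", d); sub(/%.*$/, "", d)   # duplicated %
      if (c + 0 >= 95)      { verdict = "excellent" }
      else if (c + 0 >= 90) { verdict = "good" }
      else if (c + 0 >= 85) { verdict = "borderline" }  # gap in the thumb rule (assumption)
      else                  { verdict = "investigate" }
      if (d + 0 > 10) { verdict = verdict " high-duplication" }
      print verdict
    }'
}
```

Called on the example summary above, the function reports `excellent`; a result ending in `high-duplication` is a hint to look for un-purged haplotigs.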

10 Scaffolding Strategies

Tool	Description	Data Type
YaHS	Hi-C scaffolding (recommended, fast)	Hi-C
3D-DNA	Aiden lab Hi-C scaffolder + Juicebox for manual curation	Hi-C
SALSA2	Hi-C scaffolding with mis-join correction	Hi-C
RagTag	Reference-guided scaffolding	Reference
ntLink	Long-read scaffolding without alignment	Long-Read
pyScaf	Synteny-based scaffolding	Synteny

11 AI Prompt Guide

Adapt Assembly Script to Your Data

Copy and fill in the brackets:

I have [TECHNOLOGY: PacBio HiFi / ONT / Illumina] reads for [ORGANISM].
Estimated genome size: [SIZE, e.g., 500 Mb].
Estimated heterozygosity: [LOW (<0.5%) / MEDIUM (0.5-2%) / HIGH (>2%)].
Read coverage: [COVERAGE]x.
Ploidy: [DIPLOID / POLYPLOID / HAPLOID].

Please write a complete SLURM script that:
1. Runs the appropriate assembler with optimal parameters
2. Polishes if needed
3. Runs QUAST and BUSCO for quality assessment
4. Uses [LINEAGE, e.g., embryophyta_odb10] for BUSCO
5. Includes proper SLURM headers for [CORES] CPUs and [RAM] GB RAM

12 Common Errors & Troubleshooting

hifiasm: "Killed" or OOM

hifiasm loads all reads into memory. For a 500 Mb genome at 30×, expect ~100 GB RAM. Request enough memory in SLURM: #SBATCH --mem=120G.

Flye: Very fragmented assembly

Usually caused by low coverage (<20×). Verify the value passed to --genome-size, since a wrong estimate misleads the assembler. For noisy reads, also try setting the overlap length explicitly with --min-overlap 3000.

SPAdes: "Not enough memory"

Use -m to set memory limit. For large genomes (>1 Gb), SPAdes is not recommended — switch to MEGAHIT or a long-read assembler.

BUSCO: High duplicated % (D > 10%)

Likely un-purged haplotigs. Run purge_dups on the assembly, or re-run hifiasm with purging enabled (-l 3 is already the default) and try a smaller -s. For polyploids, high D is expected and normal.

13 Organism-Specific Notes

Plants

Highly repetitive genomes (>70% repeats) — long reads essential
Polyploidy common: use hifiasm --primary and set --n-hap to the ploidy; Flye has no ploidy option, but --keep-haplotypes stops it from collapsing alternative haplotypes
Chloroplast/mitochondria contamination — filter organellar reads first or assemble separately
BUSCO lineage: embryophyta_odb10 or eudicots_odb10

Animals

Sex chromosomes may assemble poorly — consider separate assembly
Heterozygosity varies widely (inbred lab strains vs. wild populations)
BUSCO: mammalia_odb10, insecta_odb10, vertebrata_odb10

Microbes

Small genomes: SPAdes (short reads) or Flye (long reads) work well; add --meta to Flye for metagenomes or uneven coverage
Circular chromosomes — Flye handles circularization automatically
BUSCO: bacteria_odb10 or fungi_odb10

14 Complete Assembler Comparison

Assembler	Summary	Data Type
hifiasm	Best for PacBio HiFi. Phased by default.	HiFi
Flye	Best for ONT. Handles high error rate.	ONT/CLR
SPAdes	Best short-read assembler for small genomes.	Illumina
Verkko	Telomere-to-telomere from HiFi+ONT.	T2T
MaSuRCA	Hybrid assembler (long+short reads).	Hybrid
Canu	Original long-read assembler. Slower than Flye.	Long-Read
wtdbg2 (Redbean)	Ultra-fast but less accurate.	Long-Read
MEGAHIT	Memory-efficient short-read assembler.	Illumina
Shasta	Fast ONT assembler (single machine).	ONT
NextDenovo	Fast long-read assembler + NextPolish.	Long-Read

Mnemonics & Thumb Rules

Mnemonic

"Nifty Fifty" — N50

Sort contigs largest→smallest. Walk down until you've covered 50% of total length. That contig's size = N50. Think of it as a length-weighted median contig size.

Thumb Rule

Coverage: "30-30-30"

30× HiFi = excellent assembly. 30× ONT + 30× Illumina = good hybrid. Below 20× of any technology = expect fragmentation.

Mnemonic

"Better Assemblies Need Quality" — BANQ

BUSCO → Assembly stats → N50/L50 → QUAST. Run all four checks on every assembly.

Thumb Rule — Assembler Choice: HiFi? → hifiasm. ONT? → Flye. Illumina only? → SPAdes (small) or MEGAHIT (large). Both long+short? → MaSuRCA or assemble long then polish with short.
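
The decision rule can be written down as a small lookup, for instance at the top of a pipeline script. This is only a sketch of the thumb rule above; the function name and the technology keywords (`hifi`, `ont`, `clr`, `illumina`, `hybrid`) are assumptions, and routing CLR reads to Flye follows Section 5.

```bash
# Map a sequencing technology keyword to the assembler suggested by the thumb rule.
# $1 = technology keyword, $2 = genome size class for Illumina ("small" or "large")
choose_assembler() {
  case "$1" in
    hifi)     echo "hifiasm" ;;
    ont|clr)  echo "flye" ;;
    illumina) if [ "${2:-small}" = "large" ]; then echo "megahit"; else echo "spades"; fi ;;
    hybrid)   echo "masurca" ;;     # or assemble long reads, then polish with short reads
    *)        echo "unknown technology: $1" >&2; return 1 ;;
  esac
}
```

For example, `choose_assembler illumina large` prints `megahit`, while an unrecognised keyword returns a non-zero status so a calling script can stop early.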

Summary

You can now:

Profile genomes with GenomeScope before assembly
Run hifiasm, Flye, or SPAdes for your technology
Polish assemblies with Pilon
Evaluate with QUAST metrics and BUSCO completeness
Scaffold to chromosome-level with Hi-C
