From raw long & short reads to a polished, scaffolded genome — hifiasm, Flye, SPAdes, Pilon, QUAST, BUSCO, and GenomeScope.
Genome assembly reconstructs contiguous sequences (contigs) from millions of short or long reads. The choice of assembler depends on your sequencing technology:
| Technology | Read Length | Best Assembler | Typical N50 |
|---|---|---|---|
| PacBio HiFi | 10–25 kb, Q30+ | hifiasm | Chromosome-level |
| ONT Nanopore | 10–100+ kb, Q10–20 | Flye | Multi-Mb |
| Illumina | 150 bp PE | SPAdes | 10–100 kb |
| Hybrid | Long + Short | MaSuRCA / hybridSPAdes | Depends on coverage |
# Create a dedicated conda environment
conda create -n assembly -c bioconda -c conda-forge \
    hifiasm flye spades quast busco pilon \
    samtools bwa-mem2 minimap2 jellyfish genomescope2 multiqc
conda activate assembly
# Verify installations
hifiasm --version
flye --version
spades.py --version
Before assembly, estimate genome size, heterozygosity, and repeat content from k-mer frequencies.
# Count k-mers with jellyfish (k=21 is standard)
# jellyfish does not read gzipped input directly, so decompress on the fly
jellyfish count -C -m 21 -s 1G -t 16 -o kmer_counts.jf <(gzip -dc reads.fastq.gz)
# Generate histogram
jellyfish histo -t 16 kmer_counts.jf > kmer_histo.txt
# Run GenomeScope2
genomescope2 -i kmer_histo.txt -o genomescope_output -k 21
"Know How Repeats Scale" — KHRS
K-mer count → Heterozygosity → Repeat fraction → Size estimate. Always run GenomeScope before assembly.
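To make the size estimate less of a black box, here is a minimal sketch of the arithmetic behind it: total k-mers divided by the depth of the main coverage peak. The function name `estimate_genome_size` is hypothetical, and the peak finding is deliberately naive (it assumes a single dominant peak after the error trough), so on heterozygous genomes it may lock onto the half-coverage peak and overestimate. GenomeScope's fitted model remains the number to trust.

```shell
#!/usr/bin/env bash
# Rough genome-size estimate from a jellyfish histogram (two columns: depth, count).
# Assumption: one dominant coverage peak located after the low-depth error trough.
estimate_genome_size() {
    awk '
        { depth[NR] = $1; count[NR] = $2 }
        END {
            # 1. Error trough: the last row before counts start rising again.
            trough = 1
            for (i = 2; i <= NR; i++) {
                if (count[i] > count[i-1]) { trough = i - 1; break }
            }
            # 2. Main peak: the highest count at or after the trough.
            peak = trough
            for (i = trough; i <= NR; i++) {
                if (count[i] > count[peak]) peak = i
            }
            # 3. Genome size = (k-mers at or above the trough) / (peak depth).
            total = 0
            for (i = trough; i <= NR; i++) total += depth[i] * count[i]
            printf "%.0f\n", total / depth[peak]
        }
    ' "$1"
}
```

Calling `estimate_genome_size kmer_histo.txt` prints an estimate in base pairs that can be compared against the GenomeScope report as a sanity check.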
hifiasm is the gold standard for HiFi reads. By default it produces a primary assembly plus two partially phased haplotype assemblies; adding Hi-C or trio data makes them fully haplotype-resolved.
# Basic hifiasm assembly
hifiasm -o my_assembly -t 32 reads.hifi.fastq.gz
# Convert GFA to FASTA (primary assembly)
awk '/^S/{print ">"$2; print $3}' my_assembly.bp.p_ctg.gfa > primary_contigs.fa
# For Hi-C phasing (chromosome-level):
hifiasm -o my_assembly -t 32 \
--h1 hic_R1.fastq.gz --h2 hic_R2.fastq.gz \
reads.hifi.fastq.gz
| Parameter | Default | When to Change |
|---|---|---|
| -t | 1 | Set to available CPU cores (16–64) |
| -l | 3 | Purge level: 0=none, 3=aggressive. Use 0 for inbred lines. |
| --h1/--h2 | none | Add Hi-C reads for chromosome-scale phasing |
| -s | 0.55 | Similarity threshold for purging. Lower for higher heterozygosity. |
# ONT reads
flye --nano-hq ont_reads.fastq.gz \
    --out-dir flye_output \
    --threads 32 \
    --genome-size 500m
# PacBio CLR reads
flye --pacbio-raw clr_reads.fastq.gz \
    --out-dir flye_clr \
    --threads 32 \
    --genome-size 500m
Use --nano-hq for reads basecalled with Guppy or Dorado in SUP mode (Q20+). Use --nano-raw for fast-mode or older basecalls. Choosing the wrong flag produces a much worse assembly.
# Standard paired-end assembly
spades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
    -o spades_output -t 16 -m 128 --careful
# For metagenomics
metaspades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
    -o metaspades_output -t 16
Long-read assemblies (especially ONT) benefit from short-read polishing to fix base-level errors.
# Step 1: Align short reads to assembly
bwa-mem2 index assembly.fasta
bwa-mem2 mem -t 16 assembly.fasta reads_R1.fq.gz reads_R2.fq.gz \
    | samtools sort -@ 8 -o aligned.bam
samtools index aligned.bam
# Step 2: Run Pilon
java -Xmx64G -jar pilon.jar \
    --genome assembly.fasta \
    --frags aligned.bam \
    --output polished \
    --changes --threads 16
# Step 3: Iterate 2-3 rounds (diminishing returns after that)
# Repeat alignment + pilon on polished.fasta
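The "repeat alignment + Pilon" step is easy to get wrong by hand, because each round must align against the previous round's output. One possible way to script it is sketched below as a dry-run generator: it only prints the commands for each round, so the plan can be reviewed before anything runs. The function names (`polish_round_cmds`, `polish_plan`) and the `polished_roundN` file prefixes are hypothetical; thread and memory settings are copied from the commands above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the commands for N rounds of Pilon polishing.
# Nothing is executed here; pipe the output to bash to actually run it.

# Print the align + polish commands for one round.
polish_round_cmds() {
    local draft=$1 round=$2 r1=$3 r2=$4
    local prefix="polished_round${round}"
    cat <<EOF
bwa-mem2 index ${draft}
bwa-mem2 mem -t 16 ${draft} ${r1} ${r2} | samtools sort -@ 8 -o ${prefix}.bam
samtools index ${prefix}.bam
java -Xmx64G -jar pilon.jar --genome ${draft} --frags ${prefix}.bam --output ${prefix} --changes --threads 16
EOF
}

# Chain the rounds: each round polishes the FASTA written by the previous one.
polish_plan() {
    local draft=$1 r1=$2 r2=$3 rounds=${4:-3}
    local round
    for round in $(seq 1 "$rounds"); do
        polish_round_cmds "$draft" "$round" "$r1" "$r2"
        draft="polished_round${round}.fasta"   # Pilon writes <output>.fasta
    done
}
```

For example, `polish_plan assembly.fasta reads_R1.fq.gz reads_R2.fq.gz 3` prints three rounds of commands, and appending `| bash` runs them. Printing first keeps the chaining of inputs and outputs visible, at the cost of an extra step.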
"Assemble → Polish → Quality-check" — APQ
Always follow this order. Never skip polishing for ONT assemblies. HiFi assemblies may skip polishing (already Q30+).
# Compare multiple assemblies
quast -o quast_results \
    -r reference.fasta \
    -g genes.gff \
    -t 8 \
    hifiasm_contigs.fa flye_contigs.fa spades_scaffolds.fa
# Without reference
quast -o quast_results -t 8 assembly.fasta
| Metric | What It Means | Good Values |
|---|---|---|
| N50 | 50% of assembly in contigs ≥ this size | Higher = better. Chromosome-level: ≥10 Mb |
| L50 | Fewest contigs covering 50% of assembly | Lower = better. Ideal: = chromosome count |
| Genome fraction | % of reference covered | ≥95% for good assembly |
| Misassemblies | Structural errors vs reference | Lower = better. <10 for HiFi |
# Run BUSCO with appropriate lineage
busco -i assembly.fasta \
    -o busco_results \
    -m genome \
    -l embryophyta_odb10 \
    -c 16
# Common lineages:
# embryophyta_odb10 (plants)
# mammalia_odb10 (mammals)
# vertebrata_odb10 (vertebrates)
# insecta_odb10 (insects)
# fungi_odb10 (fungi)
# bacteria_odb10 (bacteria)
| Description | Data type |
|---|---|
| Hi-C scaffolding (recommended, fast) | Hi-C |
| Aiden lab Hi-C scaffolder + Juicebox for manual curation | Hi-C |
| Hi-C scaffolding with mis-join correction | Hi-C |
| Reference-guided scaffolding | Reference |
| Long-read scaffolding without alignment | Long-Read |
| Synteny-based scaffolding | Synteny |
hifiasm loads all reads into memory. For a 500 Mb genome at 30×, expect ~100 GB RAM. Request enough memory in SLURM: #SBATCH --mem=120G.
A fragmented Flye assembly usually means low coverage (<20×). Also check the value passed to --genome-size, since a wrong value misleads the assembler, and try --min-overlap 3000 for noisy reads.
If SPAdes runs out of memory, use -m to set its memory limit (in GB). For large genomes (>1 Gb), SPAdes is not recommended; switch to MEGAHIT or a long-read assembler.
A high BUSCO duplicated (D) score likely indicates un-purged haplotigs. Run purge_dups or use hifiasm -l 3. For polyploids, high D is expected and normal.
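The D value here is the duplicated share of the BUSCO score. As a rough sketch, it can be pulled from the one-line score (of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`) in BUSCO's short summary text file, which makes it easy to compare before and after purging. The function name `busco_duplicated_pct` is hypothetical, and the summary file name varies with the lineage and output name, so check the actual path in your output directory.

```shell
#!/usr/bin/env bash
# Extract the duplicated-BUSCO percentage (D) from a BUSCO short summary file.
# Assumes the file contains the usual one-line score, e.g.
#   C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:1614
busco_duplicated_pct() {
    # Grab the first "D:<number>%" token, then strip the label and percent sign.
    grep -o 'D:[0-9.]*%' "$1" | head -n 1 | tr -d 'D:%'
}
```

Running it on the summaries from two assemblies (for example, before and after purge_dups) shows whether purging actually reduced duplication.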
hifiasm --primary or set appropriate ploidy in Flye
embryophyta_odb10 or eudicots_odb10
mammalia_odb10, insecta_odb10, vertebrata_odb10
--meta works well
bacteria_odb10 or fungi_odb10
| Description | Read type |
|---|---|
| Best for PacBio HiFi. Phased by default. | HiFi |
| Best for ONT. Handles high error rate. | ONT/CLR |
| Best short-read assembler for small genomes. | Illumina |
| Telomere-to-telomere from HiFi+ONT. | T2T |
| Hybrid assembler (long+short reads). | Hybrid |
| Original long-read assembler. Slower than Flye. | Long-Read |
| Ultra-fast but less accurate. | Long-Read |
| Memory-efficient short-read assembler. | Illumina |
| Fast ONT assembler (single machine). | ONT |
| Fast long-read assembler + NextPolish. | Long-Read |
"Nifty Fifty" — N50
Sort contigs largest→smallest. Walk down the list until the running total covers 50% of the total assembly length. That contig's size is the N50. Think of it as a length-weighted median contig size.
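The walk described above is short enough to write out directly. The sketch below computes both N50 and L50 (the number of contigs needed to reach the 50% mark) from a FASTA file using only awk and sort; the function name `n50_l50` is hypothetical, and QUAST remains the tool to use for reported numbers.

```shell
#!/usr/bin/env bash
# Compute N50 and L50 from a FASTA file (multi-line sequences are handled).
n50_l50() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # header: emit previous contig length
        { len += length($0) }                             # sequence line: accumulate length
        END { if (len > 0) print len }                    # emit the last contig
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }                    # lengths arrive largest first
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= total / 2) {               # crossed 50% of assembly length
                    printf "N50=%d L50=%d\n", lens[i], i
                    exit
                }
            }
        }
    '
}
```

For contigs of 8, 6, 4, and 2 bp (20 bp total), the running total reaches 10 bp at the second contig, so `n50_l50` prints `N50=6 L50=2`.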
Coverage: "30-30-30"
30× HiFi = excellent assembly. 30× ONT + 30× Illumina = good hybrid. Below 20× of any technology = expect fragmentation.
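Coverage in this rule of thumb is total sequenced bases divided by genome size. A minimal sketch for checking a read set against the 20× and 30× marks follows; the function name `estimate_coverage` is hypothetical, it assumes a standard four-line-per-record FASTQ, and the genome size (in bp) is whatever estimate you have, for example from GenomeScope.

```shell
#!/usr/bin/env bash
# Estimate sequencing depth: total bases in a FASTQ divided by genome size (bp).
# Assumes 4 lines per FASTQ record; plain or gzipped input.
estimate_coverage() {
    local fastq=$1 genome_size=$2
    local reader="cat"
    case "$fastq" in
        *.gz) reader="gzip -dc" ;;   # decompress gzipped reads on the fly
    esac
    $reader "$fastq" | awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0) }   # line 2 of each record is the sequence
        END { printf "%.1f\n", bases / g }
    '
}
```

For example, `estimate_coverage reads.hifi.fastq.gz 500000000` prints the approximate depth for a 500 Mb genome; a value under 20 suggests the assembly is likely to be fragmented.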
"Better Assemblies Need Quality" — BANQ
BUSCO → Assembly stats → N50/L50 → QUAST. Run all four checks on every assembly.