Genome Assembly: k-mer Analysis, GenomeScope & De Novo Assembly

Estimate genome properties before assembly using k-mer frequency profiles, then perform de novo assembly with SPAdes (bacteria/small genomes) or Canu (large eukaryotic genomes).

~120 min Intermediate Bash
Byte

Overview

Before spending days running a genome assembler, always perform k-mer analysis to understand your genome's properties: estimated size, ploidy level, heterozygosity rate, and repeat content. This informs which assembler and parameters to use.

Which Assembler Should I Use?
  • Prokaryote (small, low complexity): SPAdes — fast, accurate, handles variable coverage
  • Eukaryote, short reads only: MaSuRCA — good for complex genomes
  • Eukaryote, long reads only: Canu or Flye — designed for PacBio/Nanopore
  • Eukaryote, both short + long reads: MaSuRCA hybrid mode
  • Highly heterozygous diploid: Platanus or hifiasm (for HiFi reads)

Ideal Directory Structure

Genome assembly projects accumulate large files quickly. Keep each step isolated so you can re-run or compare assembler outputs independently.

genome_assembly_project/
genome_assembly_project/ ├── 0_raw_reads/ # raw FASTQ reads (read-only!) │ ├── reads_R1.fastq.gz │ └── reads_R2.fastq.gz ├── 1_kmer/ # Jellyfish output + histogram │ ├── reads.jf # binary k-mer hash (large!) │ └── reads.histo # frequency histogram (text, small) ├── 2_genomescope/ # GenomeScope model plots + estimates │ ├── summary.txt # genome size, heterozygosity estimates │ └── plot.png ├── 3_spades/ # SPAdes assembly (prokaryote/small eukaryote) │ ├── scaffolds.fasta # final assembly — use this │ └── contigs.fasta ├── 4_canu/ # Canu assembly (long reads) │ └── assembly.contigs.fasta ├── 5_quast/ # QUAST assembly QC stats └── 6_busco/ # BUSCO completeness assessment └── short_summary.txt # % complete single-copy orthologs
💾
Storage warning: The Jellyfish hash file reads.jf can be 20–60 GB for large genomes — as large as the raw reads themselves! Store it on scratch space and delete it once you have the reads.histo file. The histogram is all GenomeScope needs and is only ~10 KB.

K-mer Theory

A k-mer is a substring of length k in a DNA sequence. For a 21-mer of sequence AATTGGCCGTA..., we slide a window of 21 bases across the entire read dataset and count how many times each unique 21-mer appears.

k-mer sizePossible k-mers (4 bases)Notes
k=1717 billionToo small — many false matches
k=214.4 trillionStandard for genome size estimation
k=314.6 × 1018Used for high-coverage assemblies
k=51~4.5 × 1030Long reads only

The key insight: in a diploid genome sequenced at 30× coverage, each unique genomic position appears ~30 times in the reads. A k-mer frequency histogram shows a prominent peak at ~30×. The genome size = (total k-mers) / (peak coverage).

Byte
Interpreting the k-mer histogram:
  • Peak at low coverage (1–5×): sequencing errors — ignore this
  • First true peak (haploid coverage): heterozygous sites in a diploid
  • Second peak (2× first peak): homozygous sites — the main genome peak
  • Higher peaks (3×, 4×): repetitive elements

Step 1 — Jellyfish k-mer Counting

Bash — Jellyfish count + histogram
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=24
#SBATCH --time=12:00:00
#SBATCH --job-name=jellyfish
#SBATCH --mem=64G

module load jellyfish/2.3.0

# Input: raw genomic reads (can be paired-end or single-end)
# Example: 30x coverage of a ~500 Mb genome
R1="reads_R1.fastq.gz"
R2="reads_R2.fastq.gz"

# Step 1: Count k-mers (k=21)
# -C = canonical k-mers (count both strands together)
# -m = k-mer size
# -s = initial hash size (set to ~genome size; can over-estimate)
# -t = threads
# -o = output file
jellyfish count \
    -C -m 21 \
    -s 1000000000 \
    -t 24 \
    -o reads.jf \
    <(zcat ${R1}) <(zcat ${R2})

# Step 2: Generate frequency histogram
jellyfish histo \
    -t 24 \
    reads.jf \
    > reads.histo

echo "Histogram generated: reads.histo"
echo "First 20 lines of histogram (coverage: count):"
head -20 reads.histo

# reads.histo format:
# 1   12405832    <- 1x coverage: mostly errors
# 2   8234521
# ...
# 30  4981234     <- main genome peak near the sequencing depth
# ...
Download:
Jellyfish flags decoded
🔄-C (canonical)Count Canonical k-mers — treat a k-mer and its reverse complement as the same thing. Since DNA is double-stranded, AATTGG and CCAATT are the same sequence. Without -C, you'd count every k-mer twice.
🔢-m 21K-mer size = 21. Longer k-mers are more specific but require more memory. k=21 is the standard for genome size estimation. k=31 is used for assemblies.
💾-s 1000000000Initial hash size in bytes (1 GB here). Set this to roughly your expected genome size. If it's too small, Jellyfish does multiple passes (slower). It's safe to overestimate.
-t 24Use 24 CPU threads. Jellyfish is one of the most efficiently parallelized tools in bioinformatics — 24 threads is ~20× faster than single-threaded.
🔐<(zcat R1) <(zcat R2)Process substitution — decompress the gzipped FASTQs on-the-fly and feed them to Jellyfish without writing uncompressed files to disk. Saves ~50 GB of disk space.
📊jellyfish histoConverts the binary k-mer hash (.jf) into a two-column text histogram: (coverage, count). This is the tiny file that GenomeScope reads. Once you have it, you can delete the large .jf file.

Step 2 — GenomeScope: Genome Property Estimation

GenomeScope fits a mathematical model to the k-mer histogram to estimate: genome size (haploid), heterozygosity rate, repeat fraction, and sequencing error rate.

Bash — GenomeScope (command line)
module load R   # GenomeScope is an R script

# Download GenomeScope if not installed
# wget https://raw.githubusercontent.com/schatzlab/genomescope/master/genomescope.R

Rscript genomescope.R \
    reads.histo \
    21 \
    150 \
    genomescope_output \
    1000

# Arguments:
# reads.histo     = k-mer histogram from jellyfish histo
# 21              = k-mer size used (must match jellyfish -m)
# 150             = read length (check your FASTQ)
# genomescope_output = output directory
# 1000            = max k-mer coverage to model (exclude high repeats)

# GenomeScope also available as web app: http://genomescope.org
# Upload reads.histo directly for instant results

ls genomescope_output/
Download:
GenomeScope arguments decoded
📊reads.histoThe k-mer frequency histogram from Jellyfish. Two columns: frequency (how many times a k-mer appears) and count (how many k-mers appear that many times).
🔢21 (k-mer size)Must match the -m value used in Jellyfish. GenomeScope uses this in its mathematical model to correct for overlapping k-mers at read boundaries.
📏150 (read length)Your sequencing read length. Check with head -8 reads.fastq.gz | zcat | awk 'NR==2{print length($0)}'. Used to estimate the number of k-mers per read.
📁genomescope_outputOutput directory name. GenomeScope creates: summary.txt (genome estimates), plot.png (k-mer histogram + fitted model), and model.txt (model parameters).
✂️1000 (max coverage)Truncate the histogram at 1000× coverage. Highly repetitive k-mers (transposons, rDNA) form a long tail that confuses the model fitter. Truncating focuses the model on the main genome peak.
🌐
Use the GenomeScope web app! Upload your reads.histo to genomescope.org for an interactive visualization with sliders to adjust the model. Much easier for exploring the parameters before committing to command-line runs.

GenomeScope output files:

GenomeScope Output (example)
GenomeScope version 1.0
k = 21

property                      min               max
Heterozygosity                0.502%            0.519%
Genome Haploid Length         246,401,629 bp    246,580,171 bp
Genome Repeat Length          55,423,551 bp     55,463,789 bp
Genome Unique Length          190,978,078 bp    191,116,382 bp
Model Fit                     97.3%
Read Error Rate               0.233%            0.233%

# Output plots:
# genomescope_output/plot.png  -- k-mer histogram fit
# genomescope_output/model.txt -- numerical results
Use GenomeScope results to set assembly parameters. Heterozygosity >1% = use a diploid-aware assembler or set heterozygosity parameter in Canu. Repeat fraction >50% = expect fragmented assembly; long reads mandatory. Genome size estimate sets expected coverage in Canu (genomeSize=).

Step 3 — Choosing an Assembler

AssemblerInputBest forKey limitation
SPAdesShort reads (Illumina)Bacteria, fungi, small eukaryotesNot recommended for large genomes (>300 Mb)
CanuLong reads (PacBio/ONT)Large complex eukaryotesSlow; needs high coverage (>30×)
FlyeLong readsAll sizes; faster than CanuMay need polishing with short reads
MaSuRCAShort + optional long readsEukaryotes, hybridComplex parameter setup
hifiasmPacBio HiFi (CCS)High-quality diploid assembliesRequires HiFi reads specifically

Step 4 — SPAdes Assembly (Prokaryotes)

Bash — SPAdes de novo assembly
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=24:00:00
#SBATCH --job-name=spades
#SBATCH --mem=128G

module load spades/3.15.4

# Standard paired-end assembly
spades.py \
    -1 reads_R1.fastq.gz \
    -2 reads_R2.fastq.gz \
    -o spades_output/ \
    --threads 16 \
    --memory 128 \
    -k 21,33,55,77,99   # multiple k-mers (SPAdes iterates automatically)

# For highly heterozygous diploid (add --careful only for small genomes):
# spades.py -1 R1.fastq.gz -2 R2.fastq.gz -o spades_out/ --careful

# Outputs:
# spades_output/scaffolds.fasta  -- final scaffolded assembly
# spades_output/contigs.fasta    -- contigs before scaffolding
# spades_output/spades.log       -- full log

echo "Assembly complete:"
grep -c ">" spades_output/scaffolds.fasta
grep "^>" spades_output/scaffolds.fasta | awk -F'_' '{print $4}' \
    | sort -rn | head -20
Download:

Step 5 — Canu Assembly (Eukaryotes / Long Reads)

Bash — Canu long-read assembly
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=32
#SBATCH --time=120:00:00
#SBATCH --job-name=canu
#SBATCH --mem=256G

module load canu/2.2

# genomeSize: use GenomeScope estimate (e.g., 246m = 246 Mb)
# -nanopore or -pacbio: specify read type

canu \
    -p genome_assembly \
    -d canu_output/ \
    genomeSize=246m \
    -nanopore raw_nanopore_reads.fastq.gz \
    maxThreads=32 \
    maxMemory=256

# For PacBio CLR reads:
# canu -p asm -d canu_out/ genomeSize=246m -pacbio reads.fastq.gz

# For PacBio HiFi (CCS) reads — use hifiasm instead:
# hifiasm -o asm.asm -t 32 hifi_reads.fastq.gz
# awk '/^S/{print ">"$2; print $3}' asm.asm.bp.p_ctg.gfa > asm.p_ctg.fa

# Canu key output:
ls canu_output/
# genome_assembly.contigs.fasta       -- primary contigs
# genome_assembly.unassembled.fasta   -- reads that didn't assemble
# genome_assembly.report              -- assembly statistics
Download:

Step 6 — Assembly Quality Assessment

Bash — QUAST (contiguity) + BUSCO (completeness)
module load quast
module load busco/5.4.3

ASSEMBLY="spades_output/scaffolds.fasta"
# or: ASSEMBLY="canu_output/genome_assembly.contigs.fasta"

# QUAST: assembly contiguity statistics
quast.py ${ASSEMBLY} \
    -o quast_output/ \
    --threads 8

# Key QUAST metrics:
# N50: half the genome is in contigs >= this length
# L50: number of contigs needed to cover 50% of genome
# Largest contig, total length, number of contigs

echo "=== QUAST Summary ==="
cat quast_output/report.txt

# BUSCO: gene content completeness
# -l: lineage dataset (check busco --list-datasets)
# Use a lineage appropriate to your organism:
# bacteria_odb10, fungi_odb10, embryophyta_odb10, metazoa_odb10, etc.
busco \
    -i ${ASSEMBLY} \
    -o busco_output \
    -m genome \
    -l embryophyta_odb10 \
    -c 16

echo "=== BUSCO Summary ==="
cat busco_output/short_summary*.txt
Download:
MetricGood assemblyPoor assembly
BUSCO Complete>90%<70%
BUSCO Duplicated<5%>20% (suggests collapse or false duplication)
N50Close to expected chromosome sizeMuch shorter than expected
Total lengthWithin 10% of GenomeScope estimateVery different suggests contamination or collapse

Troubleshooting

GenomeScope: poor model fit (<80%) or no clear peak

Try a different k-mer size (k=17 or k=31). Check that you used canonical k-mers (-C flag in jellyfish). Very low coverage (<15×) will produce a flat histogram with no clear peak — more data is needed. Contamination also flattens the histogram.

SPAdes: out of memory on a large genome

SPAdes is not designed for large genomes. Switch to MaSuRCA or use a HPC node with >256 GB RAM. For very large genomes (>1 Gb), long reads are essentially mandatory for a useful assembly.

BUSCO: <50% complete genes

Check: (1) Did you use the correct lineage database? (2) Is total assembly length close to GenomeScope estimate? Low BUSCO often means the assembly is highly fragmented — improve with long reads, scaffolding (Hi-C or optical mapping), or polishing with Pilon.