Bioinformatics 101

Everything you need to know before diving into bioinformatics analysis — Linux essentials, file formats, environment management, HPC basics, and the full tool landscape.

~50 min · Beginner · Bash / Linux

1 Linux / Bash Essentials

Nearly all bioinformatics tools run on Linux. If you're coming from Windows or macOS, you'll live in the terminal. Here are the commands you'll use daily:

Navigation & Files
Bash
# Where am I?
pwd

# List files (long format, human-readable sizes)
ls -lh

# Move around
cd /path/to/directory
cd ..           # go up one level
cd ~            # go to home directory

# Create directories and files
mkdir -p project/data project/results project/scripts
touch README.md

# Copy, move, remove
cp file.txt backup.txt
mv file.txt new_name.txt
rm unwanted_file.txt         # CAREFUL: no recycle bin!
rm -r unwanted_directory/    # remove directory recursively
Text Processing Power Tools
Bash
# View files
head -20 file.txt       # first 20 lines
tail -20 file.txt       # last 20 lines
less file.txt           # scroll through (press q to quit)
cat file.txt            # print entire file

# Count lines, words, characters
wc -l file.txt          # line count (critical for checking FASTQ!)

# Search inside files
grep "gene_name" results.txt            # find lines containing text
grep -c ">" sequences.fasta             # count FASTA sequences
grep -v "^#" file.vcf | head            # skip comment lines

# Column extraction and manipulation
cut -f1,3 file.tsv                      # extract columns 1 and 3
sort -k2,2n file.tsv                    # sort by column 2 numerically
awk '{print $1, $4-$3}' genes.bed       # compute column differences
sed 's/old/new/g' file.txt              # find-and-replace

# Pipes: chain commands together
zcat file.fastq.gz | head -8                 # peek at a gzipped file (first 2 reads)
# Count FASTQ reads: don't use grep -c "^@" (quality lines can start with @);
# divide the line count by 4 instead
echo $(( $(wc -l < reads.fastq) / 4 ))
Byte key point
The 5 Commands You'll Use Most: grep, awk, sort, cut, and wc. Master these and you can parse any bioinformatics output file without writing a full script.
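To see the five working together, here's a toy pipeline on a made-up three-column TSV (the file and values are invented for illustration):

```shell
# Build a tiny hypothetical results table: gene, chromosome, expression
printf 'geneA\tchr1\t5.2\ngeneB\tchr2\t9.1\ngeneC\tchr1\t1.3\n' > demo.tsv

# wc: how many rows?
wc -l demo.tsv

# grep: how many rows mention chr1?
grep -c 'chr1' demo.tsv                      # prints: 2

# cut + sort: gene and expression, highest first
cut -f1,3 demo.tsv | sort -k2,2nr

# awk: genes with expression above 2
awk -F'\t' '$3 > 2 {print $1}' demo.tsv
```

The same chains work unchanged on million-line files; that's the point of mastering these five.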
Permissions & SSH
Bash
# Make a script executable
chmod +x my_script.sh

# Connect to a remote server / HPC cluster
ssh username@cluster.university.edu

# Transfer files to/from server
scp local_file.txt username@cluster:~/data/
scp username@cluster:~/results/output.csv ./

# Transfer entire directories
rsync -avz local_dir/ username@cluster:~/project/
Byte tip
Use screen or tmux when running long jobs via SSH. If your connection drops, the job keeps running. Start with screen -S myanalysis, run your command, then detach with Ctrl+A, D. Reattach with screen -r myanalysis.

2 Bioinformatics File Formats

Knowing your file formats is non-negotiable. Here are the formats you'll meet most often:

Format | What It Stores | Extensions | Key Commands
FASTA | Reference sequences (genome, protein) | .fa, .fasta, .fna | grep -c ">" ref.fa
FASTQ | Raw sequencing reads + quality scores | .fq, .fastq, .fq.gz | count lines with zcat + wc -l, divide by 4
SAM/BAM | Aligned reads (text/binary) | .sam, .bam | samtools view / flagstat / sort / index
CRAM | Highly compressed BAM | .cram | samtools view (needs reference)
VCF/BCF | Variant calls (SNPs, indels) | .vcf, .vcf.gz, .bcf | bcftools view / stats / query
BED | Genomic intervals (chr, start, end) | .bed | bedtools intersect / merge / sort
GFF3/GTF | Gene annotations (coordinates, features) | .gff3, .gtf | awk '$3=="gene"' file.gff3
WIG/bigWig | Continuous coverage signal | .wig, .bw | view in IGV / UCSC Genome Browser
BIOM | Microbiome OTU/ASV tables | .biom | biom summarize-table
h5ad/Loom | Single-cell count matrices | .h5ad, .loom | Python: scanpy.read_h5ad()
FASTQ Format Anatomy
Example: one FASTQ read (4 lines)

@SRR12345678.1 HWI-ST1234:100:AB01:1:1101:1234:2045 length=150
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG...
+
IIIIIIIIIIIIIIIIIIIIIHHHHHHHHHGGGGGFFFFEEEDDDCCCCBBBBAAAA####...

Line 1: read identifier (starts with @)
Line 2: DNA sequence
Line 3: separator (+)
Line 4: quality scores (ASCII-encoded Phred); I = Q40 (99.99% accurate), # = Q2 (37% accurate)
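You can poke at this structure with nothing but coreutils; a sketch on a toy FASTQ invented here, counting reads and decoding quality characters:

```shell
# Toy FASTQ with 2 reads, mirroring the 4-line record structure
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n##II\n' > toy.fastq

# Read count = line count / 4
echo "$(( $(wc -l < toy.fastq) / 4 )) reads"          # prints: 2 reads

# Phred score of a quality character = its ASCII value minus 33
echo "$(( $(printf '%d' "'I") - 33 ))"                # prints: 40
echo "$(( $(printf '%d' "'#") - 33 ))"                # prints: 2
```

The `printf '%d' "'X"` trick (ASCII value of X) is a bash-ism worth remembering when sanity-checking quality encodings.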
SAM Format Anatomy
Key SAM fields:

QNAME  FLAG  RNAME  POS  MAPQ  CIGAR  RNEXT  PNEXT  TLEN  SEQ      QUAL
read1  99    chr1   100  60    150M   =      350    400   ATCG...  IIII...

FLAG 99 = paired, mapped, mate mapped, first in pair, on forward strand
MAPQ 60 = very high confidence mapping
CIGAR 150M = 150 bases aligned end-to-end (M means aligned, not necessarily identical)
Byte explains
SAM Flags Cheat Sheet: 0x1 = paired | 0x4 = unmapped | 0x10 = reverse strand | 0x100 = secondary alignment | 0x400 = duplicate | 0x800 = supplementary. Use Picard's flag explainer to decode any flag.
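A flag is just a bitmask, so you can also decode it with shell arithmetic; a minimal sketch using the FLAG 99 from the example above:

```shell
flag=99   # FLAG from the SAM example: 1 + 2 + 32 + 64

# Test each bit with bitwise AND; only the set bits print
for entry in "1:paired" "4:unmapped" "16:reverse strand" "64:first in pair"; do
  bit=${entry%%:*}
  label=${entry#*:}
  if (( flag & bit )); then
    echo "$label"
  fi
done
# prints: paired, first in pair
```

Handy when a web flag explainer isn't at hand on a cluster node.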

3 Conda & Environment Management

Conda is the package manager for bioinformatics. It handles dependencies, versioning, and isolation so your tools don't conflict.

Bash
# Install Miniforge (recommended over full Anaconda)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
# Restart shell, then:

# Create an environment for RNA-Seq analysis
conda create -n rnaseq -c bioconda -c conda-forge \
  fastqc fastp star subread samtools multiqc

# Activate / deactivate
conda activate rnaseq
conda deactivate

# List all environments
conda env list

# Export environment for reproducibility
conda env export > rnaseq_env.yml

# Recreate on another machine
conda env create -f rnaseq_env.yml
Best Practices for Conda
  • One environment per project or workflow — don't install everything in base
  • Always specify channels: -c bioconda -c conda-forge
  • Use Mamba for faster solving: conda install -n base mamba then mamba create ...
  • Pin versions in your YAML for reproducibility: samtools=1.19
  • Export your env and include the YAML in your paper's supplementary
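Putting the pinning advice into practice, a pinned environment file (the package set and version numbers here are illustrative, not prescriptive) could look like:

```yaml
# rnaseq_env.yml: a hypothetical pinned environment
name: rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - samtools=1.19
  - star=2.7.11a
  - fastqc=0.12.1
  - multiqc=1.19
```

Recreate it anywhere with conda env create -f rnaseq_env.yml.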
Byte warns
Common Conda Headaches:
  • Solving environment: failed — conflicting package versions. Try creating a fresh env with fewer packages, or use mamba.
  • conda activate doesn't work — run conda init bash and restart your shell.
  • Disk space: conda caches downloads. Clean with conda clean --all.

4 HPC & SLURM Job Submission

Most bioinformatics analyses require more resources than a laptop. High-Performance Computing (HPC) clusters use job schedulers (usually SLURM) to manage shared resources.

Bash
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --partition=normal
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@university.edu

# Activate your conda environment ("source activate" is deprecated;
# batch shells often need conda.sh sourced before conda activate works)
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate rnaseq

# Your analysis command
STAR --runThreadN 8 \
  --genomeDir /path/to/genome_index \
  --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
  --readFilesCommand zcat \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix results/sample_

echo "Job finished at: $(date)"
SLURM Directive | What It Controls | How to Choose
--cpus-per-task | CPU cores | Match the -t or --threads flag of your tool; don't request more than you use
--mem | RAM | Check tool docs. Alignment: ~32G; GATK: ~16G; WGCNA: depends on gene count
--time | Wall-time limit | Overestimate by ~50%; the job is killed if it exceeds this
--partition | Queue/partition | Ask your HPC admins. Common: short, normal, long, gpu
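One way to keep --cpus-per-task and your tool's thread flag in sync is to read the value SLURM exports instead of hard-coding it (a sketch; the fallback value of 2 is arbitrary):

```shell
# SLURM sets SLURM_CPUS_PER_TASK inside a job; fall back to 2 elsewhere
THREADS="${SLURM_CPUS_PER_TASK:-2}"
echo "using $THREADS threads"
# then pass it along, e.g.: STAR --runThreadN "$THREADS" ...
```

Change the directive once and every tool in the script follows.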
Essential SLURM Commands
Bash
# Submit a job
sbatch my_script.sh

# Check your queued/running jobs
squeue -u $USER

# Cancel a job
scancel 12345678

# Check job efficiency after completion
seff 12345678

# Interactive session (for testing)
srun --cpus-per-task=4 --mem=16G --time=2:00:00 --pty bash
Byte tip
Pro Workflow: Write a SLURM script for each step. Test with a small subset first (head -40000 reads.fq > test.fq = 10,000 reads). Once the test works, scale up. Use seff on completed test jobs to right-size your resource requests for the full run.
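A subset test might look like this (toy reads generated inline; with real data you'd head your actual FASTQ):

```shell
# Simulate a FASTQ of 5 reads, then subset the first 2 for a dry run
for i in 1 2 3 4 5; do
  printf '@read%s\nACGTACGTAC\n+\nIIIIIIIIII\n' "$i"
done > reads.fq

head -8 reads.fq > test.fq          # 8 lines = 2 complete reads
echo "$(( $(wc -l < test.fq) / 4 )) reads in test set"
```

Always head in multiples of 4 lines so you never truncate a record mid-read.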

5 R & RStudio Quick Start

R is the primary language for statistical analysis in bioinformatics (DESeq2, Seurat, WGCNA, ggplot2 all run in R).

R
# Install Bioconductor (the bioinformatics R package repository)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install a package from Bioconductor
BiocManager::install("DESeq2")

# Install from CRAN
install.packages("tidyverse")

# Load packages
library(DESeq2)
library(tidyverse)

# Read data
df <- read.csv("my_data.csv")
df <- read.delim("my_data.tsv")   # tab-separated

# Basic data exploration
dim(df)       # rows x columns
head(df)      # first 6 rows
str(df)       # structure / types
summary(df)   # statistical summary

# The tidyverse way
df %>%
  filter(padj < 0.05) %>%
  arrange(desc(abs(log2FoldChange))) %>%
  head(20)

6 Python for Bioinformatics

Python is dominant for sequence manipulation (Biopython), single-cell analysis (Scanpy), and machine learning. Key packages:

  • Biopython (Sequences): parse FASTA/FASTQ, BLAST, GenBank records
  • Scanpy (Single-Cell): Python alternative to Seurat for scRNA-Seq
  • pandas (Data): DataFrame manipulation (like R's dplyr)
  • matplotlib / seaborn (Visualization): plotting (like ggplot2)
  • pysam (Genomics): read BAM/SAM files from Python
  • scikit-learn (ML): ML classifiers, clustering, dimensionality reduction
Python
# Parse a FASTA file with Biopython
from Bio import SeqIO

for record in SeqIO.parse("genome.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")

# Read a count matrix with pandas
import pandas as pd
counts = pd.read_csv("counts.tsv", sep="\t", index_col=0)
print(counts.shape)
print(counts.head())

7 Key Biological Databases

Database | What It Contains | URL
NCBI GenBank | All publicly available nucleotide sequences | ncbi.nlm.nih.gov
NCBI SRA | Raw sequencing data (FASTQ) | ncbi.nlm.nih.gov/sra
Ensembl | Genome assemblies, gene annotations (GTF/GFF) | ensembl.org
UCSC Genome Browser | Interactive genome visualization, track hubs | genome.ucsc.edu
UniProt | Protein sequences and functional annotations | uniprot.org
Gene Ontology (GO) | Functional categories for gene annotation | geneontology.org
KEGG | Metabolic and signaling pathway maps | genome.jp/kegg
dbSNP | Known genetic variants (SNPs, indels) | ncbi.nlm.nih.gov/snp
GEO | Published expression datasets (microarray, RNA-Seq) | ncbi.nlm.nih.gov/geo
Phytozome | Plant genomes and gene families | phytozome.jgi.doe.gov
FlyBase / WormBase / TAIR | Model organism databases | flybase.org / wormbase.org / arabidopsis.org
ClinVar | Clinical significance of human variants | ncbi.nlm.nih.gov/clinvar
Byte tip
Downloading Data from SRA: Use fasterq-dump (from SRA Toolkit) or prefetch + fasterq-dump. Example: fasterq-dump --split-3 SRR12345678 -e 8. The --split-3 flag separates paired-end reads into R1 and R2 files.

8 The Complete Tool Landscape

Every major bioinformatics tool organized by analysis type.

QC & Trimming
  • FastQC (QC): visual QC report for FASTQ files
  • MultiQC (QC): aggregate reports from multiple QC tools
  • fastp (Trimming): ultra-fast trimmer + QC in one tool
  • Trimmomatic (Trimming): Java-based flexible trimmer
  • Cutadapt (Trimming): adapter-removal specialist
  • BBTools / BBDuk (Multi): Swiss-army knife for read processing
Read Alignment
  • BWA-MEM2 (DNA): short-read DNA aligner (WGS, WES)
  • STAR (RNA): splice-aware RNA-Seq aligner (fast)
  • HISAT2 (RNA): splice-aware aligner (memory-efficient)
  • Bowtie2 (DNA): fast short-read aligner (ChIP-Seq, small genomes)
  • minimap2 (Long-Read): long-read aligner (ONT, PacBio)
  • Salmon (RNA): quasi-mapping RNA-Seq quantifier (no BAM needed)
  • kallisto (RNA): pseudo-alignment RNA-Seq quantifier
  • CellRanger (scRNA): 10X Genomics scRNA-Seq pipeline (alignment + counting)
Differential Expression
  • DESeq2 (R): gold standard for bulk RNA-Seq DE
  • edgeR (R): fast, flexible DE analysis
  • limma-voom (R): DE for large datasets, microarray background
  • sleuth (R): DE from kallisto transcript quantification
  • MAST (scRNA): DE for single-cell data (hurdle model)
  • PyDESeq2 (Python): Python implementation of DESeq2
Single-Cell Analysis
  • Seurat v5 (R): most popular scRNA-Seq toolkit
  • Scanpy (Python): Python scRNA-Seq framework (AnnData)
  • scvi-tools (Python): deep learning for single-cell (integration, DE)
  • CellTypist (Annotation): automated cell type annotation
  • SingleR (R): reference-based cell annotation
  • Monocle3 (R): trajectory / pseudotime analysis
  • CellChat (R): cell-cell communication inference
  • InferCNV (R): copy number from scRNA-Seq (cancer)
Variant Calling & Genomics
  • GATK4 (Variants): gold standard germline/somatic variant calling
  • bcftools (Variants): lightweight variant caller + VCF manipulation
  • DeepVariant (AI): Google's deep-learning variant caller
  • Strelka2 (Variants): fast germline + somatic calling
  • Mutect2 (Cancer): somatic mutation calling (tumor-normal)
  • SnpEff / VEP (Annotation): variant annotation (impact, gene, protein change)
  • ANNOVAR (Annotation): multi-database variant annotation
  • PLINK (Population): GWAS, population genetics, linkage
Genome Assembly
  • hifiasm (Long-Read): best for PacBio HiFi assembly
  • Flye (Long-Read): ONT/PacBio long-read assembler
  • SPAdes (Short-Read): short-read (Illumina) assembler, also metagenomics
  • QUAST (QC): assembly quality assessment (N50, BUSCO)
  • BUSCO (QC): gene-completeness assessment
  • Pilon (Polish): assembly polishing with short reads
  • Merqury (QC): reference-free assembly QC with k-mers
  • GenomeScope2 (Estimation): estimate genome size and heterozygosity from k-mers
Metagenomics & Microbiome
  • QIIME2 (Amplicon): complete 16S/ITS amplicon pipeline
  • Kraken2 / Bracken (Shotgun): fast k-mer taxonomic classification
  • MetaPhlAn4 (Shotgun): marker-gene-based profiling
  • HUMAnN3 (Function): functional pathway profiling
  • DADA2 (Amplicon): ASV denoising (inside QIIME2 or standalone R)
  • MEGAHIT / metaSPAdes (Assembly): metagenomic assembly
  • CheckM2 (QC): MAG quality assessment
  • Prokka / Bakta (Annotation): prokaryotic genome annotation
Visualization
  • ggplot2 (R): grammar of graphics
  • ComplexHeatmap (R): advanced heatmaps with annotations
  • plotly (Interactive): interactive charts (R + Python)
  • IGV (Desktop): Integrative Genomics Viewer (BAM, VCF, BED)
  • Cytoscape (Networks): network visualization and analysis
  • Circos / BioCircos (Genomics): circular genome plots
  • EnhancedVolcano (R): publication-ready volcano plots
  • Krona (Microbiome): interactive taxonomic composition charts
Workflow Managers & Pipelines
  • Nextflow (Workflow): DSL2 workflow language (most popular in bio)
  • nf-core (Pipeline): community Nextflow pipelines (rnaseq, sarek, etc.)
  • Snakemake (Workflow): Python-based workflow manager
  • Galaxy (GUI): web-based GUI for bioinformatics (no coding needed)
  • WDL / Cromwell (Workflow): Broad Institute workflow definition (GATK pipelines)

9 Using AI to Accelerate Your Bioinformatics

AI assistants (ChatGPT, Claude, Copilot) are game-changers for bioinformatics. Here's how to use them effectively:

Example: Adapt a Script to Your Data

Copy this template prompt and fill in the bracketed sections:

I have an RNA-Seq count matrix with [NUMBER] genes and [NUMBER] samples.
My conditions are: [LIST YOUR CONDITIONS, e.g., "3 Control, 3 Drought-Stressed"].
My organism is [ORGANISM, e.g., "Arabidopsis thaliana"].
My count file is tab-separated with gene IDs as row names and sample names as column headers.

Please write a complete DESeq2 R script that:
1. Loads the count matrix from "counts.txt"
2. Creates the sample metadata with my conditions
3. Runs DESeq2 with the appropriate design formula
4. Extracts results with padj < 0.05 and |log2FC| > 1
5. Creates a volcano plot and PCA plot
6. Saves results to a CSV file
7. Runs GO enrichment using clusterProfiler with the correct OrgDb for my organism

Please use the [ORGANISM OrgDb package, e.g., org.At.tair.db] for annotation.
Byte key point
Tips for Better AI Prompts:
  • Be specific: Include your organism, sample count, file format, and exact column names
  • Provide context: "I'm running this on an HPC with SLURM" vs. "I'm on my laptop"
  • Paste error messages: Include the full error text when debugging
  • Ask for explanations: "Explain each parameter choice" gets better code than "Write a script"
  • Iterate: Ask follow-ups like "Now add batch correction" or "Convert this to a SLURM script"

Mnemonics & Thumb Rules

Memorize these and you'll never feel lost in a bioinformatics pipeline.

Mnemonic

"Quality → Trim → Align → Count → Diff" — QTACD

The universal RNA-Seq workflow: QC reads → Trim adapters → Align to genome → Count features → Differential expression. Every bulk-seq analysis follows this pattern.

Mnemonic

"FASTQ has Quality, FASTA does Not"

FASTQ = sequences + quality scores (raw reads). FASTA = sequences only (reference genomes, proteins). The "Q" in FASTQ stands for Quality.

Mnemonic

"SAM is Human, BAM is Binary"

SAM = text (human-readable alignment). BAM = compressed binary (machine-efficient). CRAM = even smaller. Always store BAM, never SAM.

Mnemonic

"Conda Create, Conda Activate" — CCA

conda create -n myenv → conda activate myenv → install tools → do work → conda deactivate. One environment per project. Never install into base.

Thumb Rule — SLURM Resources: Start with 8 CPUs, 32 GB RAM, 24 hours. Adjust after your first run based on seff <jobid> output. Over-requesting wastes queue priority; under-requesting kills your job.
Thumb Rule — File Sizes: Raw FASTQ: ~1 GB per 3M reads. BAM: ~50% of FASTQ. VCF: tiny (MBs). Always compress with gzip. Delete intermediate files after QC.
Mnemonic

"Grep Awk Sed — the GAS that powers bioinformatics"

grep = search text. awk = extract columns. sed = find & replace. Master these three and you can wrangle any TSV, GFF, or BED file.
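A quick demonstration of the chain on a toy GFF3 snippet (data invented for illustration):

```shell
# Toy GFF3: a header line, one gene, one exon
printf '##gff-version 3\nchr1\t.\tgene\t100\t500\t.\t+\t.\tID=geneA\nchr1\t.\texon\t100\t200\t.\t+\t.\tParent=geneA\n' > toy.gff3

# grep drops header lines, awk keeps gene rows and computes the length,
# sed strips the "chr" prefix
grep -v '^#' toy.gff3 \
  | awk -F'\t' '$3 == "gene" {print $1, $5 - $4 + 1}' \
  | sed 's/^chr//'
# prints: 1 401   (GFF coordinates are 1-based and inclusive, hence the +1)
```

Swap in your own file name and feature type and the same chain answers most ad-hoc annotation questions.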

Mnemonic

"Padj, Not Pvalue" — PNP

Always filter by adjusted p-value (padj / FDR / q-value), never the raw p-value. Testing 20,000 genes at p < 0.05 yields ~1,000 false positives by chance alone; padj < 0.05 controls the false discovery rate.
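The arithmetic behind the warning is easy to check (20,000 genes is the typical figure for a mammalian or plant transcriptome):

```shell
# Expected false positives when testing 20,000 null genes at alpha = 0.05
genes=20000
echo $(( genes * 5 / 100 ))   # prints: 1000
# This is why DESeq2 and friends report padj (Benjamini-Hochberg) next to pvalue
```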

Thumb Rule — Replicates: Minimum 3 biological replicates per condition for differential analysis. 2 replicates = underpowered. Technical replicates are NOT substitutes for biological replicates.

Summary

Byte
You're Ready! You now know:
  1. Essential Linux/Bash commands for file manipulation
  2. Every major bioinformatics file format and how to inspect them
  3. Conda environment management for reproducible analyses
  4. SLURM/HPC job submission with proper resource requests
  5. R and Python basics for bioinformatics
  6. Key biological databases and how to download data
  7. The complete landscape of bioinformatics tools
  8. How to use AI effectively for scripting and debugging

Now pick a tutorial and dive in! We recommend starting with RNA-Seq Differential Expression or Data Visualization.