We have compiled a list of commonly used software and databases in microbial genomics. If you have additional contributions to this list or notice a mistake, please contact us!

Assembly, Annotation, and Typing

pipelines

Bactopia

Bactopia is a pipeline for QC, de novo assembly, annotation, and typing of bacterial genomes from short or long reads.

Bactopia Documentation

Petit et al. 2020

snippy

Snippy is a pipeline for mapping sequencing reads to a reference genome and performing variant calling. Snippy will also create a core SNP alignment from multiple samples.

Snippy Documentation

de novo assembly

There are numerous options for de novo assembly software. Below are some of the common tools used in bacterial genomics projects.

Short read assembly

SPAdes

SPAdes Documentation

Prjibelski et al. 2020

SKESA

SKESA Documentation

Souvorov et al. 2018

Long read assembly

Trycycler

Trycycler Documentation

Wick et al. 2021

Flye

Flye Documentation

Kolmogorov et al. 2019

Hybrid assembly

Unicycler

Unicycler Documentation

Wick et al. 2017

Reference guided assembly

Reference guided assembly requires choosing an appropriate reference genome, mapping short reads to the reference genome, and calling variants. Below are some popular tools for performing these steps.
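
As a rough illustration of the mapping and variant calling steps, the sketch below only assembles the command lines for one sample and prints them; it does not run anything. The file names and the `build_commands` helper are hypothetical, and the flags shown (`samtools sort -o`, `freebayes -f` and `-p 1` for haploid calls) should be checked against the documentation of the versions you have installed.

```python
def build_commands(reference, reads_1, reads_2, sample):
    """Return shell commands for one sample, in run order (nothing is executed)."""
    bam = f"{sample}.sorted.bam"   # hypothetical output names
    vcf = f"{sample}.vcf"
    return [
        # Index the chosen reference genome once before mapping.
        f"bwa index {reference}",
        # Map paired short reads and sort the alignments by coordinate.
        f"bwa mem {reference} {reads_1} {reads_2} | samtools sort -o {bam} -",
        # Index the sorted BAM so it can be accessed by region.
        f"samtools index {bam}",
        # Call variants against the same reference, treating the sample as haploid.
        f"freebayes -f {reference} -p 1 {bam} > {vcf}",
    ]


if __name__ == "__main__":
    for command in build_commands("ref.fa", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "s1"):
        print(command)
```

Pipelines such as Snippy wrap these steps, so a hand-built chain like this is mainly useful for understanding what happens at each stage.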

Choosing a reference

A closely related, finished reference genome is usually the best choice for mapping and variant calling. For some less diverse species, it is feasible to use a single reference across the species. However, for other species with high nucleotide diversity or large accessory genomes, a reference genome from the same lineage/sequence type/cluster will be most appropriate. When sequencing is performed as part of an experiment, the best choice is always a finished genome of the lab or parental strain used in the experiment.

Mapping

BWA (short reads)

BWA-MEM Documentation

Li. 2013

Minimap2 (long reads)

Minimap2 Documentation

Li. 2018

Variant calling

Pilon

Pilon Documentation

Walker et al. 2014

freebayes

freebayes Documentation

Garrison and Marth. 2012

Annotation

Annotating assemblies

Prokka

bakta

NCBI Prokaryotic Genome Annotation Pipeline

ggCaller

Rather than annotating individual genome assemblies, ggCaller uses pangenome graphs for annotation and clustering.

ggCaller Documentation

Horsfield et al. 2023

Annotating variants in a VCF

SnpEff

SnpEff Documentation

Cingolani et al. 2012

BCFtools

BCFtools csq Documentation

Danecek and McCarthy. 2017

Typing

The following are general tools for typing bacterial genomes. However, often it is best to use a species-specific database or typing software.

Sequence Typing

PubMLST

PubMLST hosts databases for many species with various MLST schemes as well as whole genome sequencing data and typing. BIGSdb software and the RESTful API can be used to interact with PubMLST databases.

Jolley et al. 2018

ARIBA

ARIBA can be used for MLST typing directly from reads using PubMLST databases.

Hunt et al. 2017

Antimicrobial Resistance

ARIBA

ARIBA can be used to identify antimicrobial resistance genes and alleles directly from reads. ARIBA requires a reference dataset. Several generic datasets are included; however, a species-specific reference dataset will produce the most accurate results.

Hunt et al. 2017

NCBI’s AMRFinderPlus

AMRFinderPlus identifies antimicrobial resistance genes and mutations from assembled genomes. It also detects genes associated with other phenotypes such as virulence and metal resistance.

Feldgarden et al. 2021

Alignment

Short multiple sequence alignment

Many aligners are available for alignment of short nucleotide or amino acid sequences. These tools generally do not scale to whole genome alignment.

MAFFT

MAFFT Documentation

Katoh et al. 2002

PRANK

PRANK Documentation

Löytynoja. 2014

Whole genome alignment

Creating pseudogenomes from VCFs

If you have variant calls from several isolates mapped to the same reference genome, you can incorporate those variant calls into your reference to create pseudogenomes of the same length. These pseudogenomes can be concatenated into the same file to create an alignment. Creation of alignments from VCFs is often done with custom scripts. However, pipelines like snippy also incorporate this step. If you plan to write such a script, please keep the following in mind:

You should not assume that the reference allele is supported if the VCF includes only variant sites. Missing data from regions with deletions, ambiguous calls, or low coverage should also be incorporated into the reference sequence; this often requires specifying that variant callers output all reference sites to a VCF rather than only variant sites.
You should also filter variants based on coverage and quality in your script.
It can also be useful to mask regions that are known to be problematic for short read mapping and variant calling such as repetitive regions. For example, PE/PPE genes are often excluded from alignments of Mycobacterium genomes.
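
The points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a replacement for a tested pipeline: the calls are assumed to have already been parsed from an all-sites VCF into a dictionary, and the function name, thresholds, and input layout are all hypothetical. Indels are left as missing data so that every pseudogenome keeps the reference length.

```python
def build_pseudogenome(reference, calls, min_depth=10, min_qual=30.0, masked=()):
    """Build a pseudogenome the same length as the reference.

    reference: reference sequence as a string.
    calls: dict mapping 1-based position -> (ref, alt, depth, qual), where alt is
           None for a site called as reference (hypothetical pre-parsed VCF data).
    masked: iterable of (start, end) 1-based inclusive regions to mask.
    """
    # Start from all-missing: a site with no call is NOT assumed to be reference.
    seq = ["N"] * len(reference)
    for pos, (ref, alt, depth, qual) in calls.items():
        # Filter on coverage and quality; failing sites stay as N.
        if depth < min_depth or qual < min_qual:
            continue
        # Skip indels so the output length always matches the reference.
        if len(ref) != 1 or (alt is not None and len(alt) != 1):
            continue
        seq[pos - 1] = ref if alt is None else alt
    # Mask regions known to be problematic for short read mapping.
    for start, end in masked:
        for i in range(start - 1, end):
            seq[i] = "N"
    return "".join(seq)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    calls = {
        1: ("A", None, 30, 60.0),  # confident reference call
        2: ("C", "T", 25, 50.0),   # confident SNP
        3: ("G", None, 3, 60.0),   # low depth -> N
        4: ("T", "A", 40, 10.0),   # low quality -> N
        5: ("A", None, 30, 60.0),
        6: ("C", None, 30, 60.0),
        7: ("G", None, 30, 60.0),  # falls in a masked region -> N
        8: ("T", "C", 30, 60.0),   # falls in a masked region -> N
    }
    print(build_pseudogenome(reference, calls, masked=[(7, 8)]))  # ATNNACNNNN
```

Writing one such sequence per isolate to the same FASTA file gives an alignment, because every sequence has the reference length.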

Alignment of assemblies

Parsnp

Parsnp creates a rapid core genome alignment from assemblies of closely related isolates.

Parsnp Documentation

Kille et al. 2024

Treangen et al. 2014

Mugsy

Mugsy Documentation

Angiuoli et al. 2011

Mauve

Mauve Documentation

Darling et al. 2010

Pan Genome Analysis

Pan genome analysis software uses annotated assemblies to identify core and accessory genes, align sequences in the same orthogroup, and create a core genome alignment.
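
To make the core versus accessory distinction concrete, here is a minimal sketch that classifies genes from a presence/absence table. The input layout, the `classify_genes` name, and the 99% core threshold are assumptions chosen for illustration; real tools cluster genes into orthogroups first and may use different cutoffs.

```python
def classify_genes(presence, n_isolates, core_threshold=0.99):
    """Split genes into core and accessory by the fraction of isolates carrying them.

    presence: dict mapping gene name -> set of isolate names carrying that gene.
    n_isolates: total number of isolates in the collection.
    core_threshold: minimum carrier fraction for a gene to count as core (assumed).
    """
    core, accessory = [], []
    for gene, carriers in sorted(presence.items()):
        if len(carriers) / n_isolates >= core_threshold:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory


if __name__ == "__main__":
    # Hypothetical toy data for three isolates.
    presence = {
        "gyrA": {"iso1", "iso2", "iso3"},
        "penA": {"iso1", "iso2", "iso3"},
        "tetM": {"iso1"},
    }
    core, accessory = classify_genes(presence, n_isolates=3)
    print(core)       # ['gyrA', 'penA']
    print(accessory)  # ['tetM']
```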

Panaroo

Panaroo Documentation

Tonkin-Hill et al. 2020

Roary

Note: Roary is no longer being maintained. We have had difficulty installing via conda recently.

Roary Documentation

Page et al. 2015

Phylogenetic analysis

Rapid tree inference

Phylogenetic trees can be rapidly estimated from core SNP alignments. However, because invariant sites have been removed, branch lengths in phylogenies based on SNP-only alignments will not be accurate. Some software, like IQ-TREE, can account for this by including counts of invariant sites in the model. See this post from Phil Ashton for further information on this topic.
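
One way to obtain invariant site counts is to tally the constant columns of the full-length alignment before reducing it to SNPs. The sketch below does this for aligned sequences held in memory and formats the result as the comma-separated A,C,G,T string that IQ-TREE's `-fconst` option expects. The function name and toy alignment are hypothetical, and skipping every column that contains a gap or ambiguity code is a conservative simplification.

```python
from collections import Counter


def count_invariant_sites(sequences):
    """Count alignment columns where every sequence has the same unambiguous base.

    sequences: list of equal-length aligned sequences (strings).
    Returns counts in the order A, C, G, T. Columns containing gaps or
    ambiguity codes are not counted.
    """
    counts = Counter()
    for column in zip(*sequences):
        bases = {base.upper() for base in column}
        if len(bases) == 1:
            base = bases.pop()
            if base in {"A", "C", "G", "T"}:
                counts[base] += 1
    return [counts[base] for base in "ACGT"]


if __name__ == "__main__":
    # Hypothetical toy alignment: one variable column, one all-gap column.
    alignment = ["AACGGT-", "AACGGT-", "ATCGGT-"]
    counts = count_invariant_sites(alignment)
    print(",".join(str(n) for n in counts))  # 1,1,2,1
```

The printed string could then be passed alongside the SNP alignment, for example as `-fconst 1,1,2,1`; check the IQ-TREE documentation for the exact usage in your version.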

FastTree

If you are using FastTree for phylogenetic analysis of closely related microbial genomes, be sure to use the double-precision version of the software.

FastTree Documentation

Price et al. 2010

IQ-TREE

IQ-TREE Documentation

Nguyen et al. 2015

Recombination-corrected phylogenies

Gubbins

Gubbins Documentation

Croucher et al. 2015

ClonalFrameML

ClonalFrameML Documentation

Didelot and Wilson. 2015

Dated phylogenies

BactDating

BactDating Documentation

Didelot et al. 2018

Clustering

Alignment based clustering

fastbaps

fastbaps Documentation

Tonkin-Hill et al. 2019

kmer based clustering

PopPUNK

PopPUNK Documentation

Lees et al. 2019

Typing based clustering

MLST clustering

For some species, sequence types are assigned to clonal complexes or core genome MLST schemes are used to group isolates based on a fixed threshold of differences in core genome allele assignments.
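
The threshold-based grouping can be illustrated with a small sketch. It assumes single-linkage clustering (an isolate joins a cluster if it is within the threshold of any member), which is one common choice but not the only one; the profiles, the threshold, and the function names are hypothetical.

```python
def allelic_distance(profile_a, profile_b):
    """Count loci where both isolates have an allele call and the calls differ.

    Missing calls are represented as None and are ignored.
    """
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )


def cluster_profiles(profiles, threshold):
    """Single-linkage clustering of allele profiles at a fixed allelic threshold."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if allelic_distance(profiles[a], profiles[b]) <= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical five-locus profiles.
    profiles = {
        "iso1": [1, 1, 1, 1, 1],
        "iso2": [1, 1, 1, 1, 2],
        "iso3": [1, 1, 1, 2, 2],
        "iso4": [5, 6, 7, 8, 9],
    }
    print(cluster_profiles(profiles, threshold=1))
    # [['iso1', 'iso2', 'iso3'], ['iso4']]
```

Note that iso1 and iso3 differ at two loci but still share a cluster because both are within one difference of iso2; this chaining is a property of single linkage.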

Genome Wide Association Studies

Input data

unitig-caller

unitig-caller Documentation

panfeed

panfeed Documentation

Sommer et al. 2023

Linear Mixed Models

pyseer

pyseer Documentation

Lees et al. 2018

Phylogeny-based GWAS

treeWAS

treeWAS Documentation

Collins and Didelot. 2018

Data analysis and visualization

Python packages

Polars

Polars website

Browser-based annotated phylogenies

Microreact

Microreact Website

Argimón et al. 2016

iTOL

iTOL Website

Letunic and Bork. 2021

Phandango

Phandango Website

Hadfield et al. 2018

Other tree visualization software

ggtree (R)

ggtree Documentation

Yu. 2020

Data visualization tips

Friends Don’t Let Friends

Databases

Querying Databases

Searching bacterial genomes from ENA

Search >600k bacterial genomes (all sequenced genomes prior to 2019)

Blackwell et al. 2022

Břinda et al. 2023

Branchwater Metagenome Query

Search metagenomes from NCBI’s SRA

Branchwater Metagenome Query

Species specific tools and tips

Neisseria gonorrhoeae

Databases and Tools

PubMLST

Note that the N. gonorrhoeae PubMLST database is combined with other Neisseria species.

PubMLST Neisseria Database

Jolley et al. 2018

Harrison et al. 2020

Pathogenwatch

Pathogenwatch: Neisseria gonorrhoeae

Sánchez-Busó et al. 2021

NG-STAR

NG-STAR is a typing system based on antimicrobial resistance loci in N. gonorrhoeae.

NG-STAR Database

Demczuk et al. 2017

pyngoST

NG-STAR, MLST, and NG-MAST typing can be performed from genome assemblies using pyngoST from Sánchez-Busó et al.

Divergent and mosaic alleles in N. gonorrhoeae

In addition to high rates of recombination within N. gonorrhoeae, gonococcus can also acquire alleles from other Neisseria species. In some cases, these mosaic alleles can be important contributors to antimicrobial resistance. Mosaic alleles are often divergent enough that reads will not map to a reference genome with a gonococcal allele at these loci (e.g. penA, mtr).

There are also two divergent alleles of porB, porB1a and porB1b. Most gonococcal isolates encode porB1b; however, be aware that reads from an isolate with porB1a will not map well to a reference genome with porB1b.

Mosaic and divergent alleles can be assessed using de novo assemblies.

Repetitive and phase variable loci in N. gonorrhoeae

N. gonorrhoeae employs both antigenic and phase variation facilitated by repetitive sequences across the genome (e.g. pilE/pilS and opa genes) and low complexity regions within genes. These sequences can be difficult to resolve depending on the read length and accuracy of the sequencing technology used.

N. gonorrhoeae also encodes four copies of the 23S rRNA locus. These copies cannot be resolved during assembly of short read sequencing data. If it is important to know the sequence of all four copies (e.g. for identification of azithromycin resistance associated mutations), you can use long read sequencing, or map short reads to a single copy of the 23S rRNA sequence and use allele frequencies to estimate how many copies carry a particular mutation.
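
The allele frequency approach amounts to simple arithmetic: with four copies, each mutated copy should contribute roughly 25% of the reads at that position. A minimal sketch, assuming read counts have already been extracted from the mapping (the function name and the example counts are hypothetical):

```python
def estimate_mutant_copies(alt_reads, total_reads, n_copies=4):
    """Estimate how many 23S rRNA copies carry a mutation from read counts.

    alt_reads: reads supporting the mutation at the position.
    total_reads: all reads covering the position.
    n_copies: number of copies of the locus in the genome (four in N. gonorrhoeae).
    Returns (allele_frequency, estimated_copies).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    frequency = alt_reads / total_reads
    # Each mutated copy contributes about 1 / n_copies of the reads.
    return frequency, round(frequency * n_copies)


if __name__ == "__main__":
    frequency, copies = estimate_mutant_copies(alt_reads=148, total_reads=200)
    print(f"{frequency:.2f} -> {copies} of 4 copies")  # 0.74 -> 3 of 4 copies
```

Frequencies that fall midway between two expected values (for example near 0.375) are ambiguous and deserve a closer look at coverage and mapping quality rather than a rounded answer.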

Plasmids and mobile elements in N. gonorrhoeae

Most N. gonorrhoeae isolates encode a small cryptic plasmid. Additionally, N. gonorrhoeae can carry plasmids encoding antimicrobial resistance. A small, mobilizable plasmid encoding blaTEM confers high level resistance to penicillin. This plasmid can be difficult to assemble using short read data because of repetitive elements, and contig breaks can occur in the middle of the bla gene. Small plasmids like the cryptic plasmid and blaTEM plasmid can also be lost during Nanopore long read sequencing due to selection for long DNA fragments. Conjugative plasmids, most carrying the tetracycline resistance gene tetM, can also be found in many N. gonorrhoeae lineages.

Most N. gonorrhoeae isolates also encode nine prophages within their genomes, and many additionally carry the Gonococcal Genetic Island, a mobile element integrated into the chromosome which encodes a Type IV Secretion System.
