We have compiled a list of commonly used software and databases in microbial genomics. If you have additional contributions to this list or notice a mistake, please contact us!
Assembly, Annotation, and Typing
pipelines
Bactopia
Bactopia is a pipeline for QC, de novo assembly, annotation, and typing of bacterial genomes from short or long reads.
snippy
Snippy is a pipeline for mapping sequencing reads to a reference genome and performing variant calling. Snippy will also create a core SNP alignment from multiple samples.
de novo assembly
There are numerous options for de novo assembly software. Below are some of the common tools used in bacterial genomics projects.
Short read assembly
SPAdes
SKESA
Long read assembly
Trycycler
Flye
Hybrid assembly
Unicycler
Reference guided assembly
Reference guided assembly requires choosing an appropriate reference genome, mapping short reads to the reference genome, and calling variants. Below are some popular tools for performing these steps.
Choosing a reference
A closely related, finished reference genome is usually the best choice for mapping and variant calling. For some less diverse species, it is feasible to use a single reference across the species. However, for other species with high nucleotide diversity or large accessory genomes, a reference genome from the same lineage/sequence type/cluster will be most appropriate. When sequencing is performed as part of an experiment, the best choice is always a finished genome of the lab or parental strain used in the experiment.
Mapping
BWA (short reads)
Minimap2 (long reads)
Variant calling
Pilon
freebayes
Annotation
Annotating assemblies
Prokka
bakta
NCBI Prokaryotic Genome Annotation Pipeline
ggCaller
Rather than annotating individual genome assemblies, ggCaller uses pangenome graphs for annotation and clustering.
Annotating variants in a VCF
SnpEff
BCFtools
Typing
The following are general tools for typing bacterial genomes. However, it is often best to use a species-specific database or typing software.
Sequence Typing
PubMLST
PubMLST hosts databases for many species with various MLST schemes as well as whole genome sequencing data and typing. BIGSdb software and the RESTful API can be used to interact with PubMLST databases.
ARIBA
ARIBA can be used for MLST typing directly from reads using PubMLST databases.
Antimicrobial Resistance
ARIBA
ARIBA can be used to identify antimicrobial resistance genes and alleles directly from reads. ARIBA requires a reference dataset. Several generic datasets are included; however, a species-specific reference dataset will produce the most accurate results.
NCBI’s AMRFinderPlus
AMRFinderPlus identifies antimicrobial resistance genes and mutations from assembled genomes. It also detects genes associated with other phenotypes, such as virulence and metal resistance.
Alignment
Short multiple sequence alignment
Many aligners are available for alignment of short nucleotide or amino acid sequences. This software generally does not scale to whole genome alignment.
MAFFT
PRANK
Whole genome alignment
Creating pseudogenomes from VCFs
If you have variant calls from several isolates mapped to the same reference genome, you can incorporate those variant calls into your reference to create pseudogenomes of the same length. These pseudogenomes can be concatenated into the same file to create an alignment. Creation of alignments from VCFs is often done with custom scripts. However, pipelines like snippy also incorporate this step. If you plan to write such a script, please keep the following in mind:
- You should not assume that the reference allele is supported if the VCF includes only variant sites. Missing data from regions with deletions, ambiguous calls, or low coverage should also be incorporated into the pseudogenome rather than left as the reference allele; this often requires specifying that variant callers output all reference sites to a VCF rather than only variant sites.
- You should also filter variants based on coverage and quality in your script.
- It can also be useful to mask regions that are known to be problematic for short read mapping and variant calling such as repetitive regions. For example, PE/PPE genes are often excluded from alignments of Mycobacterium genomes.
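The points above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for snippy or a tested script: the function name `build_pseudogenome`, the thresholds (depth 10, quality 30), and the input format (a dictionary of per-site calls already parsed from an all-sites VCF) are all assumptions made for the example, and indels are not handled.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical per-site call: (called_base, read_depth, call_quality)
SiteCall = Tuple[str, int, float]


def build_pseudogenome(
    reference: str,
    calls: Dict[int, SiteCall],
    min_depth: int = 10,        # illustrative threshold, not a recommendation
    min_qual: float = 30.0,     # illustrative threshold, not a recommendation
    masked: FrozenSet[int] = frozenset(),
) -> str:
    """Return a pseudogenome with the same length as `reference`.

    `calls` maps 1-based positions to calls parsed from an all-sites VCF.
    A position becomes "N" if it has no call, is masked, fails the depth or
    quality filter, or is not a single A/C/G/T base. The reference allele is
    never assumed for a position without a supporting call.
    """
    pseudo = []
    for pos in range(1, len(reference) + 1):
        call = calls.get(pos)
        if call is None or pos in masked:
            pseudo.append("N")
            continue
        base, depth, qual = call
        base = base.upper()
        if depth < min_depth or qual < min_qual or base not in {"A", "C", "G", "T"}:
            pseudo.append("N")
        else:
            pseudo.append(base)
    return "".join(pseudo)


reference = "ACGTACGT"
calls = {
    1: ("A", 30, 60.0),
    2: ("C", 25, 60.0),
    3: ("T", 40, 60.0),  # well-supported SNP (G -> T)
    4: ("T", 3, 60.0),   # low depth
    5: ("A", 30, 10.0),  # low quality
    # position 6 has no call at all
    7: ("G", 30, 60.0),  # falls in a masked (e.g. repetitive) region
    8: ("T", 30, 60.0),
}
print(build_pseudogenome(reference, calls, masked=frozenset({7})))
```

Because every pseudogenome has the same length as the reference, the sequences from several isolates can be written one after another into a single FASTA file and used directly as an alignment.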
Alignment of assemblies
Parsnp
Parsnp creates a rapid core genome alignment from assemblies of closely related isolates.
Mugsy
Mauve
Pan Genome Analysis
Pan genome analysis software uses annotated assemblies to identify core and accessory genes, align sequences in the same orthogroup, and create a core genome alignment.
Panaroo
Roary
Note: Roary is no longer being maintained. We have recently had difficulty installing it via conda.
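To make the core/accessory distinction concrete, here is a small sketch that labels genes from a presence/absence table. It does not reproduce what Panaroo or Roary do internally (those tools also cluster orthologs and align them); the function name `classify_genes` and the 99% cutoff are assumptions for illustration, and thresholds vary between tools and studies.

```python
from typing import Dict, Set


def classify_genes(
    presence: Dict[str, Set[str]],
    n_genomes: int,
    core_fraction: float = 0.99,  # illustrative cutoff; an assumption
) -> Dict[str, str]:
    """Label each gene "core" or "accessory" by the fraction of genomes carrying it.

    `presence` maps a gene (orthogroup) name to the set of genomes it was found in.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    labels = {}
    for gene, genomes in presence.items():
        frequency = len(genomes) / n_genomes
        labels[gene] = "core" if frequency >= core_fraction else "accessory"
    return labels


presence = {
    "gyrA": {"g1", "g2", "g3", "g4"},
    "tetM": {"g2"},
    "blaTEM": {"g1", "g3"},
}
print(classify_genes(presence, n_genomes=4))
```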
Phylogenetic analysis
Rapid tree inference
Phylogenetic trees can be rapidly estimated from core SNP alignments. However, because invariant sites have been removed, branch lengths in phylogenies based on SNP alignments will not be accurate. Some software, like IQ-TREE, provides an option to include counts of invariant sites in the model. See this post from Phil Ashton for further information on this topic.
FastTree
If you are using FastTree for phylogenetic analysis of closely related microbial genomes, be sure to use the double-precision version of the software.
IQ-TREE
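As a sketch of how invariant site counts can be obtained, the example below splits a full alignment into a SNP-only alignment plus counts of constant A, C, G, and T columns. The function name `split_alignment` is hypothetical. The counts are printed in the comma-separated A,C,G,T order that IQ-TREE's `-fconst` option is documented to accept; check the documentation for your version before relying on it.

```python
from collections import Counter
from typing import List, Tuple


def split_alignment(seqs: List[str]) -> Tuple[List[str], Counter]:
    """Return (SNP-only alignment, counts of invariant A/C/G/T columns).

    A column counts as invariant only if every sequence has the same A/C/G/T
    base. A column is kept as a SNP if it has at least two different A/C/G/T
    bases. Columns with one called base plus gaps or Ns are dropped from both.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    snp_columns = []
    const = Counter({base: 0 for base in "ACGT"})
    for column in zip(*seqs):
        bases = {b.upper() for b in column}
        called = bases & set("ACGT")
        if len(bases) == 1 and called:
            const[next(iter(called))] += 1
        elif len(called) > 1:
            snp_columns.append(column)
    snp_alignment = [
        "".join(col[i] for col in snp_columns) for i in range(len(seqs))
    ]
    return snp_alignment, const


seqs = ["ACGTA", "ACGTC", "ACNTA"]
snps, const = split_alignment(seqs)
fconst = ",".join(str(const[base]) for base in "ACGT")
print(snps)    # ['A', 'C', 'A']
print(fconst)  # 1,1,0,1
```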
Recombination-corrected phylogenies
Gubbins
ClonalFrameML
Dated phylogenies
BactDating
Clustering
Alignment based clustering
fastbaps
kmer based clustering
PopPUNK
Typing based clustering
MLST clustering
For some species, sequence types are assigned to clonal complexes, or core genome MLST schemes are used to group isolates based on a fixed threshold of differences in core genome allele assignments.
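Threshold-based grouping of cgMLST profiles can be illustrated with a short sketch. Two choices here are assumptions and not taken from any particular scheme: isolates are joined by single linkage (any pair within the threshold ends up in the same cluster), and loci with a missing allele call (`None`) in either isolate are ignored when counting differences. The function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence

Profile = Sequence[Optional[int]]  # allele numbers per locus; None = missing


def allele_distance(a: Profile, b: Profile) -> int:
    """Count loci where both isolates have a call and the alleles differ."""
    return sum(
        1 for x, y in zip(a, b) if x is not None and y is not None and x != y
    )


def single_linkage_clusters(
    profiles: Dict[str, Profile], threshold: int
) -> List[List[str]]:
    """Group isolates whose allele distance is <= threshold (single linkage)."""
    parent = {name: name for name in profiles}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if allele_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in profiles:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


profiles = {
    "iso1": [1, 1, 1, 1],
    "iso2": [1, 1, 1, 2],
    "iso3": [1, 1, 2, 2],
    "iso4": [5, 6, 7, 8],
    "iso5": [5, 6, 7, None],
}
print(single_linkage_clusters(profiles, threshold=1))
```

Note that with single linkage, iso1 and iso3 share a cluster even though they differ at two loci, because each is within one allele of iso2.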
Genome Wide Association Studies
Input data
unitig-caller
panfeed
Linear Mixed Models
pyseer
Phylogeny-based GWAS
treeWAS
Data analysis and visualization
Python packages
Polars
Browser-based annotated phylogenies
Microreact
iTOL
Phandango
Other tree visualization software
ggtree (R)
Data visualization tips
Databases
Querying Databases
Searching bacterial genomes from ENA
Search >600k bacterial genomes (all sequenced genomes prior to 2019)
Branchwater Metagenome Query
Search metagenomes from NCBI’s SRA
Species specific tools and tips
Neisseria gonorrhoeae
Databases and Tools
PubMLST
Note that the N. gonorrhoeae PubMLST database is combined with other Neisseria species.
Pathogenwatch
Pathogenwatch: Neisseria gonorrhoeae
NG-STAR
NG-STAR is a typing system based on antimicrobial resistance loci in N. gonorrhoeae.
pyngoST
NG-STAR, MLST, and NG-MAST typing can be performed from genome assemblies using pyngoST from Sánchez-Busó et al.
Divergent and mosaic alleles in N. gonorrhoeae
In addition to high rates of recombination within N. gonorrhoeae, gonococcus can also acquire alleles from other Neisseria species. In some cases, these mosaic alleles can be important contributors to antimicrobial resistance. Mosaic alleles are often divergent enough that reads will not map to a reference genome with a gonococcal allele at these loci (e.g. penA, mtr).
There are also two divergent alleles of porB, porB1a and porB1b. Most gonococcal isolates encode porB1b; however, be aware that reads from an isolate with porB1a will not map well to a reference genome with porB1b.
Mosaic and divergent alleles can be assessed using de novo assemblies.
Repetitive and phase variable loci in N. gonorrhoeae
N. gonorrhoeae employs both antigenic and phase variation facilitated by repetitive sequences across the genome (e.g. pilE/pilS and opa genes) and low complexity regions within genes. These sequences can be difficult to resolve depending on the read length and accuracy of the sequencing technology used.
N. gonorrhoeae also encodes four copies of the 23S rRNA locus. These copies cannot be resolved during assembly of short read sequencing data. If it is important to know the sequence of all four copies (e.g. for identification of azithromycin resistance-associated mutations), you can use long read sequencing, or map short reads to a single copy of the 23S rRNA sequence and use allele frequencies to estimate the number of copies carrying a particular mutation.
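The allele frequency approach amounts to simple arithmetic: with four copies, a mutation present in one copy should appear in roughly 25% of reads mapped to a single-copy reference, two copies in roughly 50%, and so on. The sketch below rounds the observed frequency to the nearest copy count. The function name is hypothetical, and the calculation assumes that reads from all four copies map equally well, which should be checked for real data.

```python
def estimate_mutant_copies(alt_reads: int, total_reads: int, n_copies: int = 4) -> int:
    """Estimate how many locus copies carry a mutation from read counts.

    `alt_reads` is the number of reads supporting the mutation and
    `total_reads` the depth at that site after mapping to a single copy.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    allele_frequency = alt_reads / total_reads
    return round(allele_frequency * n_copies)


print(estimate_mutant_copies(52, 100))  # frequency 0.52 -> 2 of 4 copies
print(estimate_mutant_copies(98, 100))  # frequency 0.98 -> 4 of 4 copies
```

Frequencies that fall midway between two expected values (for example around 0.375) are ambiguous and are better resolved with long reads than by rounding.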
Plasmids and mobile elements in N. gonorrhoeae
Most N. gonorrhoeae isolates encode a small cryptic plasmid. Additionally, N. gonorrhoeae can carry plasmids encoding antimicrobial resistance. A small, mobilizable plasmid encoding blaTEM confers high level resistance to penicillin. This plasmid can be difficult to assemble using short read data because of repetitive elements, and contig breaks can occur in the middle of the bla gene. Small plasmids like the cryptic plasmid and blaTEM plasmid can also be lost during Nanopore long read sequencing due to selection for long DNA fragments. Conjugative plasmids, most carrying the tetracycline resistance gene tetM, can also be found in many N. gonorrhoeae lineages.
Most N. gonorrhoeae also encode nine prophages within their genomes, and many additionally carry the Gonococcal Genetic Island, a mobile element integrated into the chromosome which encodes a Type IV Secretion System.