Bioinformatics Analysis of 16S rRNA Amplicon Sequencing
Summary: This article provides a brief introduction to good practices for the bioinformatics analysis of 16S rRNA sequencing by NGS (next-generation sequencing). The bioinformatics pipeline involves two main stages: the preprocessing of data (quality control) and quantification (including taxonomic profiling and predictive metagenomics profiling).
- Author Name: Dianna Gellar
Preprocessing to eliminate uninformative data
Removing adapters, PCR primers, and low-quality bases is a necessary quality-control step, and a variety of integrated tools have been developed for this purpose. ‘Q’ is the Phred quality score reported by Illumina platforms: Q10 means that 1 error is expected in every 10 bases, Q20 means that 1 error is expected in every 100 bases, and so on. Eliminating sequences with low quality scores improves the accuracy of downstream bioinformatics analyses. This matters more for 16S rRNA amplicon sequencing than for shotgun sequencing. For 16S rRNA gene sequencing, it is therefore advisable to set the quality threshold as high as possible and to trim sequences along their entire length.
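The relationship between Q and the expected error rate can be made concrete with a short sketch. It assumes the common Phred+33 FASTQ encoding and uses a simple 3' end trimming rule; the function names and the example read are hypothetical and are not taken from any particular tool.

```python
def phred_to_error_prob(q):
    """Expected per-base error probability for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_low_quality_tail(sequence, quality_string, min_q=20, offset=33):
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    quals = decode_qualities(quality_string, offset)
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: 'I' encodes Q40, '#' encodes Q2, '!' encodes Q0.
seq = "ACGTACGTAC"
qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_low_quality_tail(seq, qual, min_q=20)

print(phred_to_error_prob(10))  # Q10: about 1 error in 10 bases
print(phred_to_error_prob(20))  # Q20: about 1 error in 100 bases
print(trimmed_seq)
```

Dedicated trimming tools usually apply more elaborate rules (for example sliding windows), so this should be read only as an illustration of the threshold idea.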
Taxonomical classification of bacterial sequences
Prior to taxonomic classification, bacterial 16S rRNA gene sequences are clustered using one of two main approaches. One is to cluster the sequences into phylotypes based on their similarity to sequences in reference databases; the other is to cluster them into operational taxonomic units (OTUs) based solely on their similarity to one another, using a 97% similarity threshold. The available reference databases for annotation of the 16S rRNA gene include GreenGenes, the Ribosomal Database Project (RDP), SILVA, and the Human Microbiome Project (HMP).
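To illustrate the second approach, here is a toy sketch of greedy clustering at a 97% threshold. It assumes the sequences are already aligned to the same length and measures identity as the fraction of matching positions; both the functions and the example sequences are hypothetical, and real pipelines rely on dedicated clustering software with proper alignment.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_otu_clustering(sequences, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold,
    otherwise start a new cluster with that sequence as its centroid."""
    centroids = []
    clusters = []
    for seq in sequences:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                clusters[i].append(seq)
                break
        else:
            centroids.append(seq)
            clusters.append([seq])
    return clusters


# Hypothetical 100-base sequences.
ref = "ACGT" * 25
near = "T" + ref[1:]          # 1 mismatch, 99% identity to ref
far = "TGCATGCA" + ref[8:]    # 8 mismatches, 92% identity to ref

clusters = greedy_otu_clustering([ref, near, far])
print(len(clusters))
```

With this input, `ref` and `near` fall into one OTU while `far` starts its own. Note that greedy clustering depends on input order, which is one reason production tools sort sequences (for example by abundance) before clustering.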
Beta (β) diversity to compare microbiomes
Beta (β) diversity measures the difference in bacterial community composition between samples. Before quantifying β diversity, the read counts (reads mapped to each taxon) must be normalized to minimize the technical variability between samples. There are two common normalization procedures: total sum normalization and upper quartile normalization.
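The two normalization procedures can be sketched as follows. Conventions for upper quartile normalization vary; this version assumes the scaling factor is the 75th percentile of the non-zero counts in a sample, which is an assumption made for illustration. The taxon names and counts are hypothetical.

```python
import statistics


def total_sum_normalize(counts):
    """Divide each taxon count by the sample's total read count."""
    total = sum(counts.values())
    return {taxon: c / total for taxon, c in counts.items()}


def upper_quartile(counts):
    """75th percentile of the non-zero counts (assumed convention)."""
    nonzero = sorted(c for c in counts.values() if c > 0)
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(counts):
    """Divide each taxon count by the sample's upper quartile."""
    uq = upper_quartile(counts)
    return {taxon: c / uq for taxon, c in counts.items()}


# Hypothetical read counts for one sample.
sample = {"taxon_A": 10, "taxon_B": 30, "taxon_C": 60, "taxon_D": 0}

tss = total_sum_normalize(sample)
uqn = upper_quartile_normalize(sample)
print(tss)
print(uqn)
```

Total sum normalization turns counts into proportions that sum to 1 per sample, whereas upper quartile normalization is less sensitive to a few highly abundant taxa dominating the total.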
There are two main classes of methods for quantifying β diversity: phylogenetic methods, which consider the evolutionary differences between communities (such as UniFrac), and non-phylogenetic or taxon-based methods (such as Bray-Curtis dissimilarity). Once distances or dissimilarities between samples have been determined, the samples can be ordinated in a low-dimensional space to better illustrate how closely related they are to each other. The two most commonly used ordination tools are principal coordinates analysis (PCoA) and non-metric multidimensional scaling (NMDS).
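As a worked example of a taxon-based measure, the Bray-Curtis dissimilarity between two samples is the sum of absolute differences in taxon abundance divided by the sum of all abundances in both samples. The sketch below computes it pairwise for a few samples; the sample names and abundance vectors are hypothetical, and each vector is assumed to list the same taxa in the same order.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    if len(x) != len(y):
        raise ValueError("abundance vectors must cover the same taxa")
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


def pairwise_dissimilarity(samples):
    """Symmetric matrix (as nested dicts) of Bray-Curtis dissimilarities."""
    names = list(samples)
    return {
        a: {b: bray_curtis(samples[a], samples[b]) for b in names}
        for a in names
    }


# Hypothetical normalized abundances for three taxa in three samples.
samples = {
    "sample_1": [10, 0, 5],
    "sample_2": [5, 5, 5],
    "sample_3": [0, 8, 0],
}

matrix = pairwise_dissimilarity(samples)
for name, row in matrix.items():
    print(name, row)
```

The value ranges from 0 (identical composition) to 1 (no shared taxa). A matrix of this kind is the input that ordination methods such as PCoA or NMDS operate on.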
Predictive metagenomics profiling
The OTU abundance table can further be used to predict metabolic functions, which helps in understanding the role of the microbiome in host metabolism and disease. There are currently three powerful tools for predictive metagenomics profiling (PMP): PICRUSt, Tax4Fun, and Piphillin.
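The general idea of inferring functions from an OTU table can be sketched as weighting each OTU's gene content by its abundance. The sketch below is a simplified illustration and not the algorithm of any of the tools named above; the 16S copy-number correction, the function name, and all values are assumptions made for the example.

```python
def predict_function_abundance(otu_counts, copy_numbers, gene_content):
    """Sum, over OTUs, of (copy-number-corrected abundance) x (gene copies).

    otu_counts:   reads per OTU in one sample
    copy_numbers: assumed 16S rRNA gene copies per genome for each OTU
    gene_content: assumed copies of each function per genome for each OTU
    """
    profile = {}
    for otu, count in otu_counts.items():
        # Correct read counts for OTUs that carry several 16S copies.
        corrected = count / copy_numbers[otu]
        for function, copies in gene_content[otu].items():
            profile[function] = profile.get(function, 0.0) + corrected * copies
    return profile


# Hypothetical inputs for one sample.
otu_counts = {"OTU_1": 40, "OTU_2": 30}
copy_numbers = {"OTU_1": 4, "OTU_2": 2}
gene_content = {
    "OTU_1": {"func_A": 1, "func_B": 2},
    "OTU_2": {"func_A": 3},
}

profile = predict_function_abundance(otu_counts, copy_numbers, gene_content)
print(profile)
```

In practice the gene content of each OTU is itself a prediction drawn from reference genomes, so the resulting profile is an estimate rather than a measurement.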
16S rRNA amplicon sequencing is popular because it is cost-efficient, fast, and informative, but it also has several limitations. First, 16S sequencing is well suited to studies with many patients and to longitudinal studies, but it provides limited taxonomic and functional information. Second, PCR amplification of different regions of the 16S rRNA gene may generate discordant results, owing both to the distinct binding affinities for the corresponding flanking conserved regions and to the differing resolution of each variable region across taxa. Therefore, full-length 16S rRNA sequencing or shotgun metagenomics, especially the latter, may sometimes be preferable.