1 Research reproducibility

  • Research Reproducibility is the ability of an independent researcher to achieve the exact same results (tables, figures, and statistics) given the same input data and the same computational workflow.

1.1 How to ensure reproducibility

  • Immutable raw data: Treat raw data as “read-only.” Never modify original files manually; always perform cleaning and processing programmatically and save to new output files.
  • Environment control: Use Docker containers to encapsulate the entire operating system, system libraries, R version, and external tools.
  • Literate programming: Use R Markdown to interleave code, narrative, and outputs. This ensures the report is generated directly from the code, preventing copy-paste errors and ensuring the text matches the results.

1.2 Install Docker

  1. Understand Docker
  2. Get Docker
  3. Follow the instructions to install
  4. Run the following shell command in the terminal
$ docker run -it -v ~/Desktop/MiSeq_SOP:/home/hub1 bioinfohub/microbiome R
  • You should see the seqtab_nochim.RDS file, which we generated from the last session
  • If you don’t have it, you can download it from here
base_dir <- "/home/hub1"
dir(base_dir)

2 Environment

  • The following packages were installed on the Docker:
    • dada2
    • Biostrings
    • ShortRead
    • phyloseq
    • stringr
    • patchwork
    • ggplot2
    • data.table

3 Taxonomic assignments

3.1 Load R packages

library(dada2)

3.2 Import ASV table

if (!exists("seqtab_nochim")) { # If seqtab_nochim is not found (e.g., last session was lost)
        seqtab_nochim <- readRDS(file.path(base_dir, "seqtab_nochim.rds"))
}
dim(seqtab_nochim)
## [1]  19 189

3.3 Pre-built database for taxonomic assignment

  • DADA2-formatted training fasta file was downloaded from Taxonomic reference data
    • SILVA: Broad coverage across Bacteria, Archaea, and Eukaryota (SSU rRNA (16S/18S) and LSU rRNA)
    • RDP: rRNA sequences plus the RDP Classifier ecosystem (16S (Bacteria/Archaea) and fungal 28S)
    • GreenGenes2: Unifies 16S and genomic data in one reference tree (16S integrated with genome phylogeny)
# Set the file paths to your taxonomy reference files
tax_train_fasta <- "/home/dbs/silva_nr99_v138.2_toGenus_trainset.fa.gz"
tax_species_fasta <- "/home/dbs/silva_v138.2_assignSpecies.fa.gz"
rdp_train_fasta <- "/home/dbs/rdp_19_toGenus_trainset.fa.gz"
gg2_train_fasta <- "/home/dbs/gg2_2024_09_toGenus_trainset.fa.gz"

3.4 DADA2’s taxonomy assignment

  • The DADA2 function assignTaxonomy() provides a built-in implementation of the naïve Bayesian classifier for taxonomic assignment.
  • The DADA2 function assignSpecies() performs species-level annotation via exact sequence matching, requiring a full-length, unambiguous match between the ASV and a reference sequence annotated at the species level.
    • It reports species only when the sequenced region unambiguously supports the assignment through exact reference matching, making species-level annotation inherently limited.

3.4.1 SILVA

tax_mat <- assignTaxonomy(
        seqtab_nochim,
        refFasta = tax_train_fasta
)
tax_mat <- addSpecies(tax_mat, refFasta = tax_species_fasta, allowMultiple = TRUE)
saveRDS(tax_mat, file.path(base_dir, "DADA2_taxonomy_matrix_SILVA.rds"))

to_preview <- tax_mat # Removing sequence rownames for display only
rownames(to_preview) <- NULL
head(to_preview)
##      Kingdom    Phylum         Class         Order           Family          
## [1,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [2,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [3,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [4,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Bacteroidaceae"
## [5,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Rikenellaceae" 
## [6,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
##      Genus         Species                  
## [1,] NA            NA                       
## [2,] NA            NA                       
## [3,] NA            NA                       
## [4,] "Bacteroides" "acidifaciens/caecimuris"
## [5,] "Alistipes"   NA                       
## [6,] NA            NA

3.4.2 RDP

rdp_mat <- assignTaxonomy(
        seqtab_nochim,
        refFasta = rdp_train_fasta
)
saveRDS(rdp_mat, file.path(base_dir, "DADA2_taxonomy_matrix_RDP.rds"))

to_preview <- rdp_mat # Removing sequence rownames for display only
rownames(to_preview) <- NULL
head(to_preview)
##      Kingdom    Phylum         Class         Order           Family          
## [1,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [2,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [3,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [4,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Bacteroidaceae"
## [5,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Rikenellaceae" 
## [6,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
##      Genus        
## [1,] "Duncaniella"
## [2,] "Duncaniella"
## [3,] "Muribaculum"
## [4,] "Bacteroides"
## [5,] "Alistipes"  
## [6,] "Muribaculum"

3.4.3 GG2

gg2_mat <- assignTaxonomy(
        seqtab_nochim,
        refFasta = gg2_train_fasta
)
saveRDS(gg2_mat, base::file.path(base_dir, "DADA2_taxonomy_matrix_GG2.rds"))

to_preview <- gg2_mat # Removing sequence rownames for display only
rownames(to_preview) <- NULL
head(to_preview)
##      Kingdom       Phylum            Class            Order             
## [1,] "d__Bacteria" "p__Bacteroidota" "c__Bacteroidia" "o__Bacteroidales"
## [2,] "d__Bacteria" "p__Bacteroidota" "c__Bacteroidia" "o__Bacteroidales"
## [3,] "d__Bacteria" "p__Bacteroidota" "c__Bacteroidia" "o__Bacteroidales"
## [4,] "d__Bacteria" "p__Bacteroidota" "c__Bacteroidia" "o__Bacteroidales"
## [5,] "d__Bacteria" "p__Bacteroidota" "c__Bacteroidia" "o__Bacteroidales"
## [6,] "d__Bacteria" "p__Bacteroidota" "c__Bacteroidia" "o__Bacteroidales"
##      Family              Genus                    
## [1,] "f__Muribaculaceae" "g__Duncaniella"         
## [2,] "f__Muribaculaceae" "g__Duncaniella"         
## [3,] "f__Muribaculaceae" "g__UBA7173"             
## [4,] "f__Bacteroidaceae" "g__Bacteroides_H_857956"
## [5,] "f__Rikenellaceae"  "g__Alistipes_A_871400"  
## [6,] "f__Muribaculaceae" NA

4 Limitations of 16S rRNA Sequencing

  • Kingdom restriction:
    • Standard 16S primers target Bacteria and Archaea only.
    • It misses viruses (virome), fungi (mycobiome), and other eukaryotes unless specific markers (like ITS or 18S) are used.
  • Limited taxonomic resolution:
    • 16S sequencing typically resolves taxonomy only to the Genus level.
    • It often struggles to differentiate between closely related species or strains (e.g., distinguishing Escherichia coli from Shigella), whereas shotgun metagenomics can provide strain-level resolution.
  • Predicted vs. actual function:
    • 16S tells you “who is there” (taxonomic composition). Functional capabilities must be predicted based on databases, which is an approximation.
    • Shotgun metagenomics sequences the entire genome, revealing “what they can do” by directly detecting metabolic genes.

5 SessionInfo

## ─ Session info ─────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.5.2 (2025-10-31)
##  os       Ubuntu 24.04.3 LTS
##  system   aarch64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────
##  package * version date (UTC) lib source
##  dada2   * 1.36.0  2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
##  Rcpp    * 1.1.1   2026-01-10 [1] RSPM (R 4.5.0)
##
##  [1] /usr/local/lib/R/site-library
##  [2] /usr/local/lib/R/library
##  * ── Packages attached to the search path.
##
## ────────────────────────────────────────────────────────────────────────────────────────────────