1 Introduction

1.1 Background

  • This project provides a hands-on, guided introduction to microbiome bioinformatics within the Directed Study in Medical Sciences (Winter 2026), focusing on how raw sequencing data are transformed into analysis-ready inputs.
  • Using a 16S rRNA sequencing workflow as a toy/example project, students will practice essential preprocessing steps, including data organization, file handling, filtering, trimming, and foundational quality-control (QC) checks.
  • The project emphasizes practical decision-making in early-stage microbiome analysis, helping students understand how preprocessing choices affect downstream reliability, interpretation, and biological conclusions.

1.2 Aims

  1. Build core technical fluency in microbiome data handling (file formats, metadata organization, reproducible folder/workflow structure).
  2. Apply a consensus preprocessing pipeline for 16S rRNA data, including filtering and trimming of low-quality or non-informative reads.
  3. Evaluate dataset quality using key QC metrics and summarize whether data are suitable for downstream microbial community analysis.
  4. Strengthen analytical reasoning by linking preprocessing parameters to data quality outcomes and potential biases.

1.3 Deliverables

  1. Filtering and trimming record: Documented preprocessing parameters (based on quality profiles) and a short rationale for each key choice.
  2. Quality control summary: Pre-/post-processing QC evidence (quality profiles and read-retention table) showing data quality improvement and reliability.
  3. Export analysis-ready ASVs: Generate and export the ASV (amplicon sequence variant) abundance table (samples × ASVs) and representative ASV sequences.

1.4 Materials

  • Longitudinal mouse 16S rRNA amplicon reads provided in the mothur MiSeq SOP tutorial materials; we used this public dataset to demonstrate the QC workflow.
  • 19 time points, Day 0 (i.e. the day of weaning), 1, 2, 3, 5, 6, 7, 8, 9, 141, 142, 143, 144, 145, 146, 147, 148, 149, and 150, from Female mouse #3, N=19.

2 Environment

  • OS: macOS 26.2
  • Platform: aarch64-apple-darwin20
  • Software: R (v4.5.1)
  • R pakcages: dada2 (v1.36.0), Biostrings (v2.76.0), data.table (v1.17.8), ggplot2 (v4.0.1), patchwork (v1.3.2)

3 Preprocessing

3.1 Preparation in the R environment

  • Set the working directory to import FASTQ files and save all outputs generated in this report
  • If you downloaded and unzipped the demo dataset in your Downloads folder, your working directory should be: ~/Downloads/MiSeq_SOP
# Do not run
base_dir <- "WORKING_DIRECTORY" # path to your project folder
setwd(base_dir)

3.2 Load R packages

  • Required packages are listed in Section 2
library(dada2)
library(stringr)
library(patchwork)
library(ggplot2)
library(data.table)
library(Biostrings)

3.3 Import FASTQ files

  • List FASTQ files and derive sample IDs
r1_suffix <- "_R1_001.fastq"
r2_suffix <- "_R2_001.fastq"

r1_files <- sort(list.files(base_dir, pattern = paste0(r1_suffix, "$"), full.names = TRUE))
r2_files <- sort(list.files(base_dir, pattern = paste0(r2_suffix, "$"), full.names = TRUE))

sample_ids <- sapply(str_split(basename(r1_files), "_"), "[[", 1)
names(r1_files) <- sample_ids
names(r2_files) <- sample_ids

3.4 Read quality

3.4.1 Raw reads

  • Per-base quality profiles from the original demultiplexed FASTQs, before any trimming or filtering.
  • plotQualityProfile() shows the distribution of base quality at each read position across many reads.
    • X-axis: Base position along the read (cycle).
    • Y-axis: Phred quality score (higher = better base-call confidence).
    • Green line (approx. mean quality): Average trend across positions.
    • Orange line (approx. median quality): Central tendency trend.
p_raw_R1 <- plotQualityProfile(r1_files, aggregate = TRUE) + labs(subtitle = "Read 1")
p_raw_R2 <- plotQualityProfile(r2_files, aggregate = TRUE) + labs(subtitle = "Read 2")
print(p_raw_R1 + p_raw_R2)

3.4.2 After filterAndTrim

  • Post-filter quality profiles: These plots show only the reads that passed DADA2 filterAndTrim() and were retained for denoising.
    • Trims low-quality sequence ends (truncLen): Cuts reads to fixed lengths based on quality profiles (e.g., forward = 240 bp) to remove low-quality tail regions.
    • Applies expected-error filtering (maxEE): Removes reads whose total expected errors exceed the set threshold (e.g., maxEE = c(2,2)), reducing likely sequencing-error reads.
    • Removes ambiguous reads (maxN): Discards reads containing ambiguous bases (N).
filter_trim_dir <- file.path(base_dir, "filter_and_trim")
dir.create(filter_trim_dir)

filt_out_r1 <- file.path(filter_trim_dir, paste0(sample_ids, r1_suffix))
filt_out_r2 <- file.path(filter_trim_dir, paste0(sample_ids, r2_suffix))

# Truncates read to a fixed final length (keeps first N bases, drops the rest)
filt_truncLen <- 0 # no trimming
# Trims bases from the start (left/5′ end) of each read
filt_trimLeft <- 0 # no trimming
# Removes reads likely to contain too many sequencing errors
filt_maxEE <- Inf # no expected-error filtering
# Removes reads containing more than the allowed number of ambiguous bases ("N")
filt_maxN <- 0 # allows no ambiguous bases

truncLen_vec <- c(as.integer(filt_truncLen), as.integer(filt_truncLen)) # R1 and R2
trimLeft_vec <- c(as.integer(filt_trimLeft), as.integer(filt_trimLeft)) # R1 and R2
maxEE_vec <- c(filt_maxEE, filt_maxEE) # R1 and R2

filt_stats <- filterAndTrim(
        fwd      = r1_files,
        rev      = r2_files,
        filt     = filt_out_r1,
        filt.rev = filt_out_r2,
        truncLen = truncLen_vec,
        trimLeft = trimLeft_vec,
        maxN     = filt_maxN,
        maxEE    = maxEE_vec
)

filt_dt <- as.data.table(filt_stats, keep.rownames = "sample_id")

p_filt_R1 <- plotQualityProfile(filt_out_r1, aggregate = TRUE) + labs(subtitle = "After filterAndTrim — Read 1")
p_filt_R2 <- plotQualityProfile(filt_out_r2, aggregate = TRUE) + labs(subtitle = "After filterAndTrim — Read 2")
print(p_filt_R1 + p_filt_R2)

3.5 Learn the error rates

  • How error learning works: With learnErrors(), DADA2 pools a large set of quality-filtered bases, compares observed substitutions (e.g., A→G, C→T) across quality scores, and fits an empirical model of error probability for each possible transition as a function of Phred quality.
  • Why this matters for denoising: During sample inference, DADA2 uses these learned probabilities to test whether a low-abundance sequence is better explained as an error from a more abundant sequence or as a true biological variant (ASV).
  • What the plot shows: The plotErrors() output visualizes learned substitution error rates versus nominal quality-score expectations; close agreement indicates a well-fit model, while large deviations can suggest insufficient data or atypical run quality.
    • Each panel corresponds to a specific substitution type (e.g., A→G), showing how often one base is misread as another
    • Axes and elements:
      • X-axis: Phred quality score (Q). Higher Q means higher expected base-call accuracy
      • Y-axis: Estimated probability of that substitution error on a log scale
      • Points: Empirical error estimates derived from the observed data used to learn errors
      • Lines: The fitted error model that DADA2 will use during denoising
    • How to interpret a “good” plot:
      • Monotonic trend: Error rates should generally decrease as Q increases (higher quality, fewer errors)
      • Close agreement: Points and lines should be reasonably aligned, indicating the model fits the empirical signal
      • Reasonable magnitude: At high Q scores, error rates should be very low; at lower Q scores, they should be higher
# How much data is sampled to fit error rates
# - More bases: usually more stable/accurate error estimates, but longer runtime and higher memory use
# - Fewer bases: faster, but potentially noisier error estimates
# Rule of thumb: nbases = 1e8 (100,000,000 bp) is a “robust/default-like” choice for many datasets
learn_nbases <- 1e8

errF <- learnErrors(filt_out_r1, nbases = learn_nbases)
## 36798033 total bases in 146666 reads from 19 samples will be used for learning the error rates.
errR <- learnErrors(filt_out_r2, nbases = learn_nbases)
## 36718987 total bases in 146666 reads from 19 samples will be used for learning the error rates.
p_errF <- plotErrors(errF, nominalQ = TRUE) + labs(subtitle = "Forward (Read 1)")
p_errR <- plotErrors(errR, nominalQ = TRUE) + labs(subtitle = "Reverse (Read 2)")
print(p_errF + p_errR)

3.6 Dereplication and denoising

  • Dereplication: Compresses each sample into a set of unique sequences and counts, which reduces redundancy and preserves abundance information needed for inference.
    • Efficiency: Operating on unique sequences (rather than all reads) substantially reduces runtime and memory requirements.
    • Accuracy: Abundance-aware, error-model–based inference improves resolution beyond OTU clustering by separating closely related true variants from error-derived sequences.
  • Denoising (the primary DADA2 step): Using the learned error model, DADA2 determines whether sequence differences are likely sequencing errors or true biological variants, then outputs a per-sample ASV count table (samples × ASVs).
derepF <- derepFastq(filt_out_r1)
names(derepF) <- sub(paste0(r1_suffix, "$"), "", basename(filt_out_r1))
dadaF <- dada(derepF, err = errF)
## Sample 1 - 7733 reads in 2848 unique sequences.
## Sample 2 - 5829 reads in 2379 unique sequences.
## Sample 3 - 5926 reads in 2074 unique sequences.
## Sample 4 - 3158 reads in 1216 unique sequences.
## Sample 5 - 3164 reads in 1252 unique sequences.
## Sample 6 - 4798 reads in 1848 unique sequences.
## Sample 7 - 7331 reads in 2461 unique sequences.
## Sample 8 - 4993 reads in 2020 unique sequences.
## Sample 9 - 16956 reads in 5187 unique sequences.
## Sample 10 - 12332 reads in 3954 unique sequences.
## Sample 11 - 13006 reads in 4333 unique sequences.
## Sample 12 - 5474 reads in 2159 unique sequences.
## Sample 13 - 19489 reads in 5539 unique sequences.
## Sample 14 - 6726 reads in 2093 unique sequences.
## Sample 15 - 4418 reads in 1701 unique sequences.
## Sample 16 - 7933 reads in 2593 unique sequences.
## Sample 17 - 5103 reads in 1632 unique sequences.
## Sample 18 - 5274 reads in 1950 unique sequences.
## Sample 19 - 7023 reads in 2450 unique sequences.
derepR <- derepFastq(filt_out_r2)
names(derepR) <- sub(paste0(r2_suffix, "$"), "", basename(filt_out_r2))
dadaR <- dada(derepR, err = errR)
## Sample 1 - 7733 reads in 5217 unique sequences.
## Sample 2 - 5829 reads in 3868 unique sequences.
## Sample 3 - 5926 reads in 4229 unique sequences.
## Sample 4 - 3158 reads in 2417 unique sequences.
## Sample 5 - 3164 reads in 2390 unique sequences.
## Sample 6 - 4798 reads in 3795 unique sequences.
## Sample 7 - 7331 reads in 5677 unique sequences.
## Sample 8 - 4993 reads in 3623 unique sequences.
## Sample 9 - 16956 reads in 12492 unique sequences.
## Sample 10 - 12332 reads in 8540 unique sequences.
## Sample 11 - 13006 reads in 9018 unique sequences.
## Sample 12 - 5474 reads in 4086 unique sequences.
## Sample 13 - 19489 reads in 12787 unique sequences.
## Sample 14 - 6726 reads in 4727 unique sequences.
## Sample 15 - 4418 reads in 3258 unique sequences.
## Sample 16 - 7933 reads in 5423 unique sequences.
## Sample 17 - 5103 reads in 3640 unique sequences.
## Sample 18 - 5274 reads in 3467 unique sequences.
## Sample 19 - 7023 reads in 4651 unique sequences.

3.7 Merge paired reads

  • Combines denoised forward and reverse reads within each sample by aligning their overlapping regions.
    • Reconstructs a single high-confidence amplicon sequence per variant when overlap is sufficient and concordant.
    • Improves accuracy by requiring agreement across the overlap; read pairs with poor overlap or conflicting bases are not merged.
merge_args <- list(dadaF = dadaF, derepF = derepF, dadaR = dadaR, derepR = derepR, verbose = TRUE)
mergers <- do.call(mergePairs, merge_args)
seqtab <- makeSequenceTable(mergers)

asv_seqs <- colnames(seqtab)
asv_ids <- sprintf("ASV%03d", seq_along(asv_seqs))

asv_map_dt <- data.table(
        asv_id       = asv_ids,
        asv_sequence = asv_seqs,
        asv_length   = nchar(asv_seqs),
        total_count  = as.integer(colSums(seqtab))
)

to_preview <- seqtab
colnames(to_preview) <- asv_ids
head(to_preview[, c(1:5)])
##        ASV001 ASV002 ASV003 ASV004 ASV005
## F3D0      605    357    478    168    196
## F3D1      416    364     79    146    203
## F3D141    462    384    543    200    345
## F3D142    305    313    181    194     94
## F3D143    242    183    250    138     89
## F3D144    439    295    381    120     51
dim(seqtab)
## [1]  19 200
colnames(seqtab)[1] # ASV001
## [1] "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGAGAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGGGTATCGAACAGG"
table(nchar(getSequences(seqtab)))
## 
## 251 252 253 254 255 
##   1  33 160   4   2

3.8 Remove chimeras

  • DADA2 identifies a sequence as chimeric when it can be reconstructed as a two-parent (bimera) combination of more abundant ASVs in the dataset, with breakpoints consistent with recombination.
  • Chimera checking is done de novo from the inferred ASVs (no external reference required), typically using sample-consensus logic to flag sequences repeatedly identified as bimeric across samples.
  • Removes likely PCR chimeras (artificial sequences formed by recombination of two true templates during amplification) from the ASV table using removeBimeraDenovo().
seqtab_nochim <- removeBimeraDenovo(seqtab)
dim(seqtab_nochim)
## [1]  19 189

3.9 Read retention overview

  • Record the number of reads removed at each step and export stats with sample metadata.
getN <- function(x) sum(getUniques(x))

filt_dt$sample_id <- sapply(str_split(filt_dt$sample_id, "_"), "[[", 1)
track_dt <- data.table(
        filt_dt,
        denoised_F = vapply(dadaF, getN, integer(1)),
        denoised_R = vapply(dadaR, getN, integer(1)),
        merged     = rowSums(seqtab, na.rm = TRUE),
        nonchim    = rowSums(seqtab_nochim, na.rm = TRUE)
)
track_dt[, pct_remain := 100 * (nonchim / reads.in)]
head(track_dt)
##    sample_id reads.in reads.out denoised_F denoised_R merged nonchim pct_remain
##       <char>    <num>     <num>      <int>      <int>  <num>   <num>      <num>
## 1:      F3D0     7793      7733       7472       7096   6043    6043   77.54395
## 2:      F3D1     5869      5829       5665       5456   5084    5084   86.62464
## 3:    F3D141     5958      5926       5736       5301   4497    4497   75.47835
## 4:    F3D142     3183      3158       2991       2718   2237    2237   70.27961
## 5:    F3D143     3178      3164       3004       2788   2235    2235   70.32725
## 6:    F3D144     4827      4798       4591       4120   3304    3304   68.44831

3.10 Export

  • Summary statistics: Exported key run-level and sample-level processing summaries (e.g., reads retained across steps) for QC documentation and reporting.
  • ASV representative sequences (FASTA): Exported the unique inferred ASV nucleotide sequences from seqtab into asv_sequences.fasta.
## DNAStringSet object of length 200:
##       width seq                                             names               
##   [1]   252 TACGGAGGATGCGAGCGTTATC...GAAAGTGTGGGTATCGAACAGG ASV001
##   [2]   252 TACGGAGGATGCGAGCGTTATC...GAAAGCGTGGGTATCGAACAGG ASV002
##   [3]   252 TACGGAGGATGCGAGCGTTATC...GAAAGTGCGGGGATCGAACAGG ASV003
##   [4]   253 TACGGAGGATCCGAGCGTTATC...GAAAGTGTGGGTATCAAACAGG ASV004
##   [5]   253 TACGGAGGATTCAAGCGTTATC...GAAAGCGTGGGGAGCAAACAGG ASV005
##   ...   ... ...
## [196]   253 TACGTAGGGGGCAAGCGTTATC...GAAAGCGTGGGTATCGAACAGG ASV196
## [197]   253 TACGTAGGGTGCGAGCGTTAAT...GAAAGCGTGGGTAGCAAACAGG ASV197
## [198]   253 TACGTAGGGGGCAAGCGTTATC...GAAAGCGTGGGTATCGAACAGG ASV198
## [199]   253 TACGTAGGGGGCAAGCGTTATC...GAAAGCGTGGGGAGCAAACAGG ASV199
## [200]   252 TACGTAGGGGGCGAGCGTTATC...GAAAGCGTGGGGAGCAAATAGG ASV200

4 Research reproducibility

4.1 Parameters used in this analysis

Parameter Description This analysis Default
filt_truncLen Truncate reads to a fixed length 0 0
filt_trimLeft Remove a fixed number of bases from the 5′ end 0 0
filt_maxEE Discard reads with expected errors exceeding this threshold Inf Inf
filt_maxN Discard reads with more than this number of ambiguous bases (N) 0 0
learn_nbases Number of bases sampled for error rate estimation 1E+08 1E+08

4.2 Session info

## ─ Session info ─────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.5.1 (2025-06-13)
##  os       macOS Tahoe 26.2
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_CA.UTF-8
##  ctype    en_CA.UTF-8
## 
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────
##  package      * version date (UTC) lib source
##  BiocGenerics * 0.54.0  2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
##  Biostrings   * 2.76.0  2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
##  dada2        * 1.36.0  2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
##  data.table   * 1.17.8  2025-07-10 [1] CRAN (R 4.5.0)
##  generics     * 0.1.4   2025-05-09 [1] CRAN (R 4.5.0)
##  GenomeInfoDb * 1.44.2  2025-08-18 [1] Bioconductor 3.21 (R 4.5.1)
##  ggplot2      * 4.0.1   2025-11-14 [1] CRAN (R 4.5.2)
##  IRanges      * 2.42.0  2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
##  patchwork    * 1.3.2   2025-08-25 [1] CRAN (R 4.5.0)
##  Rcpp         * 1.1.0   2025-07-02 [1] CRAN (R 4.5.0)
##  S4Vectors    * 0.46.0  2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
##  stringr      * 1.5.1   2023-11-14 [1] CRAN (R 4.5.0)
##  XVector      * 0.48.0  2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
##  * ── Packages attached to the search path.
## 
## ────────────────────────────────────────────────────────────────────────────────────────────────