Preprocessing
Preparation in the R environment
- Set the working directory to import FASTQ files and save all outputs generated in this report
- If you downloaded and unzipped the demo dataset in your Downloads folder, your working directory should be: ~/Downloads/MiSeq_SOP
# Do not run
base_dir <- "WORKING_DIRECTORY" # replace with the path to your project folder, e.g., "~/Downloads/MiSeq_SOP"
setwd(base_dir)
Load R packages
- Required packages are listed in Section 2
library(dada2)
library(stringr)
library(patchwork)
library(ggplot2)
library(data.table)
library(Biostrings)
Import FASTQ files
- List FASTQ files and derive sample IDs
# Suffixes of the demultiplexed forward (R1) and reverse (R2) FASTQ files
r1_suffix <- "_R1_001.fastq"
r2_suffix <- "_R2_001.fastq"
# List the paired FASTQ files and name them by sample ID (the text before the first "_")
r1_files <- sort(list.files(base_dir, pattern = paste0(r1_suffix, "$"), full.names = TRUE))
r2_files <- sort(list.files(base_dir, pattern = paste0(r2_suffix, "$"), full.names = TRUE))
sample_ids <- sapply(str_split(basename(r1_files), "_"), "[[", 1)
names(r1_files) <- sample_ids
names(r2_files) <- sample_ids
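A quick consistency check can go here (a minimal sketch, assuming every sample has exactly one R1/R2 pair sharing the same prefix):
# Verify that forward and reverse files pair up one-to-one by sample ID
stopifnot(length(r1_files) == length(r2_files))
stopifnot(identical(names(r1_files), names(r2_files)))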
Read quality
Raw reads
- Per-base quality profiles from the original demultiplexed FASTQs, before any trimming or filtering.
- plotQualityProfile() shows the distribution of base quality at each read position across many reads.
- X-axis: Base position along the read (cycle).
- Y-axis: Phred quality score (higher = better base-call confidence).
- Green line (approx. mean quality): Average trend across positions.
- Orange line (approx. median quality): Central tendency trend.
p_raw_R1 <- plotQualityProfile(r1_files, aggregate = TRUE) + labs(subtitle = "Read 1")
p_raw_R2 <- plotQualityProfile(r2_files, aggregate = TRUE) + labs(subtitle = "Read 2")
print(p_raw_R1 + p_raw_R2)

After filterAndTrim
- Post-filter quality profiles: These plots show only the reads that passed DADA2 filterAndTrim() and were retained for denoising.
- Trims low-quality sequence ends (truncLen): Cuts reads to fixed lengths based on quality profiles (e.g., forward = 240 bp) to remove low-quality tail regions.
- Applies expected-error filtering (maxEE): Removes reads whose total expected errors exceed the set threshold (e.g., maxEE = c(2, 2)), reducing likely sequencing-error reads.
- Removes ambiguous reads (maxN): Discards reads containing ambiguous bases (N).
filter_trim_dir <- file.path(base_dir, "filter_and_trim") # output directory for filtered FASTQs
dir.create(filter_trim_dir, showWarnings = FALSE)
filt_out_r1 <- file.path(filter_trim_dir, paste0(sample_ids, r1_suffix))
filt_out_r2 <- file.path(filter_trim_dir, paste0(sample_ids, r2_suffix))
# Truncates each read to a fixed length (keeps the first N bases); with truncLen > 0, reads shorter than N are discarded
filt_truncLen <- 0 # 0 = no truncation
# Trims bases from the start (left/5′ end) of each read
filt_trimLeft <- 0 # no trimming
# Removes reads likely to contain too many sequencing errors
filt_maxEE <- Inf # no expected-error filtering
# Removes reads containing more than the allowed number of ambiguous bases ("N")
filt_maxN <- 0 # allows no ambiguous bases
truncLen_vec <- c(as.integer(filt_truncLen), as.integer(filt_truncLen)) # R1 and R2
trimLeft_vec <- c(as.integer(filt_trimLeft), as.integer(filt_trimLeft)) # R1 and R2
maxEE_vec <- c(filt_maxEE, filt_maxEE) # R1 and R2
filt_stats <- filterAndTrim(
fwd = r1_files,
rev = r2_files,
filt = filt_out_r1,
filt.rev = filt_out_r2,
truncLen = truncLen_vec,
trimLeft = trimLeft_vec,
maxN = filt_maxN,
maxEE = maxEE_vec
)
filt_dt <- as.data.table(filt_stats, keep.rownames = "sample_id")
p_filt_R1 <- plotQualityProfile(filt_out_r1, aggregate = TRUE) + labs(subtitle = "After filterAndTrim — Read 1")
p_filt_R2 <- plotQualityProfile(filt_out_r2, aggregate = TRUE) + labs(subtitle = "After filterAndTrim — Read 2")
print(p_filt_R1 + p_filt_R2)
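Before moving on, the per-sample read counts returned by filterAndTrim() can be summarised (a minimal sketch; pct_kept is just an illustrative derived column, not part of the original tables):
# Per-sample reads before (reads.in) and after (reads.out) filtering
filt_dt[, .(sample_id, reads.in, reads.out, pct_kept = round(100 * reads.out / reads.in, 1))]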

Learn the error rates
- How error learning works: With learnErrors(), DADA2 pools a large set of quality-filtered bases, compares observed substitutions (e.g., A→G, C→T) across quality scores, and fits an empirical model of error probability for each possible transition as a function of Phred quality.
- Why this matters for denoising: During sample inference, DADA2 uses these learned probabilities to test whether a low-abundance sequence is better explained as an error from a more abundant sequence or as a true biological variant (ASV).
- What the plot shows: The plotErrors() output visualizes learned substitution error rates versus nominal quality-score expectations; close agreement indicates a well-fit model, while large deviations can suggest insufficient data or atypical run quality.
- Each panel corresponds to a specific substitution type (e.g., A→G), showing how often one base is misread as another
- Axes and elements:
- X-axis: Phred quality score (Q). Higher Q means higher expected base-call accuracy
- Y-axis: Estimated probability of that substitution error on a log scale
- Points: Empirical error estimates derived from the observed data used to learn errors
- Lines: The fitted error model that DADA2 will use during denoising
- How to interpret a “good” plot:
- Monotonic trend: Error rates should generally decrease as Q increases (higher quality, fewer errors)
- Close agreement: Points and lines should be reasonably aligned, indicating the model fits the empirical signal
- Reasonable magnitude: At high Q scores, error rates should be very low; at lower Q scores, they should be higher
# How much data is sampled to fit error rates
# - More bases: usually more stable/accurate error estimates, but longer runtime and higher memory use
# - Fewer bases: faster, but potentially noisier error estimates
# Rule of thumb: nbases = 1e8 (100,000,000 bp) is a “robust/default-like” choice for many datasets
learn_nbases <- 1e8
errF <- learnErrors(filt_out_r1, nbases = learn_nbases)
## 36798033 total bases in 146666 reads from 19 samples will be used for learning the error rates.
errR <- learnErrors(filt_out_r2, nbases = learn_nbases)
## 36718987 total bases in 146666 reads from 19 samples will be used for learning the error rates.
p_errF <- plotErrors(errF, nominalQ = TRUE) + labs(subtitle = "Forward (Read 1)")
p_errR <- plotErrors(errR, nominalQ = TRUE) + labs(subtitle = "Reverse (Read 2)")
print(p_errF + p_errR)
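Beyond the plots, the fitted error-rate matrix can be inspected directly with getErrors(); a minimal sketch (the subset indices are only for a compact preview):
# Rows are substitution types (e.g., A2C), columns are quality scores
round(getErrors(errF)[1:4, 1:6], 4)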

Dereplication and denoising
- Dereplication: Compresses each sample into a set of unique sequences and counts, which reduces redundancy and preserves abundance information needed for inference.
- Efficiency: Operating on unique sequences (rather than all reads) substantially reduces runtime and memory requirements.
- Accuracy: Abundance-aware, error-model–based inference improves resolution beyond OTU clustering by separating closely related true variants from error-derived sequences.
- Denoising (the primary DADA2 step): Using the learned error model, DADA2 determines whether sequence differences are likely sequencing errors or true biological variants, then outputs a per-sample ASV count table (samples × ASVs).
derepF <- derepFastq(filt_out_r1)
names(derepF) <- sub(paste0(r1_suffix, "$"), "", basename(filt_out_r1))
dadaF <- dada(derepF, err = errF)
## Sample 1 - 7733 reads in 2848 unique sequences.
## Sample 2 - 5829 reads in 2379 unique sequences.
## Sample 3 - 5926 reads in 2074 unique sequences.
## Sample 4 - 3158 reads in 1216 unique sequences.
## Sample 5 - 3164 reads in 1252 unique sequences.
## Sample 6 - 4798 reads in 1848 unique sequences.
## Sample 7 - 7331 reads in 2461 unique sequences.
## Sample 8 - 4993 reads in 2020 unique sequences.
## Sample 9 - 16956 reads in 5187 unique sequences.
## Sample 10 - 12332 reads in 3954 unique sequences.
## Sample 11 - 13006 reads in 4333 unique sequences.
## Sample 12 - 5474 reads in 2159 unique sequences.
## Sample 13 - 19489 reads in 5539 unique sequences.
## Sample 14 - 6726 reads in 2093 unique sequences.
## Sample 15 - 4418 reads in 1701 unique sequences.
## Sample 16 - 7933 reads in 2593 unique sequences.
## Sample 17 - 5103 reads in 1632 unique sequences.
## Sample 18 - 5274 reads in 1950 unique sequences.
## Sample 19 - 7023 reads in 2450 unique sequences.
derepR <- derepFastq(filt_out_r2)
names(derepR) <- sub(paste0(r2_suffix, "$"), "", basename(filt_out_r2))
dadaR <- dada(derepR, err = errR)
## Sample 1 - 7733 reads in 5217 unique sequences.
## Sample 2 - 5829 reads in 3868 unique sequences.
## Sample 3 - 5926 reads in 4229 unique sequences.
## Sample 4 - 3158 reads in 2417 unique sequences.
## Sample 5 - 3164 reads in 2390 unique sequences.
## Sample 6 - 4798 reads in 3795 unique sequences.
## Sample 7 - 7331 reads in 5677 unique sequences.
## Sample 8 - 4993 reads in 3623 unique sequences.
## Sample 9 - 16956 reads in 12492 unique sequences.
## Sample 10 - 12332 reads in 8540 unique sequences.
## Sample 11 - 13006 reads in 9018 unique sequences.
## Sample 12 - 5474 reads in 4086 unique sequences.
## Sample 13 - 19489 reads in 12787 unique sequences.
## Sample 14 - 6726 reads in 4727 unique sequences.
## Sample 15 - 4418 reads in 3258 unique sequences.
## Sample 16 - 7933 reads in 5423 unique sequences.
## Sample 17 - 5103 reads in 3640 unique sequences.
## Sample 18 - 5274 reads in 3467 unique sequences.
## Sample 19 - 7023 reads in 4651 unique sequences.
Merge paired reads
- Combines denoised forward and reverse reads within each sample by aligning their overlapping regions.
- Reconstructs a single high-confidence amplicon sequence per variant when overlap is sufficient and concordant.
- Improves accuracy by requiring agreement across the overlap; read pairs with poor overlap or conflicting bases are not merged.
mergers <- mergePairs(dadaF, derepF, dadaR, derepR, verbose = TRUE)
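# Optional check: each element of `mergers` is a per-sample data.frame with the merged
# sequence, its abundance, and overlap diagnostics (nmatch, nmismatch, nindel)
head(mergers[[1]])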
seqtab <- makeSequenceTable(mergers) # sample-by-sequence count table (rows = samples, columns = merged sequences)
asv_seqs <- colnames(seqtab)
asv_ids <- sprintf("ASV%03d", seq_along(asv_seqs)) # short labels: ASV001, ASV002, ...
asv_map_dt <- data.table(
asv_id = asv_ids,
asv_sequence = asv_seqs,
asv_length = nchar(asv_seqs),
total_count = as.integer(colSums(seqtab))
)
to_preview <- seqtab
colnames(to_preview) <- asv_ids
head(to_preview[, c(1:5)])
## ASV001 ASV002 ASV003 ASV004 ASV005
## F3D0 605 357 478 168 196
## F3D1 416 364 79 146 203
## F3D141 462 384 543 200 345
## F3D142 305 313 181 194 94
## F3D143 242 183 250 138 89
## F3D144 439 295 381 120 51
dim(seqtab)
## [1] 19 200
colnames(seqtab)[1] # ASV001
## [1] "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGAGAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGGGTATCGAACAGG"
table(nchar(getSequences(seqtab)))
##
## 251 252 253 254 255
## 1 33 160 4 2
Remove chimeras
- DADA2 identifies a sequence as chimeric when it can be reconstructed as a two-parent (bimera) combination of more abundant ASVs in the dataset, with breakpoints consistent with recombination.
- Chimera checking is done de novo from the inferred ASVs (no external reference required), typically using sample-consensus logic to flag sequences repeatedly identified as bimeric across samples.
- Removes likely PCR chimeras (artificial sequences formed by recombination of two true templates during amplification) from the ASV table using removeBimeraDenovo().
seqtab_nochim <- removeBimeraDenovo(seqtab) # de novo bimera removal (consensus method by default)
dim(seqtab_nochim)
## [1] 19 189
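As a quick sanity check (a minimal sketch; this global fraction is not part of the output above), chimeric ASVs should usually account for only a small share of total reads even when many ASVs are flagged:
# Overall fraction of reads retained after bimera removal
sum(seqtab_nochim) / sum(seqtab)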
Read retention overview
- Record the number of reads removed at each step and export stats with sample metadata.
getN <- function(x) sum(getUniques(x))
filt_dt$sample_id <- sapply(str_split(filt_dt$sample_id, "_"), "[[", 1)
track_dt <- data.table(
filt_dt,
denoised_F = vapply(dadaF, getN, integer(1)),
denoised_R = vapply(dadaR, getN, integer(1)),
merged = rowSums(seqtab, na.rm = TRUE),
nonchim = rowSums(seqtab_nochim, na.rm = TRUE)
)
track_dt[, pct_remain := 100 * (nonchim / reads.in)]
head(track_dt)
## sample_id reads.in reads.out denoised_F denoised_R merged nonchim pct_remain
## <char> <num> <num> <int> <int> <num> <num> <num>
## 1: F3D0 7793 7733 7472 7096 6043 6043 77.54395
## 2: F3D1 5869 5829 5665 5456 5084 5084 86.62464
## 3: F3D141 5958 5926 5736 5301 4497 4497 75.47835
## 4: F3D142 3183 3158 2991 2718 2237 2237 70.27961
## 5: F3D143 3178 3164 3004 2788 2235 2235 70.32725
## 6: F3D144 4827 4798 4591 4120 3304 3304 68.44831
Export
- Summary statistics: Exported key run-level and sample-level processing summaries (e.g., reads retained across steps) for QC documentation and reporting.
- ASV representative sequences (FASTA): Exported the unique inferred ASV nucleotide sequences from seqtab into asv_sequences.fasta.
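A minimal sketch of the export step, consistent with the DNAStringSet preview below (asv_sequences.fasta is named above; the CSV file names are illustrative assumptions):
# Representative ASV sequences as a named DNAStringSet (Biostrings)
asv_fasta <- DNAStringSet(asv_map_dt$asv_sequence)
names(asv_fasta) <- asv_map_dt$asv_id
asv_fasta
# Write the FASTA and the summary tables to the working directory
writeXStringSet(asv_fasta, filepath = file.path(base_dir, "asv_sequences.fasta"))
fwrite(track_dt, file.path(base_dir, "read_tracking_summary.csv")) # assumed file name
fwrite(asv_map_dt, file.path(base_dir, "asv_sequence_map.csv")) # assumed file name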
## DNAStringSet object of length 200:
## width seq names
## [1] 252 TACGGAGGATGCGAGCGTTATC...GAAAGTGTGGGTATCGAACAGG ASV001
## [2] 252 TACGGAGGATGCGAGCGTTATC...GAAAGCGTGGGTATCGAACAGG ASV002
## [3] 252 TACGGAGGATGCGAGCGTTATC...GAAAGTGCGGGGATCGAACAGG ASV003
## [4] 253 TACGGAGGATCCGAGCGTTATC...GAAAGTGTGGGTATCAAACAGG ASV004
## [5] 253 TACGGAGGATTCAAGCGTTATC...GAAAGCGTGGGGAGCAAACAGG ASV005
## ... ... ...
## [196] 253 TACGTAGGGGGCAAGCGTTATC...GAAAGCGTGGGTATCGAACAGG ASV196
## [197] 253 TACGTAGGGTGCGAGCGTTAAT...GAAAGCGTGGGTAGCAAACAGG ASV197
## [198] 253 TACGTAGGGGGCAAGCGTTATC...GAAAGCGTGGGTATCGAACAGG ASV198
## [199] 253 TACGTAGGGGGCAAGCGTTATC...GAAAGCGTGGGGAGCAAACAGG ASV199
## [200] 252 TACGTAGGGGGCGAGCGTTATC...GAAAGCGTGGGGAGCAAATAGG ASV200