This function performs quality control, filtering, normalization, and transformation
of raw sequencing count data.
It can also build phyloseq objects for downstream ecological analyses,
and optionally returns intermediate processing steps.
Usage
process_ngs(
  X,
  sample_data,
  taxa_table = NULL,
  phylo_tree = NULL,
  remove_ids = NULL,
  min_reads = 500,
  min_prev = 0.1,
  normalise = c("load", "TSS", "none"),
  load_colname = NULL,
  min_load = 10000,
  transform = c("clr", "log", "none"),
  impute_control = list(method = "GBM", output = "p-counts", z.delete = FALSE,
    z.warning = 1, suppress.print = TRUE),
  raw_phyloseq = TRUE,
  eco_phyloseq = TRUE,
  return_all = FALSE,
  verbose = TRUE
)
Arguments
- X
A numeric matrix or data frame of raw counts with samples as rows and features (e.g., taxa) as columns. Row names must be sample IDs.
- sample_data
A data frame containing sample-level data. Must include a column named sample_id, and its row names must match those of X.
- taxa_table
Optional. Taxonomy annotation table used to build phyloseq objects. Row names must match the column names of X.
- phylo_tree
Optional. Phylogenetic tree to add to phyloseq objects.
- remove_ids
A regex or character vector identifying rows of X (e.g., QC samples) to filter out. Set to NULL to skip.
- min_reads
Numeric. Minimum number of total reads required per sample. Default is 500.
- min_prev
Numeric between 0 and 1. Minimum feature prevalence threshold. Default is 0.1 (i.e., a feature must be present in >= 10 % of samples).
- normalise
Normalization method. One of "load" (microbial load data), "TSS" (total sum scaling), or "none".
- load_colname
Column name in sample_data containing microbial load values. Required if normalise = "load".
- min_load
Numeric. Default is 1e4. A warning is issued if any microbial load value is below min_load.
- transform
Transformation method. One of "clr" (centered log-ratio with zero imputation), "log" (pseudo-log using log1p()), or "none". Note: when using "clr", zero values are imputed using zCompositions::cmultRepl().
- impute_control
A named list of arguments to be passed to zCompositions::cmultRepl().
- raw_phyloseq
Logical. If TRUE, constructs a phyloseq object from the table of raw counts (with failed runs filtered out if needed). Default is TRUE.
- eco_phyloseq
Logical. If TRUE, constructs a phyloseq object from the ecosystem abundances (i.e., after normalise = "load"). Default is TRUE.
- return_all
Logical. If TRUE, additional intermediate data matrices (X_matched, X_norm, X_prev) are included in the output. Default is FALSE.
- verbose
Logical. If TRUE, prints progress messages during execution. Default is TRUE.
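To make the min_reads and min_prev thresholds concrete, the following base R sketch applies both to a toy count matrix. It is an illustration only, not the package's implementation: filter_counts is a hypothetical helper, and the order of the two steps and the definition of prevalence (fraction of retained samples with a non-zero count) are assumptions.

```r
# Hypothetical helper (not part of the package) illustrating the two thresholds.
filter_counts <- function(X, min_reads = 500, min_prev = 0.1) {
  X <- as.matrix(X)
  # Keep samples (rows) whose total read count reaches min_reads
  keep_samples <- rowSums(X) >= min_reads
  X <- X[keep_samples, , drop = FALSE]
  # Assumed definition: prevalence is the fraction of retained samples
  # in which a feature has a non-zero count
  prevalence <- colMeans(X > 0)
  X[, prevalence >= min_prev, drop = FALSE]
}

# Toy data: 4 samples x 3 features
toy <- matrix(
  c(600, 0, 10,
    300, 5,  0,
    900, 0, 50,
    100, 0,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), paste0("ASV", 1:3))
)

# s2 and s4 fall below 500 reads; ASV2 is then absent from all retained samples
filtered <- filter_counts(toy, min_reads = 500, min_prev = 0.5)
```

With these toy values, only samples s1 and s3 and features ASV1 and ASV3 remain.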
Value
A named list containing:
- X_processed
Matrix of processed feature counts after filtering, normalization, and transformation.
- sdata_final
Matched and filtered sample_data corresponding to retained samples.
- phyloseq_raw
phyloseq object created from raw filtered data. NULL if raw_phyloseq = FALSE.
- phyloseq_eco
phyloseq object from ecosystem abundance data. NULL if eco_phyloseq = FALSE or normalise != "load".
- X_matched
(Optional) Matched and filtered count matrix, pre-normalization. Returned only if return_all = TRUE.
- X_norm
(Optional) Normalized count matrix. Returned only if return_all = TRUE.
- X_prev
(Optional) Prevalence-filtered matrix, pre-transformation. Returned only if return_all = TRUE.
Details
- Zeros are imputed with zCompositions::cmultRepl() before CLR transformation.
- QC or other samples are removed if remove_ids is specified.
- Sample IDs in X and sample_data row names are matched and aligned.
- Both a phyloseq_raw object containing raw counts and a phyloseq_eco object containing ecosystem counts can be generated. The latter requires a load_colname column from sample_data so that counts can be normalized by microbial load (recommended best practice).
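The normalization and transformation options can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the function's actual code: the load-scaling formula (proportions multiplied by each sample's microbial load) is an assumed form, the loads are made-up values, and the toy matrix contains no zeros so that the CLR step needs no imputation (the function itself uses zCompositions::cmultRepl() for zeros).

```r
# Toy counts: 3 samples x 3 features, deliberately without zeros
X <- matrix(c(10, 30, 60,
               5,  5, 90,
              20, 20, 60),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("s", 1:3), paste0("ASV", 1:3)))

# Hypothetical microbial loads, one per sample
load <- c(s1 = 1e5, s2 = 2e5, s3 = 5e4)

# Total sum scaling (TSS): divide each row by its total to get proportions
X_tss <- sweep(X, 1, rowSums(X), "/")

# Load scaling (assumed form): proportions multiplied by the sample's load
X_load <- sweep(X_tss, 1, load[rownames(X)], "*")

# Centered log-ratio: log values minus the per-sample mean of the logs
clr <- function(M) {
  logM <- log(M)
  sweep(logM, 1, rowMeans(logM), "-")
}
X_clr <- clr(X)

# Pseudo-log alternative, as named in the transform argument
X_log <- log1p(X)
```

After TSS each row sums to 1, after load scaling each row sums to the sample's load, and each CLR-transformed row sums to zero up to floating-point error.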
References
McMurdie, P. J., & Holmes, S. (2013). phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE, 8(4), e61217. doi:10.1371/journal.pone.0061217
Martín-Fernández, J. A., Hron, K., Templ, M., Filzmoser, P., & Palarea-Albaladejo, J. (2015). Bayesian-multiplicative treatment of count zeros in compositional data sets. Statistical Modelling, 15(2), 134–158. doi:10.1177/1471082X14535524
Palarea-Albaladejo, J., & Martín-Fernández, J. A. (2015). zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemometrics and Intelligent Laboratory Systems, 143, 85–96. doi:10.1016/j.chemolab.2015.02.019
Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., & Egozcue, J. J. (2017). Microbiome datasets are compositional: And this is not optional. Frontiers in Microbiology, 8, 2224. doi:10.3389/fmicb.2017.02224
Vandeputte, D., Kathagen, G., D’hoe, K., Vieira-Silva, S., Valles-Colomer, M., Sabino, J., Wang, J., Tito, R. Y., De Commer, L., Darzi, Y., Vermeire, S., Falony, G., & Raes, J. (2017). Quantitative microbiome profiling links gut community variation to microbial load. Nature, 551(7681), 507–511. doi:10.1038/nature24460
Examples
if (requireNamespace("phyloseq", quietly = TRUE)) {
  # Mock count matrix: 5 samples (rows) x 5 features (columns)
  mock_X <- matrix(sample(0:1000, 25, replace = TRUE),
    nrow = 5,
    dimnames = list(paste0("sample", 1:5),
                    paste0("ASV", 1:5))
  )
  # Sample-level data; row names and sample_id match the row names of mock_X
  mock_sample_data <- data.frame(
    sample_id = paste0("sample", 1:5),
    load = c(1e5, 2e5, 1e4, 5e4, 1.5e5),
    condition = factor(rep(c("A", "B"), length.out = 5)),
    row.names = paste0("sample", 1:5)
  )
  # Taxonomy table; row names match the column names of mock_X
  mock_taxa_table <- data.frame(
    Kingdom = rep("Bacteria", 5),
    Genus = paste0("Genus", 1:5),
    row.names = paste0("ASV", 1:5)
  )
  # Normalize by microbial load, without transformation
  result <- process_ngs(
    X = mock_X,
    sample_data = mock_sample_data,
    taxa_table = mock_taxa_table,
    normalise = "load",
    load_colname = "load",
    transform = "none",
    verbose = FALSE
  )
}
