diff --git a/README.md b/README.md index 8c19cb8093d9a3b1aa48161c47c417a05718989b..ff52af5c0fe7437eabf632e8d9e5bd89c82ae2c2 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ The repository includes the following contents: 1. **'analysis' folder**: contains scripts and intermediary files used and generated in the analysis. Analysis was largely conducted per primer used in metabarcoding and then synthesized in some higher level analyses. -For each primer pair, the following is provided +For each primer pair (marker), the following is provided - R script for the metabarcoding data analysis, including data preprocessing, diversity analysis, and statistical testing. - Intermediary metabarcoding data, including amplicon sequence variants (ASVs), taxonomy, ASV counts for each sample and DADA2 processing logs. - Metadata for the samples, including NDVI, and other relevant variables. @@ -28,11 +28,13 @@ For each primer pair, the following is provided git clone https://gitlab.nibio.no/simeon/persdatter-et-al-2025-metabarcoding.git ``` -2. Ensure you have the necessary R packages installed. All necessary dependencies required by the analysis scripts are listed in the top chunks of the respective scripts. +2. Optional: Obtain the raw read data from [SRA](https://www.ncbi.nlm.nih.gov/bioproject/) under the BioProject with the accession PRJNA1191665. +Run the DADA2 scripts in `/analysis/raw_read_processing`. This will re-generate the intermediate data from raw reads. The intermediate data is +already contained in each of the marker directories, e.g. `/analysis/16S/16S_DADA2_results_260821`. -3. Obtain the raw read data from [SRA](https://www.ncbi.nlm.nih.gov/bioproject/) under the BioProject with the accession PRJNA1191665. +3. Ensure you have the necessary R packages installed. All necessary dependencies required by the analysis scripts are listed in the top chunks of the respective scripts. -4. Open the R Markdown file and run the code to reproduce the metabarcoding data analysis. +4. Open the R Markdown file contained in each of the marker directories, e.g. `/analysis/16S/16S_DADA2_results_260821` to reproduce the metabarcoding data analysis. ## Citing this Repository diff --git a/analysis/raw_read_processing/16S_Dada2_pipeline_S-mod_v4-3.Rmd b/analysis/raw_read_processing/16S_Dada2_pipeline_S-mod_v4-3.Rmd new file mode 100644 index 0000000000000000000000000000000000000000..1a55837e0409848d70326b605f60390b53a86470 --- /dev/null +++ b/analysis/raw_read_processing/16S_Dada2_pipeline_S-mod_v4-3.Rmd @@ -0,0 +1,869 @@ +--- +title: "dada2 microbiome pipeline v.4.2" +author: "Simeon Rossmann, Marie L. Davey" +date: "16.12.2019" +output: + pdf_document: default + html_document: default +last_modified: "Simeon Rossmann (SWR)" +date_modified: "20.11.2020" +version: 4.3 +--- + +This is an R markdown document for the analysis of MiSeq amplicon sequencing data. In order to customise the pipeline, search 'CHANGE ME' to find different settings you should adjust according to your data. All parameters requiring changes are located within the first 4 chunks. After the fourth chunk has been run, you can simply click 'Run'\>'Run all chunks below' in order to process the data + +Comment SWR: A new chunk was added to allow for quick reanalysis of taxonomy and pooled analysis of multiple samples after error inference. Small changes in chunk 4 (exporting seqtab.RDP in addition) were made to facilitate this. Therefore it's not recommended to 'Run all chunks below' from chunk 4 in this version. + +## Changelog + +### v4 update + +Comment SWR: Two more chunks were added to integrate [LULU](https://www.nature.com/articles/s41467-017-01312-x) post-processing into this pipeline. The first of the two new chunks is a shell script chunk that generates a match list (most similar ASVs for each ASV) using vsearch. The match list can also be generated by other means (f. eks. BLAST). The second of the new chunks is the core LULU chunk that takes the seqtab_nochim file from the dada2 analysis and the match list and performs the LULU post-processing with standard parameters. After this, a new seqtab_nochim, ASV-fasta and RDP taxonomy (optional) are generated in a lulu-subdirectory. + +#### v4.3 + +This update addresses inconsistencies in code usage between authors that were the likely cause of occasional warning messages and errors when creating files and directories in-function on some machines/setups. These problems occurred inconsistently and were not reproduced but seemed to be fixed by cleaner file path encoding. + +- minor code format fixes: + + - implementation of file path concatenation cleaned up by replacing all uses of `paste0()` for file paths with `file.path()` and removal of leading and closing slashes `/` + - replaced all instances of in-code object assignment using `=` with `<-` to clearly distinguish it from user assignable objects and argument assignment in functions. This means that objects assigned with `=` are intended for user input while objects assigned with `<-` are not to be directly manipulated by users. + +- other format fixes: + + - improved cosmetics when knitting by manually adjusted line breaks in code chunks + - added spaces surrounding `=` and following `,` to conform with best practices + - more consistent line spacing in code chunks + - Better text formatting in the Markdown text portions + +#### v4.2 + +- improved cutadapt with quality control and fwd primer removal, can be run through R in separate chunk +- filterAndTrim option expanded with trimLeft to remove the first n 5' bases regardless of content + +#### v4.1 + +- fixed an error when attempting to plot empty files +- fixed an error when attempting to sample more files than available for plotting + + +## Setup R Environment + +Comment SWR: If the OS is Linux, this chunk now passes the path and marker variables to the shell script chunk, making it unnecessary to specify this separately below and avoiding issues with misspelled variables. It is possible that this also works with Mac OS, but this was not included since no Mac machine was available for testing. + +```{r setup, eval=FALSE} +############set paths and marker name +# CHANGE ME to the directory (ABSOLUTE FILEPATH e.g. on Linux /home/usr/...) +# containing the fastq files after unzipping. +path = "/home/simeon/Documents/Marte_Metabarcoding/Run_August21" +#CHANGE ME to the marker you have sequenced +marker = c("16S") +# CHANGE ME to the path for the taxonomy database you will be using for +# identification (if any) +tax_database = "no/path/here" + +#set knitr chunk options +knitr::opts_chunk$set(echo = TRUE) +knitr::opts_chunk$set(eval = FALSE) +knitr::opts_chunk$set(tidy = TRUE) +knitr::opts_chunk$set(root.dir = file.path(path)) + +#load libraries for R session +library(dada2) +library(phyloseq) +library(ggplot2) +library(Biostrings) +library(grid) +library(gridExtra) +library(ShortRead) +library(tidyverse) + +# Get info on the operating system. If Linux, make 'path' and 'marker' available +# globally, if not, warn that the script needs additional attention. +os <- Sys.info()['sysname'] +if (os == "Linux") { + Sys.setenv(projectpath = path) + Sys.setenv(marker = marker) +}else{ + cat(os, "Please run this script on Linux, + it is no longer tested and optimized for Windows") +} +``` + +## Primer presence and orientation check + +This chunk is based on the tutorial for [ITS workflows from the DADA2 GitHub page](https://benjjneb.github.io/dada2/ITS_workflow.html) and offers an assessment of primer presence before cutadapt processing. This should be done to check the orientation used in cutadapt primer removal. Raw fastq.gz files must be located in 'path' or 'path/marker' or 'path/marker/raw_data'. If fastq.gz files are in 'path/marker' or 'path/marker/raw_data', all files will be used in the analysis, if fastq.gz files are in 'path', you may choose whether to only use files that contain the marker in the file name or all files. + +```{r primer-check, eval=FALSE} +#CHANGE ME to number of samples to check. Numerical values (e.g. 1, 5, 10) or +# "all" are accepted inputs. +nSamp = 10 + +#CHANGE ME to TRUE if you only want to move files from 'path' to +# 'path/marker/raw_data' if they contain the marker in the file name. +# E.g. marker = 16S, files = 'Sxxx_16S_L001_R1.fastq.gz'. If FALSE, all fastq.gz +# files will be moved from 'path' to 'path/marker/raw_data'. All fastq.gz files +# will be moved from 'path/marker' to 'path/marker/raw_data' regardless of this. +marker_in_filename = FALSE + +#CHANGE ME by uncommenting the appropriate primer pair +#for the Oomycete ITS1 marker: + #Fprimer = "CGGAAGGATCATTACCAC" + #Rprimer = "AGCCTAGACATCCACTGCTG" +#for the Oomycete ITS2 marker: + #Fprimer = "TTGAACGCAYATTGCACTT" + #Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Fungal ITS1 marker: + #Fprimer = "AAGTCGTAACAAGGTTTCC" + #Rprimer = "TGCVAGARCCAAGAGATC" +#for the Fungal ITS2 marker: + #Fprimer = "GTGARTCATCGAATCTTTG" + #Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Bacterial 16S marker: + Fprimer = "GTGYCAGCMGCCGCGGTAA" + Rprimer = "GGACTACNVGGGTWTCTAAT" +#for the plant Sper marker: + #rc_Fprimer = "GGGCAATCCTGAGCCAA" + #rc_Rprimer = "ATTGAGTCTCTGCACCTATC" +#for the plant Trac marker: + #Fprimer = "CRTTGCCGAGAGTC" + #Rprimer = "AAGTCGTAACAAGG" +#for the nematode Nem marker + #Fprimer = "ACGGTYAGAACTAGGG" + #Rprimer = "GAGGGCAAGTCTGGTG" + +#### NO CHANGES NEEDED BELOW #### +#Check if Marker folder exists, otherwise create it +dir.create(file.path(path, marker)) + +#Check if nSamp is numeric or "all", otherwise set it to "all" +nSamp <- ifelse(is.character(nSamp) | is.numeric(nSamp), nSamp, "all") +nSamp <- ifelse(nSamp == "all" | is.numeric(nSamp), nSamp, "all") + +# Move raw reads to their own subdirectory path/marker/raw_data. If files are in +# path, check whether some contain 'marker' in name. If they exist only move +# those, otherwise move all. +raw_dir <- file.path(path, marker, "raw_data") +dir.create(raw_dir) +if (length(list.files(file.path(path, marker), + pattern = ".fastq.gz", full.names = TRUE)) != 0) { + fnraw <- sort(list.files(file.path(path, marker), + pattern = "fastq.gz", full.names = TRUE)) +} + +if (length(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) != 0) { + if (marker_in_filename) { + fnrawp <- sort(list.files( + path, paste0(marker,"(.*).fastq.gz"), full.names = TRUE)) + }else{ + fnrawp <- sort(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) + } +} +if (exists("fnraw")) { + invisible(file.copy(fnraw, file.path(raw_dir, basename(fnraw)))) + invisible(file.remove(fnraw)) + rm(fnraw)} +if (exists("fnrawp")) { + invisible(file.copy(fnrawp, file.path(raw_dir, basename(fnrawp)))) + invisible(file.remove(fnrawp)) + rm(fnrawp)} + +#Create Lists of Forward and Reverse Filenames and a List of Sample Names +fnFs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +allOrients <- function(primer) { + # Create all orientations of the input sequence + require(Biostrings) + # Biostrings works w/ DNAString objects rather than character vectors + dna <- DNAString(primer) + orients <- c(Forward = dna, + Complement = complement(dna), + Reverse = reverse(dna), + RevComp = reverseComplement(dna)) + return(sapply(orients, toString)) # Convert back to character vector +} + +FWD.orients <- allOrients(Fprimer) +REV.orients <- allOrients(Rprimer) + +# Put N-filtered files in filtN/ subdirectory +fnFs.filtN <- file.path(path, marker, "filtN", basename(fnFs)) +fnRs.filtN <- file.path(path, marker, "filtN", basename(fnRs)) + +filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE) + +fnFs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +primerHits <- function(primer, fn) { + # Counts number of reads in which the primer is found + nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE) + return(sum(nhits > 0)) +} + +primerHits_perSample <- function(sampnum, fwd = fnFs, rev = fnRs){ + hit_table <- rbind(FWD_primer_R1_Reads = sapply( + FWD.orients, primerHits, fn = fwd[[sampnum]]), + FWD_primer_R2_Reads = sapply(FWD.orients, primerHits, fn = rev[[sampnum]]), + REV_primer_R1_Reads = sapply(REV.orients, primerHits, fn = fwd[[sampnum]]), + REV_primer_R2_Reads = sapply(REV.orients, primerHits, fn = rev[[sampnum]])) + hit_table <- as.data.frame(hit_table) + hit_table <- tibble::rownames_to_column(hit_table, "Primer_Read") + return(hit_table) +} + +nSamp <- ifelse(nSamp == "all", length(fnFs.filtN), nSamp) +nSamp <- ifelse(nSamp > length(fnFs.filtN), length(fnFs.filtN), nSamp) +Samps <- c(1:nSamp) + +primers_before <- lapply(Samps, primerHits_perSample, + fwd = fnFs.filtN, rev = fnRs.filtN) +names(primers_before) <- paste(basename(fnFs.filtN[1:nSamp]), + basename(fnRs.filtN[1:nSamp])) + +primers_before <- dplyr::bind_rows(primers_before, .id = "Sample") %>% + as_tibble() +write.table(primers_before, + file.path(path, marker, paste0("primers_before_cutadapt_", + nSamp, "Samples_", format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) +primers_before %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## Improved cutadapt + +More flexible implementation of the cutadapt processing, allows changing orientation of fwd and rev primers if the chunk above indicated that the orientation for one or both primers is not as expected. Expectation: FWD primer matches R1 in Forward orientation and/or R2 in RevComp orientation, REV primer matches R2 in Forward orientation and/or R1 in RevComp orientation. All other scenarios will require adjusting the primer orientation for cutadapt! This chunk offers a possibility to simply flip the FWD or REV primers separately if necessary but more complex issues will have to be resolved manually. + +Both, the new and old implementation of cutadapt wil NOT run successfully on a Windows system, even though the rest of the pipeline *can*. + +```{r cutadapt-improved, eval=FALSE} +# CHANGE ME to the path of your cutadapt installation. This can be found out by +# running 'which cutadapt' in the bash command line +cutadapt = "/home/simeon/miniconda3/envs/bioinfo/bin/cutadapt" + +# CHANGE ME to flip the FWD primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_FWD = FALSE + +#CHANGE ME to flip the REV primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_REV = FALSE + +#CHANGE ME to define the minimum length of reads after trimming. Reads shorter +# than this value will be discarded and won't appear in the output files. The +# description of this parameter in the deprecated chunk was not correct! +MIN_LENGTH = 15 + +##SWR: define a variable for the number of mismatches to be used in cutadapt +#CHANGE ME to the rate of mismatches (0 =< value < 1) you want cutadapt to +# accept when finding the primer. When choosing a value different from 0, it is +# recommended to compare how many sequences get trimmed when using both, 0 and +# the alternative value. +MISMATCH = 0 + +#### NO CHANGES NEEDED BELOW #### + +# This should print the version of cutadapt you're using if the path is given +# correctly and system2 can access cutadapt. +# THIS WILL NOT WORK IN A WINDOWS SYSTEM! +system2(cutadapt, args = "--version") + +#Make output directory for trimmed reads in path/marker/trim +trim_dir <- file.path(path, marker, "trim") +dir.create(trim_dir) +fnFs_trim <- file.path(trim_dir, paste0(basename(fnFs.filtN), ".trim.fastq.gz")) +fnRs_trim <- file.path(trim_dir, paste0(basename(fnRs.filtN), ".trim.fastq.gz")) + +#Define primers and Revcomps, Flip if specified +F_fwd <- ifelse(flip_FWD, FWD.orients["RevComp"], FWD.orients["Forward"]) +F_rc <- ifelse(flip_FWD, FWD.orients["Forward"], FWD.orients["RevComp"]) +R_fwd <- ifelse(flip_REV, REV.orients["RevComp"], REV.orients["Forward"]) +R_rc <- ifelse(flip_REV, REV.orients["Forward"], REV.orients["RevComp"]) + +# Trim FWD and the reverse-complement of REV off of R1 (forward reads) +R1.flags <- paste("-g", F_fwd, "-a", R_rc) +# Trim REV and the reverse-complement of FWD off of R2 (reverse reads) +R2.flags <- paste("-G", R_fwd, "-A", F_rc) + +# Run Cutadapt +for (i in seq_along(fnFs.filtN)) { + # -n 2 required to remove FWD and REV from reads + system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, + "-m", MIN_LENGTH, "-e", MISMATCH, + # output files + "-o", fnFs_trim[i], "-p", fnRs_trim[i], + # input files + fnFs.filtN[i], fnRs.filtN[i]), + stdout = file.path(trim_dir, paste0(basename(fnFs.filtN[[i]]), + "_cutadapt_output.txt"))) +} + +nSamp <- ifelse(nSamp > length(fnFs_trim), length(fnFs_trim), nSamp) +Samps <- c(1:nSamp) + +primers_after <- lapply(Samps, primerHits_perSample, + fwd = fnFs_trim, rev = fnRs_trim) +names(primers_after) <- paste(basename(fnFs_trim[1:nSamp]), + basename(fnRs_trim[1:nSamp])) + +primers_after <- dplyr::bind_rows(primers_after, .id = "Sample") +write.table(primers_after, file.path(path, marker, paste0( + "primers_after_cutadapt_", nSamp, "Samples_", + format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) + +primers_after %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## assess run quality +looks at a random sample of R1/R2 trimmed reads to assess run quality. + +```{r, QC, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize=100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 +sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + +plts_trim <- lapply(c(sample(fnFL, sample_number), sample(fnRL, sample_number)), + plotQualityProfile) + +m_plts <- marrangeGrob(plts_trim, nrow = 2, ncol = 1) +ggsave(file.path(path, marker, "trim", "qual_plots_after_cutadapt.pdf"), + m_plts, width = 20, height = 26, unit = "cm", dpi = 300) + +print(plts_trim) +``` + +## set parameters for dada2 +This chunk contains several options for dada2 that will be used in the processing of the trimmed reads. The settings are saved to a file with a time stamp. + +```{r, set_params, eval=FALSE} +# CHANGE ME according to the quality of the sequencing run. For good quality +# runs, analysing both R1/R2 is desirable. For very poor quality runs, it may +# be worthwhile to analyse only R1. Select from the following: "R01" "both" +analysis = "both" + +# CHANGE ME use "TRUE" to assign taxonomy, "FALSE" to proceed without taxonomic +# assignments +assign_taxonomy = "FALSE" + +# CHANGE ME use "TRUE" to plot quality profiles for each sample, and "FALSE" to +# speed up analysis by skipping this step (plotting takes some time) and "SUB" +# to plot a subset of 10 samples +plotQC = "SUB" + +# CHANGE ME according to the quality of the sequencing run. This determines the +# maximum expected errors for R1 and R2 during the filtering step. A reasonably +# conservative threshold is (2,2). If the data is of lower quality, it may be +# worthwhile to run with higher EEs ex/ (3,5) +my_maxEEf = 2 +my_maxEEr = 2 + +# CHANGE ME according to the quality of the sequencing run. This is a PHRED score +# quality treshold - all sequences will be truncated at the first base with a +# quality score below this value +my_truncQ = 2 + +# CHANGE ME according to the quality of the sequencing run and according to the +# length of the target region. This is the length to cut the (forward,reverse) +# sequences at. Use 0,0 for no truncation. +my_truncLen = c(0, 0) + +# CHANGE ME to trim the left (5') part of all reads (forward, reverse) if a dip +# is observable in the quality plots +left_trim = c(10, 10) + +# CHANGE ME according to the marker used - this is the minimum length of the +# reads after trimming +my_minLen = 100 + +# CHANGE ME specify the minimum number of bases to overlap during merging, must +# be over 10! +my_minoverlap = 30 + +# CHANGE ME to specify the minimum confidence interval for RDP assignment of +# taxonomy. +my_minBootstrap = 80 + +# SWR: CHANGE ME to TRUE and give a value (number of bases) to collapse ASVs +# with an overlap of 'Collaps_minOverlap' (recommended to be the length of the +# shortest ASV that is identical to a longer one). This is only recommended to +# be used in cases where a number of ASVs occur that differ in length but are +# otherwise identical. If this frequently occurs, it may be an issue of primer +# trimming. +Collapse_overlapping = FALSE +Collapse_minOverlap = 200 + +parameters <- paste( + paste("Run on", format(Sys.time(), "%d/%m/%y %H:%M")), + paste("Path: ",path),"Samples:", + paste(c(sapply(strsplit(basename(sort( + list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE))), "_"), `[`, 1)), collapse = ";"), + paste("Analysis type:",analysis), + paste("Database type:",tax_database), + "\ndada2 Parameters\n", + paste("maximum expected errors R1:",my_maxEEf), + paste("maximum expected errors R2:",my_maxEEr), + paste("minimum length: ",my_minLen), + paste("truncate at first instance of Qscore: ",my_truncQ), + paste("truncation at length (0 = no trunc)", + paste(my_truncLen), collapse = ","), + paste("minimum overlap during merging: ",my_minoverlap), + paste("taxonomy bootstrap threshold: ",my_minBootstrap), + paste("Collapse overlapping ASVs: ", Collapse_overlapping), + paste("Minimum overlap if collapsing overlapping ASVs:", + Collapse_minOverlap), + paste("Trimming 5' (R1, R2):", + paste(left_trim, collapse = ",")), + sep = "\n") +write.table(parameters, file.path(path, marker, paste0("parameters_",analysis, + format(Sys.time(), "%d-%m-%y_%H%M") ,".txt")), + row.names = FALSE) + +``` + +## Processing chunk +#### DO NOT ALTER SCRIPT BEYOND THIS POINT!!! + +Comment SWR: This chunk now passes the output directory to a global variable on Linux, so that the shell script chunk that generates the match list can access it. + +```{r, dada2, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker,"trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize = 100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +#Extract sample names +sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1) + +# begin commands for analysing R01/R02 together + +if (analysis == "both") { + + #Create paths for the filtered files in a subdirectory called filtered/ + filtFs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_F_filt.fastq")) + filtRs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_R_filt.fastq")) + + outp <- file.path(path, marker, "R01_R02") + + #Filter sequences + ##SWR: fixed paired end pairing by adding matchIDs=TRUE + out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, + maxEE = c(my_maxEEf,my_maxEEr), truncQ = my_truncQ, + minLen = my_minLen, trimLeft = left_trim, truncLen = my_truncLen, + rm.phix = TRUE, compress = TRUE, multithread = TRUE, + matchIDs = TRUE) + head(out) + write.table(out, file.path(path, marker, "R01_R02", "out.txt")) + + if (plotQC == "TRUE") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL),plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(c(fnRL),plotQualityProfile) + print(plr) + + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl,nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(sample(c(fnRL), sample_number), plotQualityProfile) + print(plr) + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # dereplicate + derepFs <- derepFastq(filtFs[out[,2] > 0], verbose = TRUE) + derepRs <- derepFastq(filtRs[out[,2] > 0], verbose = TRUE) + + # Name the derep-class objects by the sample names + names(derepFs) <- sample.names[out[,2] > 0] + names(derepRs) <- sample.names[out[,2] > 0] + + #Train dada2 to your dataset + errF <- learnErrors(filtFs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + errR <- learnErrors(filtRs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + + ## Plot the Estimated Error Rates for the Transition Types + #check that model and data match reasonably well + eFplot <- plotErrors(errF, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + eRplot <- plotErrors(errR, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R2_error_profile.pdf"), + eRplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # sample inference + dadaFs <- dada(derepFs, err = errF, multithread = TRUE) + dadaRs <- dada(derepRs, err = errR, multithread = TRUE) + + # merge forward and reverse reads + mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, + minOverlap = my_minoverlap, verbose = TRUE) + + # Inspect the merger data.frame from the first sample + head(mergers[[1]]) + merge.tab <- data.frame(matrix(ncol = 10, nrow = 0)) + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", "prefer", + "accept") + for (i in (1:length(mergers))) { + if (nrow(mergers[[i]]) > 0) { + sub <- cbind(names(mergers)[i], mergers[[i]]) + merge.tab <- rbind(merge.tab,sub) + } else {} + } + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", + "prefer", "accept") + + write.table(merge.tab, file.path(path, marker, "R01_R02", "mergers.txt"), + row.names = FALSE) + + + # make sequence table and distribution table for sequence lengths + seqtab <- makeSequenceTable(mergers) + # use only sequences longer than 50 bp as the rdp classifier can't accept less + # than that + seqtab <- seqtab[,nchar(colnames(seqtab)) > 49] + + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtab <- collapseNoMismatch(seqtab, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + + dim(seqtab) + table(nchar(getSequences(seqtab))) + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtab, file.path(path, marker, "R01_R02", "seqtab.rds")) + + + # remove chimeric sequences + seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtab.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtab.nochim)/sum(seqtab) + write.table(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab.nochim.txt")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab_nochim.rds")) + + # output table for each sample + # define a function that counts sequences in a file/object + getN <- function(x) sum(getUniques(x)) + if (min(out[,2]) > 0) { + track <- cbind(out, sapply(dadaFs, getN), + sapply(dadaRs, getN), + sapply(mergers, getN), + rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1]*100) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- sample.names + } else { + track <- cbind(out[out[,2] > 0,], sapply(dadaFs, getN), sapply(dadaRs, getN), + sapply(mergers, getN), ... = rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1][out[,2] > 0]*100) + + # If processing a single sample, remove the sapply calls: + # e.g. replace sapply(dadaFs, getN) with getN(dadaFs) + track <- rbind(track, cbind(out[,1][out[,2] == 0], + out[,2][out[,2] == 0], + NA,NA,NA,NA,NA)) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- c(sample.names[out[,2] > 0], sample.names[out[,2] == 0]) + } + + head(track) + write.table(track, file.path(path, marker, "R01_R02", "track.txt")) + + if (assign_taxonomy == "TRUE") { + taxa <- assignTaxonomy(seqtab.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxa[[1]]) == colnames(seqtab.nochim)) == nrow(taxa[[1]]) + tmp <- cbind(rownames(taxa[[1]]),taxa[[1]],taxa[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + tmp <- data.frame(tmp) + + write.table(tmp, file.path(path, marker, + "R01_R02", "taxonomy_ASVs_NC_R1R2.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxa$tax, file.path(path, marker, "R01_R02", "taxa.rds")) + + tmp2 <- cbind(t(seqtab.nochim), tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_R02", "seqtab.nochim_withtax.txt")) + + # write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + + # write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtab.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_withtax.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV",1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + } +} else { + + + ###################################################################################### + # Forward Reads Only + ################################################################################### + + # Create paths for the filtered files in a subdirectory called filtered/ + + filtFsR1 <- file.path(path, marker, "R01_only", "filtered", + paste0(sample.names, "_F_filt.fastq")) + + outp <- file.path(path, marker, "R01_only") + + # Filter sequences + outR1 <- filterAndTrim(fnFs, filtFsR1, + maxEE = my_maxEEf, truncQ = my_truncQ, minLen = my_minLen, + trimLeft = left_trim[1], rm.phix = TRUE, + compress = TRUE, multithread = TRUE) + head(outR1) + write.table(outR1, file.path(path, marker, "R01_only", "out.txt")) + + if (plotQC == "TRUE") { + #Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # Train dada2 to your dataset + errFR1 <- learnErrors(filtFsR1[outR1[,2] > 0], multithread = TRUE) + eFplot <- plotErrors(errFR1, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_only", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # dereplicate + derepFsR1 <- derepFastq(filtFsR1[outR1[,2] > 0], verbose = TRUE) + # Name the derep-class objects by the sample names + names(derepFsR1) <- sample.names[outR1[,2] > 0] + + # sample inference + dadaFsR1 <- dada(derepFsR1, err = errFR1, multithread = TRUE) + + + # make sequence table and distribution table for sequence lengths + seqtabR1 <- makeSequenceTable(dadaFsR1) + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtabR1 <- collapseNoMismatch(seqtabR1, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + dim(seqtabR1) + table(nchar(getSequences(seqtabR1))) + # RDP can't assign taxonomy to sequences shorter than 50bp, so filter to > 50bp + seqtabR1 <- seqtabR1[,nchar(colnames(seqtabR1)) > 49] + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtabR1, file.path(path, marker, "R01_only", "seqtab.rds")) + + # remove chimeric sequences + seqtabR1.nochim <- removeBimeraDenovo(seqtabR1, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtabR1.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtabR1.nochim)/sum(seqtabR1) + write.table(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtabR1.nochim")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtab_nochim.rds")) + + # output table for each sample + getN <- function(x) sum(getUniques(x)) + if (min(outR1[,2]) > 0) { + trackR1 <- cbind(outR1, sapply(dadaFsR1,getN), rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1]*100) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names) + } else { + trackR1 <- cbind(outR1[outR1[,2] > 0,], sapply(dadaFsR1, getN), + rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1][outR1[,2] > 0]*100) + trackR1 <- rbind(trackR1,cbind(outR1[,1][outR1[,2] == 0], + outR1[,2][outR1[,2] == 0], + NA,NA,NA)) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names[!outR1[,2] == 0], + sample.names[outR1[,2] == 0]) + } + + head(trackR1) + write.table(trackR1, file.path(path, marker, "R01_only", "trackR1.txt")) + + if (assign_taxonomy == "TRUE") { + taxaR1 <- assignTaxonomy(seqtabR1.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxaR1[[1]]) == colnames(seqtabR1.nochim)) == nrow(taxaR1[[1]]) + tmp <- cbind(rownames(taxaR1[[1]]), taxaR1[[1]], taxaR1[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + tmp <- data.frame(tmp) + write.table(tmp, file.path(path, marker, + "R01_only", "taxonomy_ASVs_NC_R1.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxaR1$tax, file.path(path, marker, "R01_only", "taxa.rds")) + + tmp2 <- cbind(t(seqtabR1.nochim),tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_only", "seqtabR1.nochim_withtax.txt")) + + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + + #write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtabR1.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_only", "ASVs_withtax_R1.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + } +} + +#make paths available to subsequent chunks +dada2_dir <- outp +#If Linux, make 'dada2_folder'available globally. + if (os == "Linux") { + Sys.setenv(dada2Folder = dada2_dir) + } +``` diff --git a/analysis/raw_read_processing/FITS1_Dada2_pipeline_S-mod_v4-3.Rmd b/analysis/raw_read_processing/FITS1_Dada2_pipeline_S-mod_v4-3.Rmd new file mode 100644 index 0000000000000000000000000000000000000000..c7eb16269b0bd3dbc02a82792c7d7d6ae19fe22d --- /dev/null +++ b/analysis/raw_read_processing/FITS1_Dada2_pipeline_S-mod_v4-3.Rmd @@ -0,0 +1,870 @@ +--- +title: "dada2 microbiome pipeline v.4.2" +author: "Simeon Rossmann, Marie L. Davey" +date: "16.12.2019" +output: + pdf_document: default + html_document: default +last_modified: "Simeon Rossmann (SWR)" +date_modified: "20.11.2020" +version: 4.3 +--- + +This is an R markdown document for the analysis of MiSeq amplicon sequencing data. In order to customise the pipeline, search 'CHANGE ME' to find different settings you should adjust according to your data. All parameters requiring changes are located within the first 4 chunks. After the fourth chunk has been run, you can simply click 'Run'\>'Run all chunks below' in order to process the data + +Comment SWR: A new chunk was added to allow for quick reanalysis of taxonomy and pooled analysis of multiple samples after error inference. Small changes in chunk 4 (exporting seqtab.RDP in addition) were made to facilitate this. Therefore it's not recommended to 'Run all chunks below' from chunk 4 in this version. + +## Changelog + +### v4 update + +Comment SWR: Two more chunks were added to integrate [LULU](https://www.nature.com/articles/s41467-017-01312-x) post-processing into this pipeline. The first of the two new chunks is a shell script chunk that generates a match list (most similar ASVs for each ASV) using vsearch. The match list can also be generated by other means (f. eks. BLAST). The second of the new chunks is the core LULU chunk that takes the seqtab_nochim file from the dada2 analysis and the match list and performs the LULU post-processing with standard parameters. After this, a new seqtab_nochim, ASV-fasta and RDP taxonomy (optional) are generated in a lulu-subdirectory. + +#### v4.3 + +This update addresses inconsistencies in code usage between authors that were the likely cause of occasional warning messages and errors when creating files and directories in-function on some machines/setups. These problems occurred inconsistently and were not reproduced but seemed to be fixed by cleaner file path encoding. + +- minor code format fixes: + + - implementation of file path concatenation cleaned up by replacing all uses of `paste0()` for file paths with `file.path()` and removal of leading and closing slashes `/` + - replaced all instances of in-code object assignment using `=` with `<-` to clearly distinguish it from user assignable objects and argument assignment in functions. This means that objects assigned with `=` are intended for user input while objects assigned with `<-` are not to be directly manipulated by users. + +- other format fixes: + + - improved cosmetics when knitting by manually adjusted line breaks in code chunks + - added spaces surrounding `=` and following `,` to conform with best practices + - more consistent line spacing in code chunks + - Better text formatting in the Markdown text portions + +#### v4.2 + +- improved cutadapt with quality control and fwd primer removal, can be run through R in separate chunk +- filterAndTrim option expanded with trimLeft to remove the first n 5' bases regardless of content + +#### v4.1 + +- fixed an error when attempting to plot empty files +- fixed an error when attempting to sample more files than available for plotting + + +## Setup R Environment + +Comment SWR: If the OS is Linux, this chunk now passes the path and marker variables to the shell script chunk, making it unnecessary to specify this separately below and avoiding issues with misspelled variables. It is possible that this also works with Mac OS, but this was not included since no Mac machine was available for testing. + +```{r setup, eval=FALSE} +############set paths and marker name +# CHANGE ME to the directory (ABSOLUTE FILEPATH e.g. on Linux /home/usr/...) +# containing the fastq files after unzipping. +path = "/home/simeon/Documents/Marte_Metabarcoding/Run_August21" +#CHANGE ME to the marker you have sequenced +marker = c("FITS1") +# CHANGE ME to the path for the taxonomy database you will be using for +# identification (if any) +tax_database = "no/path/here" + +#set knitr chunk options +knitr::opts_chunk$set(echo = TRUE) +knitr::opts_chunk$set(eval = FALSE) +knitr::opts_chunk$set(tidy = TRUE) +knitr::opts_chunk$set(root.dir = file.path(path)) + +#load libraries for R session +library(dada2) +library(phyloseq) +library(ggplot2) +library(Biostrings) +library(grid) +library(gridExtra) +library(ShortRead) +library(tidyverse) + +# Get info on the operating system. If Linux, make 'path' and 'marker' available +# globally, if not, warn that the script needs additional attention. +os <- Sys.info()['sysname'] +if (os == "Linux") { + Sys.setenv(projectpath = path) + Sys.setenv(marker = marker) +}else{ + cat(os, "Please run this script on Linux, + it is no longer tested and optimized for Windows") +} +``` + +## Primer presence and orientation check + +This chunk is based on the tutorial for [ITS workflows from the DADA2 GitHub page](https://benjjneb.github.io/dada2/ITS_workflow.html) and offers an assessment of primer presence before cutadapt processing. This should be done to check the orientation used in cutadapt primer removal. Raw fastq.gz files must be located in 'path' or 'path/marker' or 'path/marker/raw_data'. If fastq.gz files are in 'path/marker' or 'path/marker/raw_data', all files will be used in the analysis, if fastq.gz files are in 'path', you may choose whether to only use files that contain the marker in the file name or all files. + +```{r primer-check, eval=FALSE} +#CHANGE ME to number of samples to check. Numerical values (e.g. 1, 5, 10) or +# "all" are accepted inputs. +nSamp = 10 + +#CHANGE ME to TRUE if you only want to move files from 'path' to +# 'path/marker/raw_data' if they contain the marker in the file name. +# E.g. marker = 16S, files = 'Sxxx_16S_L001_R1.fastq.gz'. If FALSE, all fastq.gz +# files will be moved from 'path' to 'path/marker/raw_data'. All fastq.gz files +# will be moved from 'path/marker' to 'path/marker/raw_data' regardless of this. +marker_in_filename = FALSE + +#CHANGE ME by uncommenting the appropriate primer pair +#for the Oomycete ITS1 marker: + #Fprimer = "CGGAAGGATCATTACCAC" + #Rprimer = "AGCCTAGACATCCACTGCTG" +#for the Oomycete ITS2 marker: + #Fprimer = "TTGAACGCAYATTGCACTT" + #Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Fungal ITS1 marker: + Fprimer = "AAGTCGTAACAAGGTTTCC" + Rprimer = "TGCVAGARCCAAGAGATC" +#for the Fungal ITS2 marker: + #Fprimer = "GTGARTCATCGAATCTTTG" + #Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Bacterial 16S marker: + #Fprimer = "GTGYCAGCMGCCGCGGTAA" + #Rprimer = "GGACTACNVGGGTWTCTAAT" +#for the plant Sper marker: + #rc_Fprimer = "GGGCAATCCTGAGCCAA" + #rc_Rprimer = "ATTGAGTCTCTGCACCTATC" +#for the plant Trac marker: + #Fprimer = "CRTTGCCGAGAGTC" + #Rprimer = "AAGTCGTAACAAGG" +#for the nematode Nem marker + #Fprimer = "ACGGTYAGAACTAGGG" + #Rprimer = "GAGGGCAAGTCTGGTG" + +#### NO CHANGES NEEDED BELOW #### +#Check if Marker folder exists, otherwise create it +dir.create(file.path(path, marker)) + +#Check if nSamp is numeric or "all", otherwise set it to "all" +nSamp <- ifelse(is.character(nSamp) | is.numeric(nSamp), nSamp, "all") +nSamp <- ifelse(nSamp == "all" | is.numeric(nSamp), nSamp, "all") + +# Move raw reads to their own subdirectory path/marker/raw_data. If files are in +# path, check whether some contain 'marker' in name. If they exist only move +# those, otherwise move all. +raw_dir <- file.path(path, marker, "raw_data") +dir.create(raw_dir) +if (length(list.files(file.path(path, marker), + pattern = ".fastq.gz", full.names = TRUE)) != 0) { + fnraw <- sort(list.files(file.path(path, marker), + pattern = "fastq.gz", full.names = TRUE)) +} + +if (length(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) != 0) { + if (marker_in_filename) { + fnrawp <- sort(list.files( + path, paste0(marker,"(.*).fastq.gz"), full.names = TRUE)) + }else{ + fnrawp <- sort(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) + } +} +if (exists("fnraw")) { + invisible(file.copy(fnraw, file.path(raw_dir, basename(fnraw)))) + invisible(file.remove(fnraw)) + rm(fnraw)} +if (exists("fnrawp")) { + invisible(file.copy(fnrawp, file.path(raw_dir, basename(fnrawp)))) + invisible(file.remove(fnrawp)) + rm(fnrawp)} + +#Create Lists of Forward and Reverse Filenames and a List of Sample Names +fnFs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +allOrients <- function(primer) { + # Create all orientations of the input sequence + require(Biostrings) + # Biostrings works w/ DNAString objects rather than character vectors + dna <- DNAString(primer) + orients <- c(Forward = dna, + Complement = complement(dna), + Reverse = reverse(dna), + RevComp = reverseComplement(dna)) + return(sapply(orients, toString)) # Convert back to character vector +} + +FWD.orients <- allOrients(Fprimer) +REV.orients <- allOrients(Rprimer) + +# Put N-filtered files in filtN/ subdirectory +fnFs.filtN <- file.path(path, marker, "filtN", basename(fnFs)) +fnRs.filtN <- file.path(path, marker, "filtN", basename(fnRs)) + +filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE) + +fnFs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +primerHits <- function(primer, fn) { + # Counts number of reads in which the primer is found + nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE) + return(sum(nhits > 0)) +} + +primerHits_perSample <- function(sampnum, fwd = fnFs, rev = fnRs){ + hit_table <- rbind(FWD_primer_R1_Reads = sapply( + FWD.orients, primerHits, fn = fwd[[sampnum]]), + FWD_primer_R2_Reads = sapply(FWD.orients, primerHits, fn = rev[[sampnum]]), + REV_primer_R1_Reads = sapply(REV.orients, primerHits, fn = fwd[[sampnum]]), + REV_primer_R2_Reads = sapply(REV.orients, primerHits, fn = rev[[sampnum]])) + hit_table <- as.data.frame(hit_table) + hit_table <- tibble::rownames_to_column(hit_table, "Primer_Read") + return(hit_table) +} + +nSamp <- ifelse(nSamp == "all", length(fnFs.filtN), nSamp) +nSamp <- ifelse(nSamp > length(fnFs.filtN), length(fnFs.filtN), nSamp) +Samps <- c(1:nSamp) + +primers_before <- lapply(Samps, primerHits_perSample, + fwd = fnFs.filtN, rev = fnRs.filtN) +names(primers_before) <- paste(basename(fnFs.filtN[1:nSamp]), + basename(fnRs.filtN[1:nSamp])) + +primers_before <- dplyr::bind_rows(primers_before, .id = "Sample") %>% + as_tibble() +write.table(primers_before, + file.path(path, marker, paste0("primers_before_cutadapt_", + nSamp, "Samples_", format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) +primers_before %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## Improved cutadapt + +More flexible implementation of the cutadapt processing, allows changing orientation of fwd and rev primers if the chunk above indicated that the orientation for one or both primers is not as expected. Expectation: FWD primer matches R1 in Forward orientation and/or R2 in RevComp orientation, REV primer matches R2 in Forward orientation and/or R1 in RevComp orientation. All other scenarios will require adjusting the primer orientation for cutadapt! This chunk offers a possibility to simply flip the FWD or REV primers separately if necessary but more complex issues will have to be resolved manually. + +Both, the new and old implementation of cutadapt wil NOT run successfully on a Windows system, even though the rest of the pipeline *can*. + +```{r cutadapt-improved, eval=FALSE} +# CHANGE ME to the path of your cutadapt installation. This can be found out by +# running 'which cutadapt' in the bash command line +cutadapt = "/home/simeon/miniconda3/envs/bioinfo/bin/cutadapt" + +# CHANGE ME to flip the FWD primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_FWD = FALSE + +#CHANGE ME to flip the REV primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_REV = FALSE + +#CHANGE ME to define the minimum length of reads after trimming. Reads shorter +# than this value will be discarded and won't appear in the output files. The +# description of this parameter in the deprecated chunk was not correct! +MIN_LENGTH = 15 + +##SWR: define a variable for the number of mismatches to be used in cutadapt +#CHANGE ME to the rate of mismatches (0 =< value < 1) you want cutadapt to +# accept when finding the primer. When choosing a value different from 0, it is +# recommended to compare how many sequences get trimmed when using both, 0 and +# the alternative value. +MISMATCH = 0 + +#### NO CHANGES NEEDED BELOW #### + +# This should print the version of cutadapt you're using if the path is given +# correctly and system2 can access cutadapt. +# THIS WILL NOT WORK IN A WINDOWS SYSTEM! +system2(cutadapt, args = "--version") + +#Make output directory for trimmed reads in path/marker/trim +trim_dir <- file.path(path, marker, "trim") +dir.create(trim_dir) +fnFs_trim <- file.path(trim_dir, paste0(basename(fnFs.filtN), ".trim.fastq.gz")) +fnRs_trim <- file.path(trim_dir, paste0(basename(fnRs.filtN), ".trim.fastq.gz")) + +#Define primers and Revcomps, Flip if specified +F_fwd <- ifelse(flip_FWD, FWD.orients["RevComp"], FWD.orients["Forward"]) +F_rc <- ifelse(flip_FWD, FWD.orients["Forward"], FWD.orients["RevComp"]) +R_fwd <- ifelse(flip_REV, REV.orients["RevComp"], REV.orients["Forward"]) +R_rc <- ifelse(flip_REV, REV.orients["Forward"], REV.orients["RevComp"]) + +# Trim FWD and the reverse-complement of REV off of R1 (forward reads) +R1.flags <- paste("-g", F_fwd, "-a", R_rc) +# Trim REV and the reverse-complement of FWD off of R2 (reverse reads) +R2.flags <- paste("-G", R_fwd, "-A", F_rc) + +# Run Cutadapt +for (i in seq_along(fnFs.filtN)) { + # -n 2 required to remove FWD and REV from reads + system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, + "-m", MIN_LENGTH, "-e", MISMATCH, + # output files + "-o", fnFs_trim[i], "-p", fnRs_trim[i], + # input files + fnFs.filtN[i], fnRs.filtN[i]), + stdout = file.path(trim_dir, paste0(basename(fnFs.filtN[[i]]), + "_cutadapt_output.txt"))) +} + +nSamp <- ifelse(nSamp > length(fnFs_trim), length(fnFs_trim), nSamp) +Samps <- c(1:nSamp) + +primers_after <- lapply(Samps, primerHits_perSample, + fwd = fnFs_trim, rev = fnRs_trim) +names(primers_after) <- paste(basename(fnFs_trim[1:nSamp]), + basename(fnRs_trim[1:nSamp])) + +primers_after <- dplyr::bind_rows(primers_after, .id = "Sample") +write.table(primers_after, file.path(path, marker, paste0( + "primers_after_cutadapt_", nSamp, "Samples_", + format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) + +primers_after %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## assess run quality +looks at a random sample of R1/R2 trimmed reads to assess run quality. + +```{r, QC, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize=100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 +sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + +plts_trim <- lapply(c(sample(fnFL, sample_number), sample(fnRL, sample_number)), + plotQualityProfile) + +m_plts <- marrangeGrob(plts_trim, nrow = 2, ncol = 1) +ggsave(file.path(path, marker, "trim", "qual_plots_after_cutadapt.pdf"), + m_plts, width = 20, height = 26, unit = "cm", dpi = 300) + +print(plts_trim) +``` + +## set parameters for dada2 +This chunk contains several options for dada2 that will be used in the processing of the trimmed reads. The settings are saved to a file with a time stamp. + +```{r, set_params, eval=FALSE} +# CHANGE ME according to the quality of the sequencing run. For good quality +# runs, analysing both R1/R2 is desirable. For very poor quality runs, it may +# be worthwhile to analyse only R1. Select from the following: "R01" "both" +analysis = "both" + +# CHANGE ME use "TRUE" to assign taxonomy, "FALSE" to proceed without taxonomic +# assignments +assign_taxonomy = "FALSE" + +# CHANGE ME use "TRUE" to plot quality profiles for each sample, and "FALSE" to +# speed up analysis by skipping this step (plotting takes some time) and "SUB" +# to plot a subset of 10 samples +plotQC = "SUB" + +# CHANGE ME according to the quality of the sequencing run. This determines the +# maximum expected errors for R1 and R2 during the filtering step. A reasonably +# conservative threshold is (2,2). If the data is of lower quality, it may be +# worthwhile to run with higher EEs ex/ (3,5) +my_maxEEf = 2 +my_maxEEr = 2 + +# CHANGE ME according to the quality of the sequencing run. This is a PHRED score +# quality treshold - all sequences will be truncated at the first base with a +# quality score below this value +my_truncQ = 2 + +# CHANGE ME according to the quality of the sequencing run and according to the +# length of the target region. This is the length to cut the (forward,reverse) +# sequences at. Use 0,0 for no truncation. +my_truncLen = c(0, 0) + +# CHANGE ME to trim the left (5') part of all reads (forward, reverse) if a dip +# is observable in the quality plots +left_trim = c(10, 10) + +# CHANGE ME according to the marker used - this is the minimum length of the +# reads after trimming +my_minLen = 100 + +# CHANGE ME specify the minimum number of bases to overlap during merging, must +# be over 10! +my_minoverlap = 30 + +# CHANGE ME to specify the minimum confidence interval for RDP assignment of +# taxonomy. +my_minBootstrap = 80 + +# SWR: CHANGE ME to TRUE and give a value (number of bases) to collapse ASVs +# with an overlap of 'Collaps_minOverlap' (recommended to be the length of the +# shortest ASV that is identical to a longer one). This is only recommended to +# be used in cases where a number of ASVs occur that differ in length but are +# otherwise identical. If this frequently occurs, it may be an issue of primer +# trimming. +Collapse_overlapping = FALSE +Collapse_minOverlap = 200 + +parameters <- paste( + paste("Run on", format(Sys.time(), "%d/%m/%y %H:%M")), + paste("Path: ",path),"Samples:", + paste(c(sapply(strsplit(basename(sort( + list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE))), "_"), `[`, 1)), collapse = ";"), + paste("Analysis type:",analysis), + paste("Database type:",tax_database), + "\ndada2 Parameters\n", + paste("maximum expected errors R1:",my_maxEEf), + paste("maximum expected errors R2:",my_maxEEr), + paste("minimum length: ",my_minLen), + paste("truncate at first instance of Qscore: ",my_truncQ), + paste("truncation at length (0 = no trunc)", + paste(my_truncLen), collapse = ","), + paste("minimum overlap during merging: ",my_minoverlap), + paste("taxonomy bootstrap threshold: ",my_minBootstrap), + paste("Collapse overlapping ASVs: ", Collapse_overlapping), + paste("Minimum overlap if collapsing overlapping ASVs:", + Collapse_minOverlap), + paste("Trimming 5' (R1, R2):", + paste(left_trim, collapse = ",")), + sep = "\n") +write.table(parameters, file.path(path, marker, paste0("parameters_",analysis, + format(Sys.time(), "%d-%m-%y_%H%M") ,".txt")), + row.names = FALSE) + +``` + +## Processing chunk +#### DO NOT ALTER SCRIPT BEYOND THIS POINT!!! + +Comment SWR: This chunk now passes the output directory to a global variable on Linux, so that the shell script chunk that generates the match list can access it. + +```{r, dada2, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker,"trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize = 100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +#Extract sample names +sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1) + +# begin commands for analysing R01/R02 together + +if (analysis == "both") { + + #Create paths for the filtered files in a subdirectory called filtered/ + filtFs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_F_filt.fastq")) + filtRs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_R_filt.fastq")) + + outp <- file.path(path, marker, "R01_R02") + + #Filter sequences + ##SWR: fixed paired end pairing by adding matchIDs=TRUE + out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, + maxEE = c(my_maxEEf,my_maxEEr), truncQ = my_truncQ, + minLen = my_minLen, trimLeft = left_trim, truncLen = my_truncLen, + rm.phix = TRUE, compress = TRUE, multithread = TRUE, + matchIDs = TRUE) + head(out) + write.table(out, file.path(path, marker, "R01_R02", "out.txt")) + + if (plotQC == "TRUE") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL),plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(c(fnRL),plotQualityProfile) + print(plr) + + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl,nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(sample(c(fnRL), sample_number), plotQualityProfile) + print(plr) + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # dereplicate + derepFs <- derepFastq(filtFs[out[,2] > 0], verbose = TRUE) + derepRs <- derepFastq(filtRs[out[,2] > 0], verbose = TRUE) + + # Name the derep-class objects by the sample names + names(derepFs) <- sample.names[out[,2] > 0] + names(derepRs) <- sample.names[out[,2] > 0] + + #Train dada2 to your dataset + errF <- learnErrors(filtFs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + errR <- learnErrors(filtRs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + + ## Plot the Estimated Error Rates for the Transition Types + #check that model and data match reasonably well + eFplot <- plotErrors(errF, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + eRplot <- plotErrors(errR, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R2_error_profile.pdf"), + eRplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # sample inference + dadaFs <- dada(derepFs, err = errF, multithread = TRUE) + dadaRs <- dada(derepRs, err = errR, multithread = TRUE) + + # merge forward and reverse reads + mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, + minOverlap = my_minoverlap, verbose = TRUE) + + # Inspect the merger data.frame from the first sample + head(mergers[[1]]) + merge.tab <- data.frame(matrix(ncol = 10, nrow = 0)) + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", "prefer", + "accept") + for (i in (1:length(mergers))) { + if (nrow(mergers[[i]]) > 0) { + sub <- cbind(names(mergers)[i], mergers[[i]]) + merge.tab <- rbind(merge.tab,sub) + } else {} + } + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", + "prefer", "accept") + + write.table(merge.tab, file.path(path, marker, "R01_R02", "mergers.txt"), + row.names = FALSE) + + + # make sequence table and distribution table for sequence lengths + seqtab <- makeSequenceTable(mergers) + # use only sequences longer than 50 bp as the rdp classifier can't accept less + # than that + seqtab <- seqtab[,nchar(colnames(seqtab)) > 49] + + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtab <- collapseNoMismatch(seqtab, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + + dim(seqtab) + table(nchar(getSequences(seqtab))) + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtab, file.path(path, marker, "R01_R02", "seqtab.rds")) + + + # remove chimeric sequences + seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtab.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtab.nochim)/sum(seqtab) + write.table(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab.nochim.txt")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab_nochim.rds")) + + # output table for each sample + # define a function that counts sequences in a file/object + getN <- function(x) sum(getUniques(x)) + if (min(out[,2]) > 0) { + track <- cbind(out, sapply(dadaFs, getN), + sapply(dadaRs, getN), + sapply(mergers, getN), + rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1]*100) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- sample.names + } else { + track <- cbind(out[out[,2] > 0,], sapply(dadaFs, getN), sapply(dadaRs, getN), + sapply(mergers, getN), ... = rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1][out[,2] > 0]*100) + + # If processing a single sample, remove the sapply calls: + # e.g. replace sapply(dadaFs, getN) with getN(dadaFs) + track <- rbind(track, cbind(out[,1][out[,2] == 0], + out[,2][out[,2] == 0], + NA,NA,NA,NA,NA)) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- c(sample.names[out[,2] > 0], sample.names[out[,2] == 0]) + } + + head(track) + write.table(track, file.path(path, marker, "R01_R02", "track.txt")) + + if (assign_taxonomy == "TRUE") { + taxa <- assignTaxonomy(seqtab.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxa[[1]]) == colnames(seqtab.nochim)) == nrow(taxa[[1]]) + tmp <- cbind(rownames(taxa[[1]]),taxa[[1]],taxa[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + tmp <- data.frame(tmp) + + write.table(tmp, file.path(path, marker, + "R01_R02", "taxonomy_ASVs_NC_R1R2.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxa$tax, file.path(path, marker, "R01_R02", "taxa.rds")) + + tmp2 <- cbind(t(seqtab.nochim), tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_R02", "seqtab.nochim_withtax.txt")) + + # write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + + # write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtab.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_withtax.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV",1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + } +} else { + + + ###################################################################################### + # Forward Reads Only + ################################################################################### + + # Create paths for the filtered files in a subdirectory called filtered/ + + filtFsR1 <- file.path(path, marker, "R01_only", "filtered", + paste0(sample.names, "_F_filt.fastq")) + + outp <- file.path(path, marker, "R01_only") + + # Filter sequences + outR1 <- filterAndTrim(fnFs, filtFsR1, + maxEE = my_maxEEf, truncQ = my_truncQ, minLen = my_minLen, + trimLeft = left_trim[1], rm.phix = TRUE, + compress = TRUE, multithread = TRUE) + head(outR1) + write.table(outR1, file.path(path, marker, "R01_only", "out.txt")) + + if (plotQC == "TRUE") { + #Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # Train dada2 to your dataset + errFR1 <- learnErrors(filtFsR1[outR1[,2] > 0], multithread = TRUE) + eFplot <- plotErrors(errFR1, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_only", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # dereplicate + derepFsR1 <- derepFastq(filtFsR1[outR1[,2] > 0], verbose = TRUE) + # Name the derep-class objects by the sample names + names(derepFsR1) <- sample.names[outR1[,2] > 0] + + # sample inference + dadaFsR1 <- dada(derepFsR1, err = errFR1, multithread = TRUE) + + + # make sequence table and distribution table for sequence lengths + seqtabR1 <- makeSequenceTable(dadaFsR1) + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtabR1 <- collapseNoMismatch(seqtabR1, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + dim(seqtabR1) + table(nchar(getSequences(seqtabR1))) + # RDP can't assign taxonomy to sequences shorter than 50bp, so filter to > 50bp + seqtabR1 <- seqtabR1[,nchar(colnames(seqtabR1)) > 49] + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtabR1, file.path(path, marker, "R01_only", "seqtab.rds")) + + # remove chimeric sequences + seqtabR1.nochim <- removeBimeraDenovo(seqtabR1, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtabR1.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtabR1.nochim)/sum(seqtabR1) + write.table(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtabR1.nochim")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtab_nochim.rds")) + + # output table for each sample + getN <- function(x) sum(getUniques(x)) + if (min(outR1[,2]) > 0) { + trackR1 <- cbind(outR1, sapply(dadaFsR1,getN), rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1]*100) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names) + } else { + trackR1 <- cbind(outR1[outR1[,2] > 0,], sapply(dadaFsR1, getN), + rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1][outR1[,2] > 0]*100) + trackR1 <- rbind(trackR1,cbind(outR1[,1][outR1[,2] == 0], + outR1[,2][outR1[,2] == 0], + NA,NA,NA)) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names[!outR1[,2] == 0], + sample.names[outR1[,2] == 0]) + } + + head(trackR1) + write.table(trackR1, file.path(path, marker, "R01_only", "trackR1.txt")) + + if (assign_taxonomy == "TRUE") { + taxaR1 <- assignTaxonomy(seqtabR1.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxaR1[[1]]) == colnames(seqtabR1.nochim)) == nrow(taxaR1[[1]]) + tmp <- cbind(rownames(taxaR1[[1]]), taxaR1[[1]], taxaR1[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + tmp <- data.frame(tmp) + write.table(tmp, file.path(path, marker, + "R01_only", "taxonomy_ASVs_NC_R1.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxaR1$tax, file.path(path, marker, "R01_only", "taxa.rds")) + + tmp2 <- cbind(t(seqtabR1.nochim),tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_only", "seqtabR1.nochim_withtax.txt")) + + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + + #write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtabR1.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_only", "ASVs_withtax_R1.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + } +} + +#make paths available to subsequent chunks +dada2_dir <- outp +#If Linux, make 'dada2_folder'available globally. + if (os == "Linux") { + Sys.setenv(dada2Folder = dada2_dir) + } +``` + diff --git a/analysis/raw_read_processing/FITS2_Dada2_pipeline_S-mod_v4-3.Rmd b/analysis/raw_read_processing/FITS2_Dada2_pipeline_S-mod_v4-3.Rmd new file mode 100644 index 0000000000000000000000000000000000000000..5db1b3854fb14cc160570ec31d63dae74d66e472 --- /dev/null +++ b/analysis/raw_read_processing/FITS2_Dada2_pipeline_S-mod_v4-3.Rmd @@ -0,0 +1,869 @@ +--- +title: "dada2 microbiome pipeline v.4.2" +author: "Simeon Rossmann, Marie L. Davey" +date: "16.12.2019" +output: + pdf_document: default + html_document: default +last_modified: "Simeon Rossmann (SWR)" +date_modified: "20.11.2020" +version: 4.3 +--- + +This is an R markdown document for the analysis of MiSeq amplicon sequencing data. In order to customise the pipeline, search 'CHANGE ME' to find different settings you should adjust according to your data. All parameters requiring changes are located within the first 4 chunks. After the fourth chunk has been run, you can simply click 'Run'\>'Run all chunks below' in order to process the data + +Comment SWR: A new chunk was added to allow for quick reanalysis of taxonomy and pooled analysis of multiple samples after error inference. Small changes in chunk 4 (exporting seqtab.RDP in addition) were made to facilitate this. Therefore it's not recommended to 'Run all chunks below' from chunk 4 in this version. + +## Changelog + +### v4 update + +Comment SWR: Two more chunks were added to integrate [LULU](https://www.nature.com/articles/s41467-017-01312-x) post-processing into this pipeline. The first of the two new chunks is a shell script chunk that generates a match list (most similar ASVs for each ASV) using vsearch. The match list can also be generated by other means (f. eks. BLAST). The second of the new chunks is the core LULU chunk that takes the seqtab_nochim file from the dada2 analysis and the match list and performs the LULU post-processing with standard parameters. After this, a new seqtab_nochim, ASV-fasta and RDP taxonomy (optional) are generated in a lulu-subdirectory. + +#### v4.3 + +This update addresses inconsistencies in code usage between authors that were the likely cause of occasional warning messages and errors when creating files and directories in-function on some machines/setups. These problems occurred inconsistently and were not reproduced but seemed to be fixed by cleaner file path encoding. + +- minor code format fixes: + + - implementation of file path concatenation cleaned up by replacing all uses of `paste0()` for file paths with `file.path()` and removal of leading and closing slashes `/` + - replaced all instances of in-code object assignment using `=` with `<-` to clearly distinguish it from user assignable objects and argument assignment in functions. This means that objects assigned with `=` are intended for user input while objects assigned with `<-` are not to be directly manipulated by users. + +- other format fixes: + + - improved cosmetics when knitting by manually adjusted line breaks in code chunks + - added spaces surrounding `=` and following `,` to conform with best practices + - more consistent line spacing in code chunks + - Better text formatting in the Markdown text portions + +#### v4.2 + +- improved cutadapt with quality control and fwd primer removal, can be run through R in separate chunk +- filterAndTrim option expanded with trimLeft to remove the first n 5' bases regardless of content + +#### v4.1 + +- fixed an error when attempting to plot empty files +- fixed an error when attempting to sample more files than available for plotting + + +## Setup R Environment + +Comment SWR: If the OS is Linux, this chunk now passes the path and marker variables to the shell script chunk, making it unnecessary to specify this separately below and avoiding issues with misspelled variables. It is possible that this also works with Mac OS, but this was not included since no Mac machine was available for testing. + +```{r setup, eval=FALSE} +############set paths and marker name +# CHANGE ME to the directory (ABSOLUTE FILEPATH e.g. on Linux /home/usr/...) +# containing the fastq files after unzipping. +path = "/home/simeon/Documents/Marte_Metabarcoding/Run_August21" +#CHANGE ME to the marker you have sequenced +marker = c("FITS2") +# CHANGE ME to the path for the taxonomy database you will be using for +# identification (if any) +tax_database = "no/path/here" + +#set knitr chunk options +knitr::opts_chunk$set(echo = TRUE) +knitr::opts_chunk$set(eval = FALSE) +knitr::opts_chunk$set(tidy = TRUE) +knitr::opts_chunk$set(root.dir = file.path(path)) + +#load libraries for R session +library(dada2) +library(phyloseq) +library(ggplot2) +library(Biostrings) +library(grid) +library(gridExtra) +library(ShortRead) +library(tidyverse) + +# Get info on the operating system. If Linux, make 'path' and 'marker' available +# globally, if not, warn that the script needs additional attention. +os <- Sys.info()['sysname'] +if (os == "Linux") { + Sys.setenv(projectpath = path) + Sys.setenv(marker = marker) +}else{ + cat(os, "Please run this script on Linux, + it is no longer tested and optimized for Windows") +} +``` + +## Primer presence and orientation check + +This chunk is based on the tutorial for [ITS workflows from the DADA2 GitHub page](https://benjjneb.github.io/dada2/ITS_workflow.html) and offers an assessment of primer presence before cutadapt processing. This should be done to check the orientation used in cutadapt primer removal. Raw fastq.gz files must be located in 'path' or 'path/marker' or 'path/marker/raw_data'. If fastq.gz files are in 'path/marker' or 'path/marker/raw_data', all files will be used in the analysis, if fastq.gz files are in 'path', you may choose whether to only use files that contain the marker in the file name or all files. + +```{r primer-check, eval=FALSE} +#CHANGE ME to number of samples to check. Numerical values (e.g. 1, 5, 10) or +# "all" are accepted inputs. +nSamp = 10 + +#CHANGE ME to TRUE if you only want to move files from 'path' to +# 'path/marker/raw_data' if they contain the marker in the file name. +# E.g. marker = 16S, files = 'Sxxx_16S_L001_R1.fastq.gz'. If FALSE, all fastq.gz +# files will be moved from 'path' to 'path/marker/raw_data'. All fastq.gz files +# will be moved from 'path/marker' to 'path/marker/raw_data' regardless of this. +marker_in_filename = FALSE + +#CHANGE ME by uncommenting the appropriate primer pair +#for the Oomycete ITS1 marker: + #Fprimer = "CGGAAGGATCATTACCAC" + #Rprimer = "AGCCTAGACATCCACTGCTG" +#for the Oomycete ITS2 marker: + #Fprimer = "TTGAACGCAYATTGCACTT" + #Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Fungal ITS1 marker: + #Fprimer = "AAGTCGTAACAAGGTTTCC" + #Rprimer = "TGCVAGARCCAAGAGATC" +#for the Fungal ITS2 marker: + Fprimer = "GTGARTCATCGAATCTTTG" + Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Bacterial 16S marker: + #Fprimer = "GTGYCAGCMGCCGCGGTAA" + #Rprimer = "GGACTACNVGGGTWTCTAAT" +#for the plant Sper marker: + #rc_Fprimer = "GGGCAATCCTGAGCCAA" + #rc_Rprimer = "ATTGAGTCTCTGCACCTATC" +#for the plant Trac marker: + #Fprimer = "CRTTGCCGAGAGTC" + #Rprimer = "AAGTCGTAACAAGG" +#for the nematode Nem marker + #Fprimer = "ACGGTYAGAACTAGGG" + #Rprimer = "GAGGGCAAGTCTGGTG" + +#### NO CHANGES NEEDED BELOW #### +#Check if Marker folder exists, otherwise create it +dir.create(file.path(path, marker)) + +#Check if nSamp is numeric or "all", otherwise set it to "all" +nSamp <- ifelse(is.character(nSamp) | is.numeric(nSamp), nSamp, "all") +nSamp <- ifelse(nSamp == "all" | is.numeric(nSamp), nSamp, "all") + +# Move raw reads to their own subdirectory path/marker/raw_data. If files are in +# path, check whether some contain 'marker' in name. If they exist only move +# those, otherwise move all. +raw_dir <- file.path(path, marker, "raw_data") +dir.create(raw_dir) +if (length(list.files(file.path(path, marker), + pattern = ".fastq.gz", full.names = TRUE)) != 0) { + fnraw <- sort(list.files(file.path(path, marker), + pattern = "fastq.gz", full.names = TRUE)) +} + +if (length(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) != 0) { + if (marker_in_filename) { + fnrawp <- sort(list.files( + path, paste0(marker,"(.*).fastq.gz"), full.names = TRUE)) + }else{ + fnrawp <- sort(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) + } +} +if (exists("fnraw")) { + invisible(file.copy(fnraw, file.path(raw_dir, basename(fnraw)))) + invisible(file.remove(fnraw)) + rm(fnraw)} +if (exists("fnrawp")) { + invisible(file.copy(fnrawp, file.path(raw_dir, basename(fnrawp)))) + invisible(file.remove(fnrawp)) + rm(fnrawp)} + +#Create Lists of Forward and Reverse Filenames and a List of Sample Names +fnFs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +allOrients <- function(primer) { + # Create all orientations of the input sequence + require(Biostrings) + # Biostrings works w/ DNAString objects rather than character vectors + dna <- DNAString(primer) + orients <- c(Forward = dna, + Complement = complement(dna), + Reverse = reverse(dna), + RevComp = reverseComplement(dna)) + return(sapply(orients, toString)) # Convert back to character vector +} + +FWD.orients <- allOrients(Fprimer) +REV.orients <- allOrients(Rprimer) + +# Put N-filtered files in filtN/ subdirectory +fnFs.filtN <- file.path(path, marker, "filtN", basename(fnFs)) +fnRs.filtN <- file.path(path, marker, "filtN", basename(fnRs)) + +filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE) + +fnFs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +primerHits <- function(primer, fn) { + # Counts number of reads in which the primer is found + nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE) + return(sum(nhits > 0)) +} + +primerHits_perSample <- function(sampnum, fwd = fnFs, rev = fnRs){ + hit_table <- rbind(FWD_primer_R1_Reads = sapply( + FWD.orients, primerHits, fn = fwd[[sampnum]]), + FWD_primer_R2_Reads = sapply(FWD.orients, primerHits, fn = rev[[sampnum]]), + REV_primer_R1_Reads = sapply(REV.orients, primerHits, fn = fwd[[sampnum]]), + REV_primer_R2_Reads = sapply(REV.orients, primerHits, fn = rev[[sampnum]])) + hit_table <- as.data.frame(hit_table) + hit_table <- tibble::rownames_to_column(hit_table, "Primer_Read") + return(hit_table) +} + +nSamp <- ifelse(nSamp == "all", length(fnFs.filtN), nSamp) +nSamp <- ifelse(nSamp > length(fnFs.filtN), length(fnFs.filtN), nSamp) +Samps <- c(1:nSamp) + +primers_before <- lapply(Samps, primerHits_perSample, + fwd = fnFs.filtN, rev = fnRs.filtN) +names(primers_before) <- paste(basename(fnFs.filtN[1:nSamp]), + basename(fnRs.filtN[1:nSamp])) + +primers_before <- dplyr::bind_rows(primers_before, .id = "Sample") %>% + as_tibble() +write.table(primers_before, + file.path(path, marker, paste0("primers_before_cutadapt_", + nSamp, "Samples_", format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) +primers_before %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## Improved cutadapt + +More flexible implementation of the cutadapt processing, allows changing orientation of fwd and rev primers if the chunk above indicated that the orientation for one or both primers is not as expected. Expectation: FWD primer matches R1 in Forward orientation and/or R2 in RevComp orientation, REV primer matches R2 in Forward orientation and/or R1 in RevComp orientation. All other scenarios will require adjusting the primer orientation for cutadapt! This chunk offers a possibility to simply flip the FWD or REV primers separately if necessary but more complex issues will have to be resolved manually. + +Both, the new and old implementation of cutadapt wil NOT run successfully on a Windows system, even though the rest of the pipeline *can*. + +```{r cutadapt-improved, eval=FALSE} +# CHANGE ME to the path of your cutadapt installation. This can be found out by +# running 'which cutadapt' in the bash command line +cutadapt = "/home/simeon/miniconda3/envs/bioinfo/bin/cutadapt" + +# CHANGE ME to flip the FWD primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_FWD = FALSE + +#CHANGE ME to flip the REV primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_REV = FALSE + +#CHANGE ME to define the minimum length of reads after trimming. Reads shorter +# than this value will be discarded and won't appear in the output files. The +# description of this parameter in the deprecated chunk was not correct! +MIN_LENGTH = 15 + +##SWR: define a variable for the number of mismatches to be used in cutadapt +#CHANGE ME to the rate of mismatches (0 =< value < 1) you want cutadapt to +# accept when finding the primer. When choosing a value different from 0, it is +# recommended to compare how many sequences get trimmed when using both, 0 and +# the alternative value. +MISMATCH = 0 + +#### NO CHANGES NEEDED BELOW #### + +# This should print the version of cutadapt you're using if the path is given +# correctly and system2 can access cutadapt. +# THIS WILL NOT WORK IN A WINDOWS SYSTEM! +system2(cutadapt, args = "--version") + +#Make output directory for trimmed reads in path/marker/trim +trim_dir <- file.path(path, marker, "trim") +dir.create(trim_dir) +fnFs_trim <- file.path(trim_dir, paste0(basename(fnFs.filtN), ".trim.fastq.gz")) +fnRs_trim <- file.path(trim_dir, paste0(basename(fnRs.filtN), ".trim.fastq.gz")) + +#Define primers and Revcomps, Flip if specified +F_fwd <- ifelse(flip_FWD, FWD.orients["RevComp"], FWD.orients["Forward"]) +F_rc <- ifelse(flip_FWD, FWD.orients["Forward"], FWD.orients["RevComp"]) +R_fwd <- ifelse(flip_REV, REV.orients["RevComp"], REV.orients["Forward"]) +R_rc <- ifelse(flip_REV, REV.orients["Forward"], REV.orients["RevComp"]) + +# Trim FWD and the reverse-complement of REV off of R1 (forward reads) +R1.flags <- paste("-g", F_fwd, "-a", R_rc) +# Trim REV and the reverse-complement of FWD off of R2 (reverse reads) +R2.flags <- paste("-G", R_fwd, "-A", F_rc) + +# Run Cutadapt +for (i in seq_along(fnFs.filtN)) { + # -n 2 required to remove FWD and REV from reads + system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, + "-m", MIN_LENGTH, "-e", MISMATCH, + # output files + "-o", fnFs_trim[i], "-p", fnRs_trim[i], + # input files + fnFs.filtN[i], fnRs.filtN[i]), + stdout = file.path(trim_dir, paste0(basename(fnFs.filtN[[i]]), + "_cutadapt_output.txt"))) +} + +nSamp <- ifelse(nSamp > length(fnFs_trim), length(fnFs_trim), nSamp) +Samps <- c(1:nSamp) + +primers_after <- lapply(Samps, primerHits_perSample, + fwd = fnFs_trim, rev = fnRs_trim) +names(primers_after) <- paste(basename(fnFs_trim[1:nSamp]), + basename(fnRs_trim[1:nSamp])) + +primers_after <- dplyr::bind_rows(primers_after, .id = "Sample") +write.table(primers_after, file.path(path, marker, paste0( + "primers_after_cutadapt_", nSamp, "Samples_", + format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) + +primers_after %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## assess run quality +looks at a random sample of R1/R2 trimmed reads to assess run quality. + +```{r, QC, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize=100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 +sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + +plts_trim <- lapply(c(sample(fnFL, sample_number), sample(fnRL, sample_number)), + plotQualityProfile) + +m_plts <- marrangeGrob(plts_trim, nrow = 2, ncol = 1) +ggsave(file.path(path, marker, "trim", "qual_plots_after_cutadapt.pdf"), + m_plts, width = 20, height = 26, unit = "cm", dpi = 300) + +print(plts_trim) +``` + +## set parameters for dada2 +This chunk contains several options for dada2 that will be used in the processing of the trimmed reads. The settings are saved to a file with a time stamp. + +```{r, set_params, eval=FALSE} +# CHANGE ME according to the quality of the sequencing run. For good quality +# runs, analysing both R1/R2 is desirable. For very poor quality runs, it may +# be worthwhile to analyse only R1. Select from the following: "R01" "both" +analysis = "both" + +# CHANGE ME use "TRUE" to assign taxonomy, "FALSE" to proceed without taxonomic +# assignments +assign_taxonomy = "FALSE" + +# CHANGE ME use "TRUE" to plot quality profiles for each sample, and "FALSE" to +# speed up analysis by skipping this step (plotting takes some time) and "SUB" +# to plot a subset of 10 samples +plotQC = "SUB" + +# CHANGE ME according to the quality of the sequencing run. This determines the +# maximum expected errors for R1 and R2 during the filtering step. A reasonably +# conservative threshold is (2,2). If the data is of lower quality, it may be +# worthwhile to run with higher EEs ex/ (3,5) +my_maxEEf = 2 +my_maxEEr = 2 + +# CHANGE ME according to the quality of the sequencing run. This is a PHRED score +# quality treshold - all sequences will be truncated at the first base with a +# quality score below this value +my_truncQ = 2 + +# CHANGE ME according to the quality of the sequencing run and according to the +# length of the target region. This is the length to cut the (forward,reverse) +# sequences at. Use 0,0 for no truncation. +my_truncLen = c(0, 0) + +# CHANGE ME to trim the left (5') part of all reads (forward, reverse) if a dip +# is observable in the quality plots +left_trim = c(10, 10) + +# CHANGE ME according to the marker used - this is the minimum length of the +# reads after trimming +my_minLen = 100 + +# CHANGE ME specify the minimum number of bases to overlap during merging, must +# be over 10! +my_minoverlap = 30 + +# CHANGE ME to specify the minimum confidence interval for RDP assignment of +# taxonomy. +my_minBootstrap = 80 + +# SWR: CHANGE ME to TRUE and give a value (number of bases) to collapse ASVs +# with an overlap of 'Collaps_minOverlap' (recommended to be the length of the +# shortest ASV that is identical to a longer one). This is only recommended to +# be used in cases where a number of ASVs occur that differ in length but are +# otherwise identical. If this frequently occurs, it may be an issue of primer +# trimming. +Collapse_overlapping = FALSE +Collapse_minOverlap = 200 + +parameters <- paste( + paste("Run on", format(Sys.time(), "%d/%m/%y %H:%M")), + paste("Path: ",path),"Samples:", + paste(c(sapply(strsplit(basename(sort( + list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE))), "_"), `[`, 1)), collapse = ";"), + paste("Analysis type:",analysis), + paste("Database type:",tax_database), + "\ndada2 Parameters\n", + paste("maximum expected errors R1:",my_maxEEf), + paste("maximum expected errors R2:",my_maxEEr), + paste("minimum length: ",my_minLen), + paste("truncate at first instance of Qscore: ",my_truncQ), + paste("truncation at length (0 = no trunc)", + paste(my_truncLen), collapse = ","), + paste("minimum overlap during merging: ",my_minoverlap), + paste("taxonomy bootstrap threshold: ",my_minBootstrap), + paste("Collapse overlapping ASVs: ", Collapse_overlapping), + paste("Minimum overlap if collapsing overlapping ASVs:", + Collapse_minOverlap), + paste("Trimming 5' (R1, R2):", + paste(left_trim, collapse = ",")), + sep = "\n") +write.table(parameters, file.path(path, marker, paste0("parameters_",analysis, + format(Sys.time(), "%d-%m-%y_%H%M") ,".txt")), + row.names = FALSE) + +``` + +## Processing chunk +#### DO NOT ALTER SCRIPT BEYOND THIS POINT!!! + +Comment SWR: This chunk now passes the output directory to a global variable on Linux, so that the shell script chunk that generates the match list can access it. + +```{r, dada2, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker,"trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize = 100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +#Extract sample names +sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1) + +# begin commands for analysing R01/R02 together + +if (analysis == "both") { + + #Create paths for the filtered files in a subdirectory called filtered/ + filtFs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_F_filt.fastq")) + filtRs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_R_filt.fastq")) + + outp <- file.path(path, marker, "R01_R02") + + #Filter sequences + ##SWR: fixed paired end pairing by adding matchIDs=TRUE + out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, + maxEE = c(my_maxEEf,my_maxEEr), truncQ = my_truncQ, + minLen = my_minLen, trimLeft = left_trim, truncLen = my_truncLen, + rm.phix = TRUE, compress = TRUE, multithread = TRUE, + matchIDs = TRUE) + head(out) + write.table(out, file.path(path, marker, "R01_R02", "out.txt")) + + if (plotQC == "TRUE") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL),plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(c(fnRL),plotQualityProfile) + print(plr) + + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl,nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(sample(c(fnRL), sample_number), plotQualityProfile) + print(plr) + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # dereplicate + derepFs <- derepFastq(filtFs[out[,2] > 0], verbose = TRUE) + derepRs <- derepFastq(filtRs[out[,2] > 0], verbose = TRUE) + + # Name the derep-class objects by the sample names + names(derepFs) <- sample.names[out[,2] > 0] + names(derepRs) <- sample.names[out[,2] > 0] + + #Train dada2 to your dataset + errF <- learnErrors(filtFs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + errR <- learnErrors(filtRs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + + ## Plot the Estimated Error Rates for the Transition Types + #check that model and data match reasonably well + eFplot <- plotErrors(errF, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + eRplot <- plotErrors(errR, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R2_error_profile.pdf"), + eRplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # sample inference + dadaFs <- dada(derepFs, err = errF, multithread = TRUE) + dadaRs <- dada(derepRs, err = errR, multithread = TRUE) + + # merge forward and reverse reads + mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, + minOverlap = my_minoverlap, verbose = TRUE) + + # Inspect the merger data.frame from the first sample + head(mergers[[1]]) + merge.tab <- data.frame(matrix(ncol = 10, nrow = 0)) + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", "prefer", + "accept") + for (i in (1:length(mergers))) { + if (nrow(mergers[[i]]) > 0) { + sub <- cbind(names(mergers)[i], mergers[[i]]) + merge.tab <- rbind(merge.tab,sub) + } else {} + } + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", + "prefer", "accept") + + write.table(merge.tab, file.path(path, marker, "R01_R02", "mergers.txt"), + row.names = FALSE) + + + # make sequence table and distribution table for sequence lengths + seqtab <- makeSequenceTable(mergers) + # use only sequences longer than 50 bp as the rdp classifier can't accept less + # than that + seqtab <- seqtab[,nchar(colnames(seqtab)) > 49] + + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtab <- collapseNoMismatch(seqtab, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + + dim(seqtab) + table(nchar(getSequences(seqtab))) + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtab, file.path(path, marker, "R01_R02", "seqtab.rds")) + + + # remove chimeric sequences + seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtab.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtab.nochim)/sum(seqtab) + write.table(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab.nochim.txt")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab_nochim.rds")) + + # output table for each sample + # define a function that counts sequences in a file/object + getN <- function(x) sum(getUniques(x)) + if (min(out[,2]) > 0) { + track <- cbind(out, sapply(dadaFs, getN), + sapply(dadaRs, getN), + sapply(mergers, getN), + rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1]*100) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- sample.names + } else { + track <- cbind(out[out[,2] > 0,], sapply(dadaFs, getN), sapply(dadaRs, getN), + sapply(mergers, getN), ... = rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1][out[,2] > 0]*100) + + # If processing a single sample, remove the sapply calls: + # e.g. replace sapply(dadaFs, getN) with getN(dadaFs) + track <- rbind(track, cbind(out[,1][out[,2] == 0], + out[,2][out[,2] == 0], + NA,NA,NA,NA,NA)) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- c(sample.names[out[,2] > 0], sample.names[out[,2] == 0]) + } + + head(track) + write.table(track, file.path(path, marker, "R01_R02", "track.txt")) + + if (assign_taxonomy == "TRUE") { + taxa <- assignTaxonomy(seqtab.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxa[[1]]) == colnames(seqtab.nochim)) == nrow(taxa[[1]]) + tmp <- cbind(rownames(taxa[[1]]),taxa[[1]],taxa[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + tmp <- data.frame(tmp) + + write.table(tmp, file.path(path, marker, + "R01_R02", "taxonomy_ASVs_NC_R1R2.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxa$tax, file.path(path, marker, "R01_R02", "taxa.rds")) + + tmp2 <- cbind(t(seqtab.nochim), tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_R02", "seqtab.nochim_withtax.txt")) + + # write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + + # write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtab.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_withtax.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV",1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + } +} else { + + + ###################################################################################### + # Forward Reads Only + ################################################################################### + + # Create paths for the filtered files in a subdirectory called filtered/ + + filtFsR1 <- file.path(path, marker, "R01_only", "filtered", + paste0(sample.names, "_F_filt.fastq")) + + outp <- file.path(path, marker, "R01_only") + + # Filter sequences + outR1 <- filterAndTrim(fnFs, filtFsR1, + maxEE = my_maxEEf, truncQ = my_truncQ, minLen = my_minLen, + trimLeft = left_trim[1], rm.phix = TRUE, + compress = TRUE, multithread = TRUE) + head(outR1) + write.table(outR1, file.path(path, marker, "R01_only", "out.txt")) + + if (plotQC == "TRUE") { + #Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # Train dada2 to your dataset + errFR1 <- learnErrors(filtFsR1[outR1[,2] > 0], multithread = TRUE) + eFplot <- plotErrors(errFR1, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_only", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # dereplicate + derepFsR1 <- derepFastq(filtFsR1[outR1[,2] > 0], verbose = TRUE) + # Name the derep-class objects by the sample names + names(derepFsR1) <- sample.names[outR1[,2] > 0] + + # sample inference + dadaFsR1 <- dada(derepFsR1, err = errFR1, multithread = TRUE) + + + # make sequence table and distribution table for sequence lengths + seqtabR1 <- makeSequenceTable(dadaFsR1) + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtabR1 <- collapseNoMismatch(seqtabR1, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + dim(seqtabR1) + table(nchar(getSequences(seqtabR1))) + # RDP can't assign taxonomy to sequences shorter than 50bp, so filter to > 50bp + seqtabR1 <- seqtabR1[,nchar(colnames(seqtabR1)) > 49] + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtabR1, file.path(path, marker, "R01_only", "seqtab.rds")) + + # remove chimeric sequences + seqtabR1.nochim <- removeBimeraDenovo(seqtabR1, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtabR1.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtabR1.nochim)/sum(seqtabR1) + write.table(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtabR1.nochim")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtab_nochim.rds")) + + # output table for each sample + getN <- function(x) sum(getUniques(x)) + if (min(outR1[,2]) > 0) { + trackR1 <- cbind(outR1, sapply(dadaFsR1,getN), rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1]*100) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names) + } else { + trackR1 <- cbind(outR1[outR1[,2] > 0,], sapply(dadaFsR1, getN), + rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1][outR1[,2] > 0]*100) + trackR1 <- rbind(trackR1,cbind(outR1[,1][outR1[,2] == 0], + outR1[,2][outR1[,2] == 0], + NA,NA,NA)) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names[!outR1[,2] == 0], + sample.names[outR1[,2] == 0]) + } + + head(trackR1) + write.table(trackR1, file.path(path, marker, "R01_only", "trackR1.txt")) + + if (assign_taxonomy == "TRUE") { + taxaR1 <- assignTaxonomy(seqtabR1.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxaR1[[1]]) == colnames(seqtabR1.nochim)) == nrow(taxaR1[[1]]) + tmp <- cbind(rownames(taxaR1[[1]]), taxaR1[[1]], taxaR1[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + tmp <- data.frame(tmp) + write.table(tmp, file.path(path, marker, + "R01_only", "taxonomy_ASVs_NC_R1.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxaR1$tax, file.path(path, marker, "R01_only", "taxa.rds")) + + tmp2 <- cbind(t(seqtabR1.nochim),tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_only", "seqtabR1.nochim_withtax.txt")) + + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + + #write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtabR1.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_only", "ASVs_withtax_R1.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + } +} + +#make paths available to subsequent chunks +dada2_dir <- outp +#If Linux, make 'dada2_folder'available globally. + if (os == "Linux") { + Sys.setenv(dada2Folder = dada2_dir) + } +``` \ No newline at end of file diff --git a/analysis/raw_read_processing/Nems_Dada2_pipeline_S-mod_v4-3.Rmd b/analysis/raw_read_processing/Nems_Dada2_pipeline_S-mod_v4-3.Rmd new file mode 100644 index 0000000000000000000000000000000000000000..b9427f95c160e176d3a48bdd5594b517efd04af9 --- /dev/null +++ b/analysis/raw_read_processing/Nems_Dada2_pipeline_S-mod_v4-3.Rmd @@ -0,0 +1,873 @@ +--- +title: "dada2 microbiome pipeline v.4.2" +author: "Simeon Rossmann, Marie L. Davey" +date: "16.12.2019" +output: + pdf_document: default + html_document: default +last_modified: "Simeon Rossmann (SWR)" +date_modified: "20.11.2020" +version: 4.3 +--- + +This is an R markdown document for the analysis of MiSeq amplicon sequencing data. In order to customise the pipeline, search 'CHANGE ME' to find different settings you should adjust according to your data. All parameters requiring changes are located within the first 4 chunks. After the fourth chunk has been run, you can simply click 'Run'\>'Run all chunks below' in order to process the data + +Comment SWR: A new chunk was added to allow for quick reanalysis of taxonomy and pooled analysis of multiple samples after error inference. Small changes in chunk 4 (exporting seqtab.RDP in addition) were made to facilitate this. Therefore it's not recommended to 'Run all chunks below' from chunk 4 in this version. + +## Changelog + +### v4 update + +Comment SWR: Two more chunks were added to integrate [LULU](https://www.nature.com/articles/s41467-017-01312-x) post-processing into this pipeline. The first of the two new chunks is a shell script chunk that generates a match list (most similar ASVs for each ASV) using vsearch. The match list can also be generated by other means (f. eks. BLAST). The second of the new chunks is the core LULU chunk that takes the seqtab_nochim file from the dada2 analysis and the match list and performs the LULU post-processing with standard parameters. After this, a new seqtab_nochim, ASV-fasta and RDP taxonomy (optional) are generated in a lulu-subdirectory. + +#### v4.3 + +This update addresses inconsistencies in code usage between authors that were the likely cause of occasional warning messages and errors when creating files and directories in-function on some machines/setups. These problems occurred inconsistently and were not reproduced but seemed to be fixed by cleaner file path encoding. + +- minor code format fixes: + + - implementation of file path concatenation cleaned up by replacing all uses of `paste0()` for file paths with `file.path()` and removal of leading and closing slashes `/` + - replaced all instances of in-code object assignment using `=` with `<-` to clearly distinguish it from user assignable objects and argument assignment in functions. This means that objects assigned with `=` are intended for user input while objects assigned with `<-` are not to be directly manipulated by users. + +- other format fixes: + + - improved cosmetics when knitting by manually adjusted line breaks in code chunks + - added spaces surrounding `=` and following `,` to conform with best practices + - more consistent line spacing in code chunks + - Better text formatting in the Markdown text portions + +#### v4.2 + +- improved cutadapt with quality control and fwd primer removal, can be run through R in separate chunk +- filterAndTrim option expanded with trimLeft to remove the first n 5' bases regardless of content + +#### v4.1 + +- fixed an error when attempting to plot empty files +- fixed an error when attempting to sample more files than available for plotting + + +## Setup R Environment + +Comment SWR: If the OS is Linux, this chunk now passes the path and marker variables to the shell script chunk, making it unnecessary to specify this separately below and avoiding issues with misspelled variables. It is possible that this also works with Mac OS, but this was not included since no Mac machine was available for testing. + +```{r setup, eval=FALSE} +############set paths and marker name +# CHANGE ME to the directory (ABSOLUTE FILEPATH e.g. on Linux /home/usr/...) +# containing the fastq files after unzipping. +path = "/home/simeon/Documents/Marte_Metabarcoding/Run_August21" +#CHANGE ME to the marker you have sequenced +marker = c("Nems") +# CHANGE ME to the path for the taxonomy database you will be using for +# identification (if any) +tax_database = "no/path/here" + +#set knitr chunk options +knitr::opts_chunk$set(echo = TRUE) +knitr::opts_chunk$set(eval = FALSE) +knitr::opts_chunk$set(tidy = TRUE) +knitr::opts_chunk$set(root.dir = file.path(path)) + +#load libraries for R session +library(dada2) +library(phyloseq) +library(ggplot2) +library(Biostrings) +library(grid) +library(gridExtra) +library(ShortRead) +library(tidyverse) + +# Get info on the operating system. If Linux, make 'path' and 'marker' available +# globally, if not, warn that the script needs additional attention. +os <- Sys.info()['sysname'] +if (os == "Linux") { + Sys.setenv(projectpath = path) + Sys.setenv(marker = marker) +}else{ + cat(os, "Please run this script on Linux, + it is no longer tested and optimized for Windows") +} +``` + +## Primer presence and orientation check + +This chunk is based on the tutorial for [ITS workflows from the DADA2 GitHub page](https://benjjneb.github.io/dada2/ITS_workflow.html) and offers an assessment of primer presence before cutadapt processing. This should be done to check the orientation used in cutadapt primer removal. Raw fastq.gz files must be located in 'path' or 'path/marker' or 'path/marker/raw_data'. If fastq.gz files are in 'path/marker' or 'path/marker/raw_data', all files will be used in the analysis, if fastq.gz files are in 'path', you may choose whether to only use files that contain the marker in the file name or all files. + +```{r primer-check, eval=FALSE} +#CHANGE ME to number of samples to check. Numerical values (e.g. 1, 5, 10) or +# "all" are accepted inputs. +nSamp = 10 + +#CHANGE ME to TRUE if you only want to move files from 'path' to +# 'path/marker/raw_data' if they contain the marker in the file name. +# E.g. marker = 16S, files = 'Sxxx_16S_L001_R1.fastq.gz'. If FALSE, all fastq.gz +# files will be moved from 'path' to 'path/marker/raw_data'. All fastq.gz files +# will be moved from 'path/marker' to 'path/marker/raw_data' regardless of this. +marker_in_filename = FALSE + +#CHANGE ME by uncommenting the appropriate primer pair +#for the Oomycete ITS1 marker: + #Fprimer = "CGGAAGGATCATTACCAC" + #Rprimer = "AGCCTAGACATCCACTGCTG" +#for the Oomycete ITS2 marker: + #Fprimer = "TTGAACGCAYATTGCACTT" + #Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Fungal ITS1 marker: + #Fprimer = "AAGTCGTAACAAGGTTTCC" + #Rprimer = "TGCVAGARCCAAGAGATC" +#for the Fungal ITS2 marker: + #Fprimer = "GTGARTCATCGAATCTTTG" + #Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Bacterial 16S marker: + #Fprimer = "GTGYCAGCMGCCGCGGTAA" + #Rprimer = "GGACTACNVGGGTWTCTAAT" +#for the plant Sper marker: + #rc_Fprimer = "GGGCAATCCTGAGCCAA" + #rc_Rprimer = "ATTGAGTCTCTGCACCTATC" +#for the plant Trac marker: + #Fprimer = "CRTTGCCGAGAGTC" + #Rprimer = "AAGTCGTAACAAGG" +#for the nematode Nem marker + Fprimer = "GAGGGCAAGTCTGGTG" + Rprimer = "TTTACGGTYAGAACTAGGG" +#for the nematode Trich marker + #Fprimer = "GAGGGCAAGTCTGGTG" + #Rprimer = "TTTACGGTCTGAACTAAAG" + + +#### NO CHANGES NEEDED BELOW #### +#Check if Marker folder exists, otherwise create it +dir.create(file.path(path, marker)) + +#Check if nSamp is numeric or "all", otherwise set it to "all" +nSamp <- ifelse(is.character(nSamp) | is.numeric(nSamp), nSamp, "all") +nSamp <- ifelse(nSamp == "all" | is.numeric(nSamp), nSamp, "all") + +# Move raw reads to their own subdirectory path/marker/raw_data. If files are in +# path, check whether some contain 'marker' in name. If they exist only move +# those, otherwise move all. +raw_dir <- file.path(path, marker, "raw_data") +dir.create(raw_dir) +if (length(list.files(file.path(path, marker), + pattern = ".fastq.gz", full.names = TRUE)) != 0) { + fnraw <- sort(list.files(file.path(path, marker), + pattern = "fastq.gz", full.names = TRUE)) +} + +if (length(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) != 0) { + if (marker_in_filename) { + fnrawp <- sort(list.files( + path, paste0(marker,"(.*).fastq.gz"), full.names = TRUE)) + }else{ + fnrawp <- sort(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) + } +} +if (exists("fnraw")) { + invisible(file.copy(fnraw, file.path(raw_dir, basename(fnraw)))) + invisible(file.remove(fnraw)) + rm(fnraw)} +if (exists("fnrawp")) { + invisible(file.copy(fnrawp, file.path(raw_dir, basename(fnrawp)))) + invisible(file.remove(fnrawp)) + rm(fnrawp)} + +#Create Lists of Forward and Reverse Filenames and a List of Sample Names +fnFs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +allOrients <- function(primer) { + # Create all orientations of the input sequence + require(Biostrings) + # Biostrings works w/ DNAString objects rather than character vectors + dna <- DNAString(primer) + orients <- c(Forward = dna, + Complement = complement(dna), + Reverse = reverse(dna), + RevComp = reverseComplement(dna)) + return(sapply(orients, toString)) # Convert back to character vector +} + +FWD.orients <- allOrients(Fprimer) +REV.orients <- allOrients(Rprimer) + +# Put N-filtered files in filtN/ subdirectory +fnFs.filtN <- file.path(path, marker, "filtN", basename(fnFs)) +fnRs.filtN <- file.path(path, marker, "filtN", basename(fnRs)) + +filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE) + +fnFs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +primerHits <- function(primer, fn) { + # Counts number of reads in which the primer is found + nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE) + return(sum(nhits > 0)) +} + +primerHits_perSample <- function(sampnum, fwd = fnFs, rev = fnRs){ + hit_table <- rbind(FWD_primer_R1_Reads = sapply( + FWD.orients, primerHits, fn = fwd[[sampnum]]), + FWD_primer_R2_Reads = sapply(FWD.orients, primerHits, fn = rev[[sampnum]]), + REV_primer_R1_Reads = sapply(REV.orients, primerHits, fn = fwd[[sampnum]]), + REV_primer_R2_Reads = sapply(REV.orients, primerHits, fn = rev[[sampnum]])) + hit_table <- as.data.frame(hit_table) + hit_table <- tibble::rownames_to_column(hit_table, "Primer_Read") + return(hit_table) +} + +nSamp <- ifelse(nSamp == "all", length(fnFs.filtN), nSamp) +nSamp <- ifelse(nSamp > length(fnFs.filtN), length(fnFs.filtN), nSamp) +Samps <- c(1:nSamp) + +primers_before <- lapply(Samps, primerHits_perSample, + fwd = fnFs.filtN, rev = fnRs.filtN) +names(primers_before) <- paste(basename(fnFs.filtN[1:nSamp]), + basename(fnRs.filtN[1:nSamp])) + +primers_before <- dplyr::bind_rows(primers_before, .id = "Sample") %>% + as_tibble() +write.table(primers_before, + file.path(path, marker, paste0("primers_before_cutadapt_", + nSamp, "Samples_", format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) +primers_before %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## Improved cutadapt + +More flexible implementation of the cutadapt processing, allows changing orientation of fwd and rev primers if the chunk above indicated that the orientation for one or both primers is not as expected. Expectation: FWD primer matches R1 in Forward orientation and/or R2 in RevComp orientation, REV primer matches R2 in Forward orientation and/or R1 in RevComp orientation. All other scenarios will require adjusting the primer orientation for cutadapt! This chunk offers a possibility to simply flip the FWD or REV primers separately if necessary but more complex issues will have to be resolved manually. + +Both, the new and old implementation of cutadapt wil NOT run successfully on a Windows system, even though the rest of the pipeline *can*. + +```{r cutadapt-improved, eval=FALSE} +# CHANGE ME to the path of your cutadapt installation. This can be found out by +# running 'which cutadapt' in the bash command line +cutadapt = "/home/simeon/miniconda3/envs/bioinfo/bin/cutadapt" + +# CHANGE ME to flip the FWD primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_FWD = FALSE + +#CHANGE ME to flip the REV primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_REV = FALSE + +#CHANGE ME to define the minimum length of reads after trimming. Reads shorter +# than this value will be discarded and won't appear in the output files. The +# description of this parameter in the deprecated chunk was not correct! +MIN_LENGTH = 15 + +##SWR: define a variable for the number of mismatches to be used in cutadapt +#CHANGE ME to the rate of mismatches (0 =< value < 1) you want cutadapt to +# accept when finding the primer. When choosing a value different from 0, it is +# recommended to compare how many sequences get trimmed when using both, 0 and +# the alternative value. +MISMATCH = 0 + +#### NO CHANGES NEEDED BELOW #### + +# This should print the version of cutadapt you're using if the path is given +# correctly and system2 can access cutadapt. +# THIS WILL NOT WORK IN A WINDOWS SYSTEM! +system2(cutadapt, args = "--version") + +#Make output directory for trimmed reads in path/marker/trim +trim_dir <- file.path(path, marker, "trim") +dir.create(trim_dir) +fnFs_trim <- file.path(trim_dir, paste0(basename(fnFs.filtN), ".trim.fastq.gz")) +fnRs_trim <- file.path(trim_dir, paste0(basename(fnRs.filtN), ".trim.fastq.gz")) + +#Define primers and Revcomps, Flip if specified +F_fwd <- ifelse(flip_FWD, FWD.orients["RevComp"], FWD.orients["Forward"]) +F_rc <- ifelse(flip_FWD, FWD.orients["Forward"], FWD.orients["RevComp"]) +R_fwd <- ifelse(flip_REV, REV.orients["RevComp"], REV.orients["Forward"]) +R_rc <- ifelse(flip_REV, REV.orients["Forward"], REV.orients["RevComp"]) + +# Trim FWD and the reverse-complement of REV off of R1 (forward reads) +R1.flags <- paste("-g", F_fwd, "-a", R_rc) +# Trim REV and the reverse-complement of FWD off of R2 (reverse reads) +R2.flags <- paste("-G", R_fwd, "-A", F_rc) + +# Run Cutadapt +for (i in seq_along(fnFs.filtN)) { + # -n 2 required to remove FWD and REV from reads + system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, + "-m", MIN_LENGTH, "-e", MISMATCH, + # output files + "-o", fnFs_trim[i], "-p", fnRs_trim[i], + # input files + fnFs.filtN[i], fnRs.filtN[i]), + stdout = file.path(trim_dir, paste0(basename(fnFs.filtN[[i]]), + "_cutadapt_output.txt"))) +} + +nSamp <- ifelse(nSamp > length(fnFs_trim), length(fnFs_trim), nSamp) +Samps <- c(1:nSamp) + +primers_after <- lapply(Samps, primerHits_perSample, + fwd = fnFs_trim, rev = fnRs_trim) +names(primers_after) <- paste(basename(fnFs_trim[1:nSamp]), + basename(fnRs_trim[1:nSamp])) + +primers_after <- dplyr::bind_rows(primers_after, .id = "Sample") +write.table(primers_after, file.path(path, marker, paste0( + "primers_after_cutadapt_", nSamp, "Samples_", + format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) + +primers_after %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## assess run quality +looks at a random sample of R1/R2 trimmed reads to assess run quality. + +```{r, QC, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize=100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 +sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + +plts_trim <- lapply(c(sample(fnFL, sample_number), sample(fnRL, sample_number)), + plotQualityProfile) + +m_plts <- marrangeGrob(plts_trim, nrow = 2, ncol = 1) +ggsave(file.path(path, marker, "trim", "qual_plots_after_cutadapt.pdf"), + m_plts, width = 20, height = 26, unit = "cm", dpi = 300) + +print(plts_trim) +``` + +## set parameters for dada2 +This chunk contains several options for dada2 that will be used in the processing of the trimmed reads. The settings are saved to a file with a time stamp. + +```{r, set_params, eval=FALSE} +# CHANGE ME according to the quality of the sequencing run. For good quality +# runs, analysing both R1/R2 is desirable. For very poor quality runs, it may +# be worthwhile to analyse only R1. Select from the following: "R01" "both" +analysis = "both" + +# CHANGE ME use "TRUE" to assign taxonomy, "FALSE" to proceed without taxonomic +# assignments +assign_taxonomy = "FALSE" + +# CHANGE ME use "TRUE" to plot quality profiles for each sample, and "FALSE" to +# speed up analysis by skipping this step (plotting takes some time) and "SUB" +# to plot a subset of 10 samples +plotQC = "SUB" + +# CHANGE ME according to the quality of the sequencing run. This determines the +# maximum expected errors for R1 and R2 during the filtering step. A reasonably +# conservative threshold is (2,2). If the data is of lower quality, it may be +# worthwhile to run with higher EEs ex/ (3,5) +my_maxEEf = 2 +my_maxEEr = 2 + +# CHANGE ME according to the quality of the sequencing run. This is a PHRED score +# quality treshold - all sequences will be truncated at the first base with a +# quality score below this value +my_truncQ = 2 + +# CHANGE ME according to the quality of the sequencing run and according to the +# length of the target region. This is the length to cut the (forward,reverse) +# sequences at. Use 0,0 for no truncation. +my_truncLen = c(0, 0) + +# CHANGE ME to trim the left (5') part of all reads (forward, reverse) if a dip +# is observable in the quality plots +left_trim = c(10, 10) + +# CHANGE ME according to the marker used - this is the minimum length of the +# reads after trimming +my_minLen = 100 + +# CHANGE ME specify the minimum number of bases to overlap during merging, must +# be over 10! +my_minoverlap = 30 + +# CHANGE ME to specify the minimum confidence interval for RDP assignment of +# taxonomy. +my_minBootstrap = 80 + +# SWR: CHANGE ME to TRUE and give a value (number of bases) to collapse ASVs +# with an overlap of 'Collaps_minOverlap' (recommended to be the length of the +# shortest ASV that is identical to a longer one). This is only recommended to +# be used in cases where a number of ASVs occur that differ in length but are +# otherwise identical. If this frequently occurs, it may be an issue of primer +# trimming. +Collapse_overlapping = FALSE +Collapse_minOverlap = 200 + +parameters <- paste( + paste("Run on", format(Sys.time(), "%d/%m/%y %H:%M")), + paste("Path: ",path),"Samples:", + paste(c(sapply(strsplit(basename(sort( + list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE))), "_"), `[`, 1)), collapse = ";"), + paste("Analysis type:",analysis), + paste("Database type:",tax_database), + "\ndada2 Parameters\n", + paste("maximum expected errors R1:",my_maxEEf), + paste("maximum expected errors R2:",my_maxEEr), + paste("minimum length: ",my_minLen), + paste("truncate at first instance of Qscore: ",my_truncQ), + paste("truncation at length (0 = no trunc)", + paste(my_truncLen), collapse = ","), + paste("minimum overlap during merging: ",my_minoverlap), + paste("taxonomy bootstrap threshold: ",my_minBootstrap), + paste("Collapse overlapping ASVs: ", Collapse_overlapping), + paste("Minimum overlap if collapsing overlapping ASVs:", + Collapse_minOverlap), + paste("Trimming 5' (R1, R2):", + paste(left_trim, collapse = ",")), + sep = "\n") +write.table(parameters, file.path(path, marker, paste0("parameters_",analysis, + format(Sys.time(), "%d-%m-%y_%H%M") ,".txt")), + row.names = FALSE) + +``` + +## Processing chunk +#### DO NOT ALTER SCRIPT BEYOND THIS POINT!!! + +Comment SWR: This chunk now passes the output directory to a global variable on Linux, so that the shell script chunk that generates the match list can access it. + +```{r, dada2, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker,"trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize = 100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +#Extract sample names +sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1) + +# begin commands for analysing R01/R02 together + +if (analysis == "both") { + + #Create paths for the filtered files in a subdirectory called filtered/ + filtFs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_F_filt.fastq")) + filtRs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_R_filt.fastq")) + + outp <- file.path(path, marker, "R01_R02") + + #Filter sequences + ##SWR: fixed paired end pairing by adding matchIDs=TRUE + out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, + maxEE = c(my_maxEEf,my_maxEEr), truncQ = my_truncQ, + minLen = my_minLen, trimLeft = left_trim, truncLen = my_truncLen, + rm.phix = TRUE, compress = TRUE, multithread = TRUE, + matchIDs = TRUE) + head(out) + write.table(out, file.path(path, marker, "R01_R02", "out.txt")) + + if (plotQC == "TRUE") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL),plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(c(fnRL),plotQualityProfile) + print(plr) + + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl,nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(sample(c(fnRL), sample_number), plotQualityProfile) + print(plr) + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # dereplicate + derepFs <- derepFastq(filtFs[out[,2] > 0], verbose = TRUE) + derepRs <- derepFastq(filtRs[out[,2] > 0], verbose = TRUE) + + # Name the derep-class objects by the sample names + names(derepFs) <- sample.names[out[,2] > 0] + names(derepRs) <- sample.names[out[,2] > 0] + + #Train dada2 to your dataset + errF <- learnErrors(filtFs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + errR <- learnErrors(filtRs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + + ## Plot the Estimated Error Rates for the Transition Types + #check that model and data match reasonably well + eFplot <- plotErrors(errF, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + eRplot <- plotErrors(errR, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R2_error_profile.pdf"), + eRplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # sample inference + dadaFs <- dada(derepFs, err = errF, multithread = TRUE) + dadaRs <- dada(derepRs, err = errR, multithread = TRUE) + + # merge forward and reverse reads + mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, + minOverlap = my_minoverlap, verbose = TRUE) + + # Inspect the merger data.frame from the first sample + head(mergers[[1]]) + merge.tab <- data.frame(matrix(ncol = 10, nrow = 0)) + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", "prefer", + "accept") + for (i in (1:length(mergers))) { + if (nrow(mergers[[i]]) > 0) { + sub <- cbind(names(mergers)[i], mergers[[i]]) + merge.tab <- rbind(merge.tab,sub) + } else {} + } + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", + "prefer", "accept") + + write.table(merge.tab, file.path(path, marker, "R01_R02", "mergers.txt"), + row.names = FALSE) + + + # make sequence table and distribution table for sequence lengths + seqtab <- makeSequenceTable(mergers) + # use only sequences longer than 50 bp as the rdp classifier can't accept less + # than that + seqtab <- seqtab[,nchar(colnames(seqtab)) > 49] + + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtab <- collapseNoMismatch(seqtab, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + + dim(seqtab) + table(nchar(getSequences(seqtab))) + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtab, file.path(path, marker, "R01_R02", "seqtab.rds")) + + + # remove chimeric sequences + seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtab.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtab.nochim)/sum(seqtab) + write.table(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab.nochim.txt")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab_nochim.rds")) + + # output table for each sample + # define a function that counts sequences in a file/object + getN <- function(x) sum(getUniques(x)) + if (min(out[,2]) > 0) { + track <- cbind(out, sapply(dadaFs, getN), + sapply(dadaRs, getN), + sapply(mergers, getN), + rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1]*100) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- sample.names + } else { + track <- cbind(out[out[,2] > 0,], sapply(dadaFs, getN), sapply(dadaRs, getN), + sapply(mergers, getN), ... = rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1][out[,2] > 0]*100) + + # If processing a single sample, remove the sapply calls: + # e.g. replace sapply(dadaFs, getN) with getN(dadaFs) + track <- rbind(track, cbind(out[,1][out[,2] == 0], + out[,2][out[,2] == 0], + NA,NA,NA,NA,NA)) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- c(sample.names[out[,2] > 0], sample.names[out[,2] == 0]) + } + + head(track) + write.table(track, file.path(path, marker, "R01_R02", "track.txt")) + + if (assign_taxonomy == "TRUE") { + taxa <- assignTaxonomy(seqtab.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxa[[1]]) == colnames(seqtab.nochim)) == nrow(taxa[[1]]) + tmp <- cbind(rownames(taxa[[1]]),taxa[[1]],taxa[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + tmp <- data.frame(tmp) + + write.table(tmp, file.path(path, marker, + "R01_R02", "taxonomy_ASVs_NC_R1R2.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxa$tax, file.path(path, marker, "R01_R02", "taxa.rds")) + + tmp2 <- cbind(t(seqtab.nochim), tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_R02", "seqtab.nochim_withtax.txt")) + + # write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + + # write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtab.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_withtax.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV",1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + } +} else { + + + ###################################################################################### + # Forward Reads Only + ################################################################################### + + # Create paths for the filtered files in a subdirectory called filtered/ + + filtFsR1 <- file.path(path, marker, "R01_only", "filtered", + paste0(sample.names, "_F_filt.fastq")) + + outp <- file.path(path, marker, "R01_only") + + # Filter sequences + outR1 <- filterAndTrim(fnFs, filtFsR1, + maxEE = my_maxEEf, truncQ = my_truncQ, minLen = my_minLen, + trimLeft = left_trim[1], rm.phix = TRUE, + compress = TRUE, multithread = TRUE) + head(outR1) + write.table(outR1, file.path(path, marker, "R01_only", "out.txt")) + + if (plotQC == "TRUE") { + #Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # Train dada2 to your dataset + errFR1 <- learnErrors(filtFsR1[outR1[,2] > 0], multithread = TRUE) + eFplot <- plotErrors(errFR1, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_only", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # dereplicate + derepFsR1 <- derepFastq(filtFsR1[outR1[,2] > 0], verbose = TRUE) + # Name the derep-class objects by the sample names + names(derepFsR1) <- sample.names[outR1[,2] > 0] + + # sample inference + dadaFsR1 <- dada(derepFsR1, err = errFR1, multithread = TRUE) + + + # make sequence table and distribution table for sequence lengths + seqtabR1 <- makeSequenceTable(dadaFsR1) + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtabR1 <- collapseNoMismatch(seqtabR1, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + dim(seqtabR1) + table(nchar(getSequences(seqtabR1))) + # RDP can't assign taxonomy to sequences shorter than 50bp, so filter to > 50bp + seqtabR1 <- seqtabR1[,nchar(colnames(seqtabR1)) > 49] + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtabR1, file.path(path, marker, "R01_only", "seqtab.rds")) + + # remove chimeric sequences + seqtabR1.nochim <- removeBimeraDenovo(seqtabR1, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtabR1.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtabR1.nochim)/sum(seqtabR1) + write.table(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtabR1.nochim")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtab_nochim.rds")) + + # output table for each sample + getN <- function(x) sum(getUniques(x)) + if (min(outR1[,2]) > 0) { + trackR1 <- cbind(outR1, sapply(dadaFsR1,getN), rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1]*100) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names) + } else { + trackR1 <- cbind(outR1[outR1[,2] > 0,], sapply(dadaFsR1, getN), + rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1][outR1[,2] > 0]*100) + trackR1 <- rbind(trackR1,cbind(outR1[,1][outR1[,2] == 0], + outR1[,2][outR1[,2] == 0], + NA,NA,NA)) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names[!outR1[,2] == 0], + sample.names[outR1[,2] == 0]) + } + + head(trackR1) + write.table(trackR1, file.path(path, marker, "R01_only", "trackR1.txt")) + + if (assign_taxonomy == "TRUE") { + taxaR1 <- assignTaxonomy(seqtabR1.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxaR1[[1]]) == colnames(seqtabR1.nochim)) == nrow(taxaR1[[1]]) + tmp <- cbind(rownames(taxaR1[[1]]), taxaR1[[1]], taxaR1[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + tmp <- data.frame(tmp) + write.table(tmp, file.path(path, marker, + "R01_only", "taxonomy_ASVs_NC_R1.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxaR1$tax, file.path(path, marker, "R01_only", "taxa.rds")) + + tmp2 <- cbind(t(seqtabR1.nochim),tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_only", "seqtabR1.nochim_withtax.txt")) + + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + + #write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtabR1.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_only", "ASVs_withtax_R1.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + } +} + +#make paths available to subsequent chunks +dada2_dir <- outp +#If Linux, make 'dada2_folder'available globally. + if (os == "Linux") { + Sys.setenv(dada2Folder = dada2_dir) + } +``` \ No newline at end of file diff --git a/analysis/raw_read_processing/OITS_Dada2_pipeline_S-mod_v4-3.Rmd b/analysis/raw_read_processing/OITS_Dada2_pipeline_S-mod_v4-3.Rmd new file mode 100644 index 0000000000000000000000000000000000000000..f7471f4a3f7557f4547baa051c66b104a67ddcf9 --- /dev/null +++ b/analysis/raw_read_processing/OITS_Dada2_pipeline_S-mod_v4-3.Rmd @@ -0,0 +1,869 @@ +--- +title: "dada2 microbiome pipeline v.4.2" +author: "Simeon Rossmann, Marie L. Davey" +date: "16.12.2019" +output: + pdf_document: default + html_document: default +last_modified: "Simeon Rossmann (SWR)" +date_modified: "20.11.2020" +version: 4.3 +--- + +This is an R markdown document for the analysis of MiSeq amplicon sequencing data. In order to customise the pipeline, search 'CHANGE ME' to find different settings you should adjust according to your data. All parameters requiring changes are located within the first 4 chunks. After the fourth chunk has been run, you can simply click 'Run'\>'Run all chunks below' in order to process the data + +Comment SWR: A new chunk was added to allow for quick reanalysis of taxonomy and pooled analysis of multiple samples after error inference. Small changes in chunk 4 (exporting seqtab.RDP in addition) were made to facilitate this. Therefore it's not recommended to 'Run all chunks below' from chunk 4 in this version. + +## Changelog + +### v4 update + +Comment SWR: Two more chunks were added to integrate [LULU](https://www.nature.com/articles/s41467-017-01312-x) post-processing into this pipeline. The first of the two new chunks is a shell script chunk that generates a match list (most similar ASVs for each ASV) using vsearch. The match list can also be generated by other means (f. eks. BLAST). The second of the new chunks is the core LULU chunk that takes the seqtab_nochim file from the dada2 analysis and the match list and performs the LULU post-processing with standard parameters. After this, a new seqtab_nochim, ASV-fasta and RDP taxonomy (optional) are generated in a lulu-subdirectory. + +#### v4.3 + +This update addresses inconsistencies in code usage between authors that were the likely cause of occasional warning messages and errors when creating files and directories in-function on some machines/setups. These problems occurred inconsistently and were not reproduced but seemed to be fixed by cleaner file path encoding. + +- minor code format fixes: + + - implementation of file path concatenation cleaned up by replacing all uses of `paste0()` for file paths with `file.path()` and removal of leading and closing slashes `/` + - replaced all instances of in-code object assignment using `=` with `<-` to clearly distinguish it from user assignable objects and argument assignment in functions. This means that objects assigned with `=` are intended for user input while objects assigned with `<-` are not to be directly manipulated by users. + +- other format fixes: + + - improved cosmetics when knitting by manually adjusted line breaks in code chunks + - added spaces surrounding `=` and following `,` to conform with best practices + - more consistent line spacing in code chunks + - Better text formatting in the Markdown text portions + +#### v4.2 + +- improved cutadapt with quality control and fwd primer removal, can be run through R in separate chunk +- filterAndTrim option expanded with trimLeft to remove the first n 5' bases regardless of content + +#### v4.1 + +- fixed an error when attempting to plot empty files +- fixed an error when attempting to sample more files than available for plotting + + +## Setup R Environment + +Comment SWR: If the OS is Linux, this chunk now passes the path and marker variables to the shell script chunk, making it unnecessary to specify this separately below and avoiding issues with misspelled variables. It is possible that this also works with Mac OS, but this was not included since no Mac machine was available for testing. + +```{r setup, eval=FALSE} +############set paths and marker name +# CHANGE ME to the directory (ABSOLUTE FILEPATH e.g. on Linux /home/usr/...) +# containing the fastq files after unzipping. +path = "/home/simeon/Documents/Marte_Metabarcoding/Run_August21" +#CHANGE ME to the marker you have sequenced +marker = c("OITS") +# CHANGE ME to the path for the taxonomy database you will be using for +# identification (if any) +tax_database = "no/path/here" + +#set knitr chunk options +knitr::opts_chunk$set(echo = TRUE) +knitr::opts_chunk$set(eval = FALSE) +knitr::opts_chunk$set(tidy = TRUE) +knitr::opts_chunk$set(root.dir = file.path(path)) + +#load libraries for R session +library(dada2) +library(phyloseq) +library(ggplot2) +library(Biostrings) +library(grid) +library(gridExtra) +library(ShortRead) +library(tidyverse) + +# Get info on the operating system. If Linux, make 'path' and 'marker' available +# globally, if not, warn that the script needs additional attention. +os <- Sys.info()['sysname'] +if (os == "Linux") { + Sys.setenv(projectpath = path) + Sys.setenv(marker = marker) +}else{ + cat(os, "Please run this script on Linux, + it is no longer tested and optimized for Windows") +} +``` + +## Primer presence and orientation check + +This chunk is based on the tutorial for [ITS workflows from the DADA2 GitHub page](https://benjjneb.github.io/dada2/ITS_workflow.html) and offers an assessment of primer presence before cutadapt processing. This should be done to check the orientation used in cutadapt primer removal. Raw fastq.gz files must be located in 'path' or 'path/marker' or 'path/marker/raw_data'. If fastq.gz files are in 'path/marker' or 'path/marker/raw_data', all files will be used in the analysis, if fastq.gz files are in 'path', you may choose whether to only use files that contain the marker in the file name or all files. + +```{r primer-check, eval=FALSE} +#CHANGE ME to number of samples to check. Numerical values (e.g. 1, 5, 10) or +# "all" are accepted inputs. +nSamp = 10 + +#CHANGE ME to TRUE if you only want to move files from 'path' to +# 'path/marker/raw_data' if they contain the marker in the file name. +# E.g. marker = 16S, files = 'Sxxx_16S_L001_R1.fastq.gz'. If FALSE, all fastq.gz +# files will be moved from 'path' to 'path/marker/raw_data'. All fastq.gz files +# will be moved from 'path/marker' to 'path/marker/raw_data' regardless of this. +marker_in_filename = FALSE + +#CHANGE ME by uncommenting the appropriate primer pair +#for the Oomycete ITS1 marker: + Fprimer = "CGGAAGGATCATTACCAC" + Rprimer = "AGCCTAGACATCCACTGCTG" +#for the Oomycete ITS2 marker: + #Fprimer = "TTGAACGCAYATTGCACTT" + #Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Fungal ITS1 marker: + #Fprimer = "AAGTCGTAACAAGGTTTCC" + #Rprimer = "TGCVAGARCCAAGAGATC" +#for the Fungal ITS2 marker: + #Fprimer = "GTGARTCATCGAATCTTTG" + #Rprimer = "TCCTCCGCTTATTGATATGC" +#for the Bacterial 16S marker: + #Fprimer = "GTGYCAGCMGCCGCGGTAA" + #Rprimer = "GGACTACNVGGGTWTCTAAT" +#for the plant Sper marker: + #rc_Fprimer = "GGGCAATCCTGAGCCAA" + #rc_Rprimer = "ATTGAGTCTCTGCACCTATC" +#for the plant Trac marker: + #Fprimer = "CRTTGCCGAGAGTC" + #Rprimer = "AAGTCGTAACAAGG" +#for the nematode Nem marker + #Fprimer = "ACGGTYAGAACTAGGG" + #Rprimer = "GAGGGCAAGTCTGGTG" + +#### NO CHANGES NEEDED BELOW #### +#Check if Marker folder exists, otherwise create it +dir.create(file.path(path, marker)) + +#Check if nSamp is numeric or "all", otherwise set it to "all" +nSamp <- ifelse(is.character(nSamp) | is.numeric(nSamp), nSamp, "all") +nSamp <- ifelse(nSamp == "all" | is.numeric(nSamp), nSamp, "all") + +# Move raw reads to their own subdirectory path/marker/raw_data. If files are in +# path, check whether some contain 'marker' in name. If they exist only move +# those, otherwise move all. +raw_dir <- file.path(path, marker, "raw_data") +dir.create(raw_dir) +if (length(list.files(file.path(path, marker), + pattern = ".fastq.gz", full.names = TRUE)) != 0) { + fnraw <- sort(list.files(file.path(path, marker), + pattern = "fastq.gz", full.names = TRUE)) +} + +if (length(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) != 0) { + if (marker_in_filename) { + fnrawp <- sort(list.files( + path, paste0(marker,"(.*).fastq.gz"), full.names = TRUE)) + }else{ + fnrawp <- sort(list.files(path, pattern = ".fastq.gz", full.names = TRUE)) + } +} +if (exists("fnraw")) { + invisible(file.copy(fnraw, file.path(raw_dir, basename(fnraw)))) + invisible(file.remove(fnraw)) + rm(fnraw)} +if (exists("fnrawp")) { + invisible(file.copy(fnrawp, file.path(raw_dir, basename(fnrawp)))) + invisible(file.remove(fnrawp)) + rm(fnrawp)} + +#Create Lists of Forward and Reverse Filenames and a List of Sample Names +fnFs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "raw_data"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +allOrients <- function(primer) { + # Create all orientations of the input sequence + require(Biostrings) + # Biostrings works w/ DNAString objects rather than character vectors + dna <- DNAString(primer) + orients <- c(Forward = dna, + Complement = complement(dna), + Reverse = reverse(dna), + RevComp = reverseComplement(dna)) + return(sapply(orients, toString)) # Convert back to character vector +} + +FWD.orients <- allOrients(Fprimer) +REV.orients <- allOrients(Rprimer) + +# Put N-filtered files in filtN/ subdirectory +fnFs.filtN <- file.path(path, marker, "filtN", basename(fnFs)) +fnRs.filtN <- file.path(path, marker, "filtN", basename(fnRs)) + +filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE) + +fnFs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs.filtN <- sort(list.files(file.path(path, marker, "filtN"), + pattern = "_R2_001.fastq.gz", full.names = TRUE)) + +primerHits <- function(primer, fn) { + # Counts number of reads in which the primer is found + nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE) + return(sum(nhits > 0)) +} + +primerHits_perSample <- function(sampnum, fwd = fnFs, rev = fnRs){ + hit_table <- rbind(FWD_primer_R1_Reads = sapply( + FWD.orients, primerHits, fn = fwd[[sampnum]]), + FWD_primer_R2_Reads = sapply(FWD.orients, primerHits, fn = rev[[sampnum]]), + REV_primer_R1_Reads = sapply(REV.orients, primerHits, fn = fwd[[sampnum]]), + REV_primer_R2_Reads = sapply(REV.orients, primerHits, fn = rev[[sampnum]])) + hit_table <- as.data.frame(hit_table) + hit_table <- tibble::rownames_to_column(hit_table, "Primer_Read") + return(hit_table) +} + +nSamp <- ifelse(nSamp == "all", length(fnFs.filtN), nSamp) +nSamp <- ifelse(nSamp > length(fnFs.filtN), length(fnFs.filtN), nSamp) +Samps <- c(1:nSamp) + +primers_before <- lapply(Samps, primerHits_perSample, + fwd = fnFs.filtN, rev = fnRs.filtN) +names(primers_before) <- paste(basename(fnFs.filtN[1:nSamp]), + basename(fnRs.filtN[1:nSamp])) + +primers_before <- dplyr::bind_rows(primers_before, .id = "Sample") %>% + as_tibble() +write.table(primers_before, + file.path(path, marker, paste0("primers_before_cutadapt_", + nSamp, "Samples_", format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) +primers_before %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## Improved cutadapt + +More flexible implementation of the cutadapt processing, allows changing orientation of fwd and rev primers if the chunk above indicated that the orientation for one or both primers is not as expected. Expectation: FWD primer matches R1 in Forward orientation and/or R2 in RevComp orientation, REV primer matches R2 in Forward orientation and/or R1 in RevComp orientation. All other scenarios will require adjusting the primer orientation for cutadapt! This chunk offers a possibility to simply flip the FWD or REV primers separately if necessary but more complex issues will have to be resolved manually. + +Both, the new and old implementation of cutadapt wil NOT run successfully on a Windows system, even though the rest of the pipeline *can*. + +```{r cutadapt-improved, eval=FALSE} +# CHANGE ME to the path of your cutadapt installation. This can be found out by +# running 'which cutadapt' in the bash command line +cutadapt = "/home/simeon/miniconda3/envs/bioinfo/bin/cutadapt" + +# CHANGE ME to flip the FWD primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_FWD = FALSE + +#CHANGE ME to flip the REV primer. Should be FALSE if the orientation was as +# expected in the chunk above. +flip_REV = FALSE + +#CHANGE ME to define the minimum length of reads after trimming. Reads shorter +# than this value will be discarded and won't appear in the output files. The +# description of this parameter in the deprecated chunk was not correct! +MIN_LENGTH = 15 + +##SWR: define a variable for the number of mismatches to be used in cutadapt +#CHANGE ME to the rate of mismatches (0 =< value < 1) you want cutadapt to +# accept when finding the primer. When choosing a value different from 0, it is +# recommended to compare how many sequences get trimmed when using both, 0 and +# the alternative value. +MISMATCH = 0 + +#### NO CHANGES NEEDED BELOW #### + +# This should print the version of cutadapt you're using if the path is given +# correctly and system2 can access cutadapt. +# THIS WILL NOT WORK IN A WINDOWS SYSTEM! +system2(cutadapt, args = "--version") + +#Make output directory for trimmed reads in path/marker/trim +trim_dir <- file.path(path, marker, "trim") +dir.create(trim_dir) +fnFs_trim <- file.path(trim_dir, paste0(basename(fnFs.filtN), ".trim.fastq.gz")) +fnRs_trim <- file.path(trim_dir, paste0(basename(fnRs.filtN), ".trim.fastq.gz")) + +#Define primers and Revcomps, Flip if specified +F_fwd <- ifelse(flip_FWD, FWD.orients["RevComp"], FWD.orients["Forward"]) +F_rc <- ifelse(flip_FWD, FWD.orients["Forward"], FWD.orients["RevComp"]) +R_fwd <- ifelse(flip_REV, REV.orients["RevComp"], REV.orients["Forward"]) +R_rc <- ifelse(flip_REV, REV.orients["Forward"], REV.orients["RevComp"]) + +# Trim FWD and the reverse-complement of REV off of R1 (forward reads) +R1.flags <- paste("-g", F_fwd, "-a", R_rc) +# Trim REV and the reverse-complement of FWD off of R2 (reverse reads) +R2.flags <- paste("-G", R_fwd, "-A", F_rc) + +# Run Cutadapt +for (i in seq_along(fnFs.filtN)) { + # -n 2 required to remove FWD and REV from reads + system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, + "-m", MIN_LENGTH, "-e", MISMATCH, + # output files + "-o", fnFs_trim[i], "-p", fnRs_trim[i], + # input files + fnFs.filtN[i], fnRs.filtN[i]), + stdout = file.path(trim_dir, paste0(basename(fnFs.filtN[[i]]), + "_cutadapt_output.txt"))) +} + +nSamp <- ifelse(nSamp > length(fnFs_trim), length(fnFs_trim), nSamp) +Samps <- c(1:nSamp) + +primers_after <- lapply(Samps, primerHits_perSample, + fwd = fnFs_trim, rev = fnRs_trim) +names(primers_after) <- paste(basename(fnFs_trim[1:nSamp]), + basename(fnRs_trim[1:nSamp])) + +primers_after <- dplyr::bind_rows(primers_after, .id = "Sample") +write.table(primers_after, file.path(path, marker, paste0( + "primers_after_cutadapt_", nSamp, "Samples_", + format(Sys.time(), "%d-%m-%y_%H%M"), + ".txt")), row.names = FALSE) + +primers_after %>% + select(-Sample) %>% + group_by(Primer_Read) %>% + summarise_all(mean) +``` + +## assess run quality +looks at a random sample of R1/R2 trimmed reads to assess run quality. + +```{r, QC, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize=100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 +sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + +plts_trim <- lapply(c(sample(fnFL, sample_number), sample(fnRL, sample_number)), + plotQualityProfile) + +m_plts <- marrangeGrob(plts_trim, nrow = 2, ncol = 1) +ggsave(file.path(path, marker, "trim", "qual_plots_after_cutadapt.pdf"), + m_plts, width = 20, height = 26, unit = "cm", dpi = 300) + +print(plts_trim) +``` + +## set parameters for dada2 +This chunk contains several options for dada2 that will be used in the processing of the trimmed reads. The settings are saved to a file with a time stamp. + +```{r, set_params, eval=FALSE} +# CHANGE ME according to the quality of the sequencing run. For good quality +# runs, analysing both R1/R2 is desirable. For very poor quality runs, it may +# be worthwhile to analyse only R1. Select from the following: "R01" "both" +analysis = "both" + +# CHANGE ME use "TRUE" to assign taxonomy, "FALSE" to proceed without taxonomic +# assignments +assign_taxonomy = "FALSE" + +# CHANGE ME use "TRUE" to plot quality profiles for each sample, and "FALSE" to +# speed up analysis by skipping this step (plotting takes some time) and "SUB" +# to plot a subset of 10 samples +plotQC = "SUB" + +# CHANGE ME according to the quality of the sequencing run. This determines the +# maximum expected errors for R1 and R2 during the filtering step. A reasonably +# conservative threshold is (2,2). If the data is of lower quality, it may be +# worthwhile to run with higher EEs ex/ (3,5) +my_maxEEf = 2 +my_maxEEr = 2 + +# CHANGE ME according to the quality of the sequencing run. This is a PHRED score +# quality treshold - all sequences will be truncated at the first base with a +# quality score below this value +my_truncQ = 2 + +# CHANGE ME according to the quality of the sequencing run and according to the +# length of the target region. This is the length to cut the (forward,reverse) +# sequences at. Use 0,0 for no truncation. +my_truncLen = c(0, 0) + +# CHANGE ME to trim the left (5') part of all reads (forward, reverse) if a dip +# is observable in the quality plots +left_trim = c(10, 10) + +# CHANGE ME according to the marker used - this is the minimum length of the +# reads after trimming +my_minLen = 100 + +# CHANGE ME specify the minimum number of bases to overlap during merging, must +# be over 10! +my_minoverlap = 30 + +# CHANGE ME to specify the minimum confidence interval for RDP assignment of +# taxonomy. +my_minBootstrap = 80 + +# SWR: CHANGE ME to TRUE and give a value (number of bases) to collapse ASVs +# with an overlap of 'Collaps_minOverlap' (recommended to be the length of the +# shortest ASV that is identical to a longer one). This is only recommended to +# be used in cases where a number of ASVs occur that differ in length but are +# otherwise identical. If this frequently occurs, it may be an issue of primer +# trimming. +Collapse_overlapping = FALSE +Collapse_minOverlap = 200 + +parameters <- paste( + paste("Run on", format(Sys.time(), "%d/%m/%y %H:%M")), + paste("Path: ",path),"Samples:", + paste(c(sapply(strsplit(basename(sort( + list.files(file.path(path, marker, "trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE))), "_"), `[`, 1)), collapse = ";"), + paste("Analysis type:",analysis), + paste("Database type:",tax_database), + "\ndada2 Parameters\n", + paste("maximum expected errors R1:",my_maxEEf), + paste("maximum expected errors R2:",my_maxEEr), + paste("minimum length: ",my_minLen), + paste("truncate at first instance of Qscore: ",my_truncQ), + paste("truncation at length (0 = no trunc)", + paste(my_truncLen), collapse = ","), + paste("minimum overlap during merging: ",my_minoverlap), + paste("taxonomy bootstrap threshold: ",my_minBootstrap), + paste("Collapse overlapping ASVs: ", Collapse_overlapping), + paste("Minimum overlap if collapsing overlapping ASVs:", + Collapse_minOverlap), + paste("Trimming 5' (R1, R2):", + paste(left_trim, collapse = ",")), + sep = "\n") +write.table(parameters, file.path(path, marker, paste0("parameters_",analysis, + format(Sys.time(), "%d-%m-%y_%H%M") ,".txt")), + row.names = FALSE) + +``` + +## Processing chunk +#### DO NOT ALTER SCRIPT BEYOND THIS POINT!!! + +Comment SWR: This chunk now passes the output directory to a global variable on Linux, so that the shell script chunk that generates the match list can access it. + +```{r, dada2, eval=FALSE} +# Create lists of forward and reverse file names and a list of sample names +fnFs <- sort(list.files(file.path(path, marker,"trim"), + pattern = "_R1_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) +fnRs <- sort(list.files(file.path(path, marker, "trim"), + pattern = "_R2_001.fastq(.*)trim.fastq.gz", + full.names = TRUE)) + +# SWR: Make a list of fnF and fnR for quality plotting that excludes very small +# files (default is < 100 bytes) which may cause problems in quality plotting +# and exit the chunk with error +min_file_size <- function(files, minsize = 100){ + info <- file.info(files) + index_bigs <- which(info$size > 100) + keep <- files[index_bigs] +} + +fnFL <- min_file_size(fnFs) +fnRL <- min_file_size(fnRs) + +#Extract sample names +sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1) + +# begin commands for analysing R01/R02 together + +if (analysis == "both") { + + #Create paths for the filtered files in a subdirectory called filtered/ + filtFs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_F_filt.fastq")) + filtRs <- file.path(path,marker, "R01_R02", "filtered", + paste0(sample.names, "_R_filt.fastq")) + + outp <- file.path(path, marker, "R01_R02") + + #Filter sequences + ##SWR: fixed paired end pairing by adding matchIDs=TRUE + out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, + maxEE = c(my_maxEEf,my_maxEEr), truncQ = my_truncQ, + minLen = my_minLen, trimLeft = left_trim, truncLen = my_truncLen, + rm.phix = TRUE, compress = TRUE, multithread = TRUE, + matchIDs = TRUE) + head(out) + write.table(out, file.path(path, marker, "R01_R02", "out.txt")) + + if (plotQC == "TRUE") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL),plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(c(fnRL),plotQualityProfile) + print(plr) + + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + sample_number <- if (length(fnRL) < sample_number) length(fnRL) else sample_number + + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl,nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + plr <- lapply(sample(c(fnRL), sample_number), plotQualityProfile) + print(plr) + mlr <- marrangeGrob(plr, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_R02", "qualityplots_rev.pdf"), + mlr, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # dereplicate + derepFs <- derepFastq(filtFs[out[,2] > 0], verbose = TRUE) + derepRs <- derepFastq(filtRs[out[,2] > 0], verbose = TRUE) + + # Name the derep-class objects by the sample names + names(derepFs) <- sample.names[out[,2] > 0] + names(derepRs) <- sample.names[out[,2] > 0] + + #Train dada2 to your dataset + errF <- learnErrors(filtFs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + errR <- learnErrors(filtRs[out[,2] > 0], multithread = TRUE, nbases = 1e+09) + + ## Plot the Estimated Error Rates for the Transition Types + #check that model and data match reasonably well + eFplot <- plotErrors(errF, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + eRplot <- plotErrors(errR, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_R02", "R2_error_profile.pdf"), + eRplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # sample inference + dadaFs <- dada(derepFs, err = errF, multithread = TRUE) + dadaRs <- dada(derepRs, err = errR, multithread = TRUE) + + # merge forward and reverse reads + mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, + minOverlap = my_minoverlap, verbose = TRUE) + + # Inspect the merger data.frame from the first sample + head(mergers[[1]]) + merge.tab <- data.frame(matrix(ncol = 10, nrow = 0)) + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", "prefer", + "accept") + for (i in (1:length(mergers))) { + if (nrow(mergers[[i]]) > 0) { + sub <- cbind(names(mergers)[i], mergers[[i]]) + merge.tab <- rbind(merge.tab,sub) + } else {} + } + colnames(merge.tab) <- c("sample", "sequence", "abundance", "forward", + "reverse", "nmatch", "nmismatch", "nindel", + "prefer", "accept") + + write.table(merge.tab, file.path(path, marker, "R01_R02", "mergers.txt"), + row.names = FALSE) + + + # make sequence table and distribution table for sequence lengths + seqtab <- makeSequenceTable(mergers) + # use only sequences longer than 50 bp as the rdp classifier can't accept less + # than that + seqtab <- seqtab[,nchar(colnames(seqtab)) > 49] + + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtab <- collapseNoMismatch(seqtab, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + + dim(seqtab) + table(nchar(getSequences(seqtab))) + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtab, file.path(path, marker, "R01_R02", "seqtab.rds")) + + + # remove chimeric sequences + seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtab.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtab.nochim)/sum(seqtab) + write.table(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab.nochim.txt")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtab.nochim, file.path(path, marker, + "R01_R02", "seqtab_nochim.rds")) + + # output table for each sample + # define a function that counts sequences in a file/object + getN <- function(x) sum(getUniques(x)) + if (min(out[,2]) > 0) { + track <- cbind(out, sapply(dadaFs, getN), + sapply(dadaRs, getN), + sapply(mergers, getN), + rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1]*100) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- sample.names + } else { + track <- cbind(out[out[,2] > 0,], sapply(dadaFs, getN), sapply(dadaRs, getN), + sapply(mergers, getN), ... = rowSums(seqtab.nochim), + rowSums(seqtab.nochim)/out[,1][out[,2] > 0]*100) + + # If processing a single sample, remove the sapply calls: + # e.g. replace sapply(dadaFs, getN) with getN(dadaFs) + track <- rbind(track, cbind(out[,1][out[,2] == 0], + out[,2][out[,2] == 0], + NA,NA,NA,NA,NA)) + colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", + "merged", "nonchim","percent_retained") + rownames(track) <- c(sample.names[out[,2] > 0], sample.names[out[,2] == 0]) + } + + head(track) + write.table(track, file.path(path, marker, "R01_R02", "track.txt")) + + if (assign_taxonomy == "TRUE") { + taxa <- assignTaxonomy(seqtab.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxa[[1]]) == colnames(seqtab.nochim)) == nrow(taxa[[1]]) + tmp <- cbind(rownames(taxa[[1]]),taxa[[1]],taxa[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + tmp <- data.frame(tmp) + + write.table(tmp, file.path(path, marker, + "R01_R02", "taxonomy_ASVs_NC_R1R2.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxa$tax, file.path(path, marker, "R01_R02", "taxa.rds")) + + tmp2 <- cbind(t(seqtab.nochim), tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_R02", "seqtab.nochim_withtax.txt")) + + # write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", 1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + + # write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtab.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_withtax.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtab.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV",1:length(colnames(seqtab.nochim))) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_R02", "ASVs_raw.fasta"), + format = "fasta") + } +} else { + + + ###################################################################################### + # Forward Reads Only + ################################################################################### + + # Create paths for the filtered files in a subdirectory called filtered/ + + filtFsR1 <- file.path(path, marker, "R01_only", "filtered", + paste0(sample.names, "_F_filt.fastq")) + + outp <- file.path(path, marker, "R01_only") + + # Filter sequences + outR1 <- filterAndTrim(fnFs, filtFsR1, + maxEE = my_maxEEf, truncQ = my_truncQ, minLen = my_minLen, + trimLeft = left_trim[1], rm.phix = TRUE, + compress = TRUE, multithread = TRUE) + head(outR1) + write.table(outR1, file.path(path, marker, "R01_only", "out.txt")) + + if (plotQC == "TRUE") { + #Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + pl <- lapply(c(fnFL), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2,ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + if (plotQC == "SUB") { + # Plot Forward/Reverse Quality scores for each sample (not strictly + # necessary, time intensive) + sample_number <- if (length(fnFL) < 10) length(fnFL) else 10 + pl <- lapply(sample(c(fnFL), sample_number), plotQualityProfile) + print(pl) + ml <- marrangeGrob(pl, nrow = 2, ncol = 1) + ggsave(file.path(path, marker, "R01_only", "qualityplots_fwd.pdf"), + ml, width = 20, height = 26, unit = "cm", dpi = 300) + } + + # Train dada2 to your dataset + errFR1 <- learnErrors(filtFsR1[outR1[,2] > 0], multithread = TRUE) + eFplot <- plotErrors(errFR1, nominalQ = TRUE) + ggsave(file.path(path, marker, "R01_only", "R1_error_profile.pdf"), + eFplot, width = 20, height = 26, unit = "cm", dpi = 300) + + # dereplicate + derepFsR1 <- derepFastq(filtFsR1[outR1[,2] > 0], verbose = TRUE) + # Name the derep-class objects by the sample names + names(derepFsR1) <- sample.names[outR1[,2] > 0] + + # sample inference + dadaFsR1 <- dada(derepFsR1, err = errFR1, multithread = TRUE) + + + # make sequence table and distribution table for sequence lengths + seqtabR1 <- makeSequenceTable(dadaFsR1) + ## SWR: If Collapse_overlapping=TRUE, collapses identical sequences together + # based on an overlap specified in 'Collapse_minOverlap'. + if (Collapse_overlapping == TRUE) { + seqtabR1 <- collapseNoMismatch(seqtabR1, minOverlap = Collapse_minOverlap, + orderBy = "abundance", identicalOnly = FALSE, + vec = TRUE, verbose = FALSE) + } + dim(seqtabR1) + table(nchar(getSequences(seqtabR1))) + # RDP can't assign taxonomy to sequences shorter than 50bp, so filter to > 50bp + seqtabR1 <- seqtabR1[,nchar(colnames(seqtabR1)) > 49] + + ## SWR: save RDS format seqtab to allow for merging of multiple runs and later + # analysis without rerunning the whole pipeline + saveRDS(seqtabR1, file.path(path, marker, "R01_only", "seqtab.rds")) + + # remove chimeric sequences + seqtabR1.nochim <- removeBimeraDenovo(seqtabR1, method = "consensus", + multithread = TRUE, verbose = TRUE) + dim(seqtabR1.nochim) + # calculate percent of sequences that are non-chimeric + sum(seqtabR1.nochim)/sum(seqtabR1) + write.table(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtabR1.nochim")) + + ## SWR: additional .RDS output table for easy downstream use in e.g. phyloseq + saveRDS(seqtabR1.nochim, file.path(path, marker, + "R01_only", "seqtab_nochim.rds")) + + # output table for each sample + getN <- function(x) sum(getUniques(x)) + if (min(outR1[,2]) > 0) { + trackR1 <- cbind(outR1, sapply(dadaFsR1,getN), rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1]*100) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names) + } else { + trackR1 <- cbind(outR1[outR1[,2] > 0,], sapply(dadaFsR1, getN), + rowSums(seqtabR1.nochim), + rowSums(seqtabR1.nochim)/outR1[,1][outR1[,2] > 0]*100) + trackR1 <- rbind(trackR1,cbind(outR1[,1][outR1[,2] == 0], + outR1[,2][outR1[,2] == 0], + NA,NA,NA)) + colnames(trackR1) <- c("input", "filtered", "denoisedF", + "nonchim","percent_retained") + rownames(trackR1) <- c(sample.names[!outR1[,2] == 0], + sample.names[outR1[,2] == 0]) + } + + head(trackR1) + write.table(trackR1, file.path(path, marker, "R01_only", "trackR1.txt")) + + if (assign_taxonomy == "TRUE") { + taxaR1 <- assignTaxonomy(seqtabR1.nochim, tax_database, multithread = TRUE, + minBoot = my_minBootstrap, outputBootstraps = TRUE) + sum(rownames(taxaR1[[1]]) == colnames(seqtabR1.nochim)) == nrow(taxaR1[[1]]) + tmp <- cbind(rownames(taxaR1[[1]]), taxaR1[[1]], taxaR1[[2]]) + rownames(tmp) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + tmp <- data.frame(tmp) + write.table(tmp, file.path(path, marker, + "R01_only", "taxonomy_ASVs_NC_R1.txt")) + + ## SWR: save RDS file omitting bootstrap values for downstream use in + # e.g phyloseq + saveRDS(taxaR1$tax, file.path(path, marker, "R01_only", "taxa.rds")) + + tmp2 <- cbind(t(seqtabR1.nochim),tmp) + rownames(tmp2) <- paste0("ASV", 1:length(colnames(seqtabR1.nochim))) + write.table(tmp2, file.path(path, marker, + "R01_only", "seqtabR1.nochim_withtax.txt")) + + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + + #write ASVs with taxonomy in the header to a fasta file + seqsnochim <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochim@ranges@NAMES <- paste0("ASV", + 1:length(colnames(seqtabR1.nochim)), + "|","|", + paste(tmp$Kingdom, tmp$Phylum, tmp$Class, + tmp$Order, tmp$Family, tmp$Genus, + tmp$Species, sep = ";")) + writeXStringSet(seqsnochim, file.path(path, marker, + "R01_only", "ASVs_withtax_R1.fasta"), + format = "fasta") + } else { + #write raw ASVs to a fasta file + seqsnochimR1 <- DNAStringSet(colnames(seqtabR1.nochim)) + seqsnochimR1@ranges@NAMES <- paste0("ASV",1:length( + colnames(seqtabR1.nochim))) + writeXStringSet(seqsnochimR1, file.path(path, marker, + "R01_only", "ASVs_raw.fasta"), + format = "fasta") + } +} + +#make paths available to subsequent chunks +dada2_dir <- outp +#If Linux, make 'dada2_folder'available globally. + if (os == "Linux") { + Sys.setenv(dada2Folder = dada2_dir) + } +``` \ No newline at end of file