Question

Generate an amplicon sequence variant table without quality data using dada2 or similar in R

6.0 years ago

caverill ▴ 40

I have sequences downloaded from MG-RAST that have already had primers removed, been quality filtered, and had forward and reverse reads paired. Each file is a FASTA file of sequences. Here is some R code to download an example sequence file:

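# Directory to save the example file in (placeholder path).
cur.dir <- 'path/to/test_dir'
mgm.id <- 'mgm4788103.3'
# Build the download URL from the sample ID.
mgrast.link <- paste0('http://api.metagenomics.anl.gov/1/download/', mgm.id, '?file=050.1')
out.file <- file.path(cur.dir, paste0(mgm.id, '.fasta'))
# Quote the URL and output path so the shell passes them through unchanged.
cmd <- paste('curl', shQuote(mgrast.link), '>', shQuote(out.file))
system(cmd)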

I would like to turn this FASTA file into a dereplicated amplicon sequence variant (ASV) table in which the row names are the sample names (in this case there is only one sample, mgm4788103.3), the column names are the unique sequences in the .fasta file, and each entry is the number of times that unique sequence was observed in that sample. This is the format returned by the makeSequenceTable function of the dada2 package (a tutorial can be found here). Unfortunately, generating this table in dada2 requires me to start from unpaired reads that have not been quality filtered, which I don't have.

How can I generate this table from this .fasta file, or a directory of similar .fasta files?
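
To make the target format concrete, here is a minimal base R sketch of the table described above: one row per sample, one column per unique sequence, and counts as entries. The helper names read_fasta_seqs and fasta_to_seqtab are hypothetical and not part of dada2, and the sketch assumes each sample is one .fasta file named after its sample ID. It only counts exact duplicate sequences; it does not apply dada2's error model, so the columns are unique sequences rather than denoised ASVs.

```r
# Hypothetical helper (not part of dada2): read one FASTA file and return
# its sequences as a character vector, joining multi-line records.
read_fasta_seqs <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]             # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)               # record index for every line
  seqs <- tapply(lines[!is_header], record[!is_header], paste0, collapse = "")
  toupper(unname(as.character(seqs)))
}

# Hypothetical helper: build a samples-by-sequences integer count matrix.
# Assumption: sample names are the file names without the .fasta extension.
fasta_to_seqtab <- function(paths,
                            sample_names = sub("\\.fasta$", "", basename(paths))) {
  # Count exact duplicates within each sample.
  counts <- lapply(paths, function(p) table(read_fasta_seqs(p)))
  all_seqs <- unique(unlist(lapply(counts, names)))
  seqtab <- matrix(0L, nrow = length(paths), ncol = length(all_seqs),
                   dimnames = list(sample_names, all_seqs))
  for (i in seq_along(counts)) {
    seqtab[i, names(counts[[i]])] <- as.integer(counts[[i]])
  }
  seqtab
}

# Usage on a directory of similar files (path is a placeholder):
# fasta.files <- list.files('path/to/test_dir', pattern = '\\.fasta$', full.names = TRUE)
# seqtab <- fasta_to_seqtab(fasta.files)
```

Sequences absent from a sample get a count of 0, so files with different sets of unique sequences can share one table.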

R dada2 amplicon sequence data • 2.2k views
