Question

error running bbduk to trim primers

0

Entering edit mode

5.3 years ago

caverill ▴ 40

I have paired end illumina miseq 16S amplicon reads. There are forward and reverse reads in separate files, _1.fastq, _2.fastq. Barcodes and adapters have already been removed. Original authors multiplexed 16S reads with other primers, so my file contains 16S sequences, as well as a bunch of non-target sequences. I would like to:

Subset to 16S sequences only.
Remove 16S primers.

The SRR number for an example sample is SRR7204977, which can be found here.

I usually use bbduk to handle this type of problem, however I am getting a strange outcome. Code below:

/path/to/bbmap/bbduk.sh literal=GGGTTNCGNTCGTTG ktrim=l k=10  in1=fastq1.path in2=fastq2.path out1=output1.path out2=output2.path ordered=t

This keeps returning the error:

******  WARNING! A KMER OPERATION WAS CHOSEN BUT NO KMERS WERE LOADED.  ******
******  PLEASE ENSURE K IS LESS THAN OR EQUAL TO REF SEQUENCE LENGTHS.  ******

Exception in thread "main" java.lang.AssertionError: You can bypass this assertion with the -da flag.
at jgi.BBDukF.process2(BBDukF.java:1007)
at jgi.BBDukF.process(BBDukF.java:924)
at jgi.BBDukF.main(BBDukF.java:70)

I have verified that many sequences in fastq1.path do indeed have this primer sequence. The smallest read length in the sequence file is 35bp, largest 301.

Any suggestions on how to (1) filter to only reads with this primer present and (2) remove the primer would be greatly appreciated!

bbduk primer trimming • 2.7k views

ADD COMMENT • link 5.3 years ago by caverill ▴ 40

1

Entering edit mode

I notice that you are using N's in the literal directive. You may want to turn this directive on

copyundefined=f     (cu) Process non-AGCT IUPAC reference bases by making all
                    possible unambiguous copies.  Intended for short motifs
                    or adapter barcodes, as time/memory use is exponential.

ADD REPLY • link 5.3 years ago by GenoMax 141k

0

Entering edit mode

Wow. I knew bbduk could handle N characters, but I guess I've always passed on all possible combinations "manually". Thanks so much.

A follow up question, using a simple grep on just the basepair lines, I know that 16,895 sequences in this file match the primer, out of 71,171. However, only 5,554 sequences are dropped. These are mostly super low quality and short reads. Any thoughts on how I can only retain reads where the forward read (_1.fastq) matches the primer?

ADD REPLY • link 5.3 years ago by caverill ▴ 40

1

Entering edit mode

However, only 5,554 sequences are dropped.

Make the maxlength= parameter more stringent and see if that helps.

Any thoughts on how I can only retain reads where the forward read (_1.fastq) matches the primer?

bbduk by default runs in filter mode so if you use outm= (without ktrim= or qtrim= options) it should capture reads from R1 that have the primer. Then you can run repair.sh with resulting file to fish out reads that match from R2 file.

ADD REPLY • link 5.3 years ago by GenoMax 141k

0

Entering edit mode

From the bbduk guide: there is certainly a way to do this, however I was wondering if it was possible to combine it into one line:

Matching degenerate sequences such as primers: bbduk.sh in=reads.fq out=matching.fq literal=ACGTTNNNNNGTC copyundefined k=13 mm=f

This will clone the reference sequences to represent every possibility for the degenerate bases (Ns and other non-ACGT IUPAC symbols). For example, this would create ACGTTAAAAAGTC, ACGTTAAAACGTC, ACGTTAAAAGGTC, and so forth (all 1024 possibilities). If you are interested in seaching for new life by mining shotgun metagenomic reads for 16s sequences that do not quite match your primers… this (and hdist) might be a good place to start! But it’s also useful for adapters with barcodes.

ADD REPLY • link 5.3 years ago by caverill ▴ 40