I have paired end illumina miseq 16S amplicon reads. There are forward and reverse reads in separate files, _1.fastq
, _2.fastq
. Barcodes and adapters have already been removed. Original authors multiplexed 16S reads with other primers, so my file contains 16S sequences, as well as a bunch of non-target sequences. I would like to:
- Subset to 16S sequences only.
- Remove 16S primers.
The SRR number for an example sample is SRR7204977, which can be found here.
I usually use bbduk to handle this type of problem, however I am getting a strange outcome. Code below:
/path/to/bbmap/bbduk.sh literal=GGGTTNCGNTCGTTG ktrim=l k=10 in1=fastq1.path in2=fastq2.path out1=output1.path out2=output2.path ordered=t
This keeps returning the error:
****** WARNING! A KMER OPERATION WAS CHOSEN BUT NO KMERS WERE LOADED. ******
****** PLEASE ENSURE K IS LESS THAN OR EQUAL TO REF SEQUENCE LENGTHS. ******
Exception in thread "main" java.lang.AssertionError: You can bypass this assertion with the -da flag.
at jgi.BBDukF.process2(BBDukF.java:1007)
at jgi.BBDukF.process(BBDukF.java:924)
at jgi.BBDukF.main(BBDukF.java:70)
I have verified that many sequences in fastq1.path do indeed have this primer sequence. The smallest read length in the sequence file is 35bp, largest 301.
Any suggestions on how to (1) filter to only reads with this primer present and (2) remove the primer would be greatly appreciated!
I notice that you are using N's in the literal directive. You may want to turn this directive on
Wow. I knew bbduk could handle
N
characters, but I guess I've always passed on all possible combinations "manually". Thanks so much.A follow up question, using a simple grep on just the basepair lines, I know that 16,895 sequences in this file match the primer, out of 71,171. However, only 5,554 sequences are dropped. These are mostly super low quality and short reads. Any thoughts on how I can only retain reads where the forward read (
_1.fastq
) matches the primer?Make the
maxlength=
parameter more stringent and see if that helps.bbduk
by default runs in filter mode so if you useoutm=
(withoutktrim= or qtrim=
options) it should capture reads from R1 that have the primer. Then you can runrepair.sh
with resulting file to fish out reads that match from R2 file.From the bbduk guide: there is certainly a way to do this, however I was wondering if it was possible to combine it into one line: