I am trimming primers off of some .fastq files before processing them through the dada2
pipeline. dada2
wants primers already removed, but also wants quality scores. To trim primers I am using kmer trimming via bbduk
and the following command:
bbmap/bbduk.sh in=input.fastq out=output.fastq literal=GTGYCAGCMGCCGCGGTAA ktrim=l k=10
The head of the input file looks like this:
@ERR1873185.1 AV103___MISEQ:377:000000000-ABUYP:1:1101:15575:1896/1
GTGTCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGTGTAGGCGGGTCTTTTATTCAGGGGGGAAATGCCCAGGCTCAACCTTGGAACTGCCTTTGATACTGGAGATCTTGAGTCCGGGGGGGGTGAGTGGATTTTCGAGTGTAGGGGTGAAATTCGTAGATATTCGCAAGAAAACCAGTGGCGAAGGCGGCTCCCTGGCCCGGGACTGACGCTGGGACGCGAAAGCGGGGGGGGCAACGAGGCTTAGCACCCCCTGTGG
+
4FFGGGFFAFGH?EE2BEGHHGHG2EGGGG?EGHF?FA0FHFGGGCFE4FFGF1?/EEDHHHDE/>/EFD3B@////0<11111<<1?F@@--;A9/B99:AE?FFEBF.//9///BF//99B//9999;//9;/9/:B////;FD?B----@-9../B.B/;B/;9.:/9//;/;..-;BB/;;.99..//99/.9..-.//.;.../:;.-;-9..@@>-9.;./.;A9B---;A./9....-;A.9@=;--./.-;==----;A.-..--.//////;:---.//.
@ERR1873185.2 AV103___MISEQ:377:000000000-ABUYP:1:1101:17496:1920/1
GTGCCAGCCGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCCCGTAGGCGGGCTGTCAAGTCTGGGGTGAAATCCCGCGGCTCAACCGTGGAACTGCCTTAGATACTGACGGCCTCGAGGGAGGTAGGGGCGAGCGGAACTGTGAGTGGAGCGGTGAAATGCGTTGATATTCACAGGAACTCCGGTGGCGAAGGCGGCTCGCTGGCCCTCTTCTGACGCTGATGCGCGAAAGCTAGGGGAGCAAACGGGATTAGATACCCTGGTAG
+
EG5FGG4FFEEG2E?2E?FDBFEEGEGEFHEEE0EEHF1FGHFCEEFHGHH4B1FEFBBGGH?F/>/EEAEBECGGGGHHBBBGHHFFHCGCHHHHHHHFGG?D@FHGHGBAEGGGGGGGGGG0BCBFGFGBFD@BBFFFAA->.A-.//.AD?BB-;??9BBFBBB/;FFFFFFB<.;9FFF/BD..99BFFFFBFBBEFFFFFFFBBB.9;>B??A>;>9A--..-.AEFFFFFFF/B?DFD;FFF;.9-@-ABF/F???A-:;FFFF...;B//;//9F99BEFF:
The head of the output file looks like this:
>ERR1873185.3001 AV103___MISEQ:377:000000000-ABUYP:1:1101:6839:18042/1
TACGAAGGGGGCTAGCGTTGCTCGGAATCACTGGGCGTAAAGCGCACGTAGGCGGCTTTTTAAGTCAGGG
GTGAAATCCTGGAGCTCAACTCCAGAACTGCCTTTGATACTGGGAAGCTTGAGTCCGGGAGAGGTGAGTG
GAACTGCGAGTGTAGAGGTGAAATTCGTAGATATTCGCAAGAACACCAGTGGCGAAGGCGGCTCACTGGC
CCGGTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCTAGTAG
>ERR1873185.3002 AV103___MISEQ:377:000000000-ABUYP:1:1101:13081:18045/1
CACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGAGTAAAGAGCTCGTAGGCGGTCCGTCACGTCTGTT
GTGAAAATCCAGGGCTCAACCCTGGACTTGCTGCGGATACGGGCGGACTAGAGGTAGGTAGGGGAGAATG
GAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGG
CCTTACCTGCCGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGAAACCCGTGTAG
how can I tell bbduk
to just trim the primer sequence and retain the quality scores? Furthermore, The output file sequences are in a different order, anyway to have them stay in the same order? I'm worried my current bbduk
call is also quality filtering/discarding which I do not want. Its difficult to tell from the file line counts because the sequences in the output of bbduk
take up multiple lines in the output, rather than a single line.
This helps! Yes, my output line originally was
seqs.fna
, I changed this when i was shortening the filepaths for this example, I didn't realizebbduk
would use that as input to the function.Where can I find that nuance in the
bbduk
documentation? I don't see it here: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/It is up some items: Usage Guide.
Also-
bbduk
can handle other degenerate base names thanN
. Frombbduk
documentation:One last question- the number of lines in the output file is fewer than the number of lines in the input file, implying some sequences have been removed. Specifically what are the criteria for removing a sequence from the output? As far as I can understand from the documentation this command trims primers, I have no idea under what conditions it removes sequences- if the k-mer sequence is not found within a read? something else?
If you had a primer dimer in the read then that would be almost certainly be thrown out. Other possibility is
minlen=10
by default. If your read becomes shorter 10 bp after it gets trimmed then it will be removed. If a k-mer is not found in the read nothing should happen to it.