Question

Trim Illumina PE data with trimmomatic

0

Entering edit mode

8.7 years ago

mschmid ▴ 180

I have Illumina MiSeq PE 2X250 data

I got the following info from the sequencing (some irrelevant or private data omitted):

[Header]
IEMFileVersion,4
Investigator Name,xxxx
Experiment Name,xxxxxx
Date,x/x/xxxx
Workflow,Assembly
Application,Assembly
Assay,TruSeq HT
Description,
Chemistry,Amplicon

[Reads]
301
301

[Settings]
ReverseComplement,0
kmer,31
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,GenomeFolder,Sample_Project,Description
1,AB1,,,D701,ATTACTCG,D501,TATAGCCT,,,
2,AB2,,,D701,ATTACTCG,D502,ATAGAGGC,,,
3,AB3,,,D701,ATTACTCG,D503,CCTATCCT,,,
4,AB4,,,D701,ATTACTCG,D504,GGCTCTGA,,,
5,AB5,,,D701,ATTACTCG,D505,AGGCGAAG,,,
6,AB6,,,D701,ATTACTCG,D506,TAATCTTA,,,
7,AB7,,,D701,ATTACTCG,D507,CAGGACGT,,,

I want to trim with timmomatic and then assemble with Spades (and maybe other assemblers). I do not have the data yet. So I do not know if the adapters and indexes are trimmed away. But let's assume they are still there.

Now some questions:

Is it necessary to add the indexes to the adapter file of trimmomatic (so that they get removed as well)? Or do the short sequences in general not interfere with assembly? Or could it even happen that trimmomatic finds "false positives", because those sequences are so short?
Even if the adapters and indexes would have been trimmed away would it still be necessary to provide the sequences because of read-throughs? (For example R1 reading through adaptor of R2)
How do I have to configure the adapter file of trimmomatic? For which sequences do I have to provide the reverse-complement? And what about the /1 and /2 option of trimmomatic? I did not really understand.

trimming sequencing paired-end illumina • 5.0k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.7 years ago by mschmid ▴ 180

Ram · Answer 1 · 2015-08-20

3

Entering edit mode

8.7 years ago

Brian Bushnell 20k

I like to include the indexes, for maximal sensitivity, but it's not strictly necessary in general. BBDuk is distributed with a file "adapters.fa" that includes all the standard Illumina Truseq and Nextera adapter sequences, including the indexes. Adapter trimming prior to assembly is always a good idea, as it increases speed, reduces memory consumption, and can lead to a better assembly (especially with RNA-seq, single cell, or metagenomes, which have highly variable coverage).
No, that's what trimming removes.
I'm not sure about Trimmomatic, but BBDuk looks for both the forward sequence and reverse-complement by default. For paired reads I recommend running it like this:
```
bbduk.sh in=read1.fq in2=read2.fq out=trimmed1.fq out2=trimmed2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe
```

Though as long as the names are in the proper format (identical except for the number 1 and 2) it's shorter to do this:

bbduk.sh in=read#.fq out=trimmed#.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi Brain, this is an old post, I'm not sure if you can still see this, but I'm going to try my luck! I wanted to know what would happen if I omit k=23 mink=11 parameters in PE adapter trimming process, so I tried both:

with both parameters

Added 216529 kmers; time:   0.399 seconds.
Memory: max=47667m, free=43439m, used=4228m
Input is being processed as paired
Started output streams: 2.320 seconds.
Processing time:        88.703 seconds.
Input:                      29908174 reads      4516134274 bases.
KTrimmed:                   6290478 reads (21.03%)  386234224 bases (8.55%)
Trimmed by overlap:         648240 reads (2.17%)    4720504 bases (0.10%)
Total Removed:              6396 reads (0.02%)  390954728 bases (8.66%)
Result:                     29901778 reads (99.98%)     4125179546 bases (91.34%)
Time:               91.444 seconds.
Reads Processed:      29908k    327.06k reads/sec
Bases Processed:       4516m    49.39m bases/sec

without the parameters

Added 229571 kmers; time:   0.143 seconds.
Memory: max=47941m, free=43188m, used=4753m
Input is being processed as paired
Started output streams: 0.015 seconds.
Processing time:        89.504 seconds.
Input:                      29908174 reads      4516134274 bases.
KTrimmed:                   5237954 reads (17.51%)  366522888 bases (8.12%)
Trimmed by overlap:         1650678 reads (5.52%)   23351468 bases (0.52%)
Total Removed:              5208 reads (0.02%)  389874356 bases (8.63%)
Result:                     29902966 reads (99.98%)     4126259918 bases (91.37%)
Time:               89.684 seconds.
Reads Processed:      29908k    333.48k reads/sec
Bases Processed:       4516m    50.36m bases/sec

It seems like with the parameters more bases get trimmed off. Let's say at this point I only want to remove adapter sequence, is it safe to leave those parameters out?

Thank you!

Yang

ADD REPLY • link 6.7 years ago by zhangya3 • 0

1

Entering edit mode

Hi Yang,

If you don't specify "k=23 mink=11" it will use the default of "k=27" and no mink. "k=23 mink=11" is more sensitive, so I recommend that. Since you have PE reads, most of the time, even when adapters are not detected from kmers, they will still be detected based on overlap, which is why the overall amount of data removed (8.66% of bases compared to 8.63%) only changed very slightly.

But I do recommend keeping "k=23 mink=11" because the reads that don't get trimmed when you leave those parameters off almost certainly do have adapters, that should get trimmed. In other words, the 0.03% additional bases you gain by removing those flags are virtually all adapter sequence that you don't want.

ADD REPLY • link 6.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi Brain,

Thank you for the explanation, very helpful.

There is something else I'm confused about. The workflow I tried was:

bbmerge-auto.sh in1=read1.fastq in2=read2.fastq out=merged.fastq outu=unmerged.fastq ihist=ihist.txt ecct extend2=20 iteration=5 to merge non-overlapping PE reads (read is 151 bp, fragment size is larger than 400), bbmerge.sh in=merged.fastq outa=adapters.fa to discover adapter sequences, at the end use bbduk.sh to filter low quality reads. What I found was after bbmerge-auto.sh, I couldn't detect any adapter sequences in the merged file (instead, they were all in unmerged file).

the total size of read 1 and read 2 are 11 Gb, after merging, the merged file is 5.1 Gb. Is this 5.9 Gb size reduction expected? what does that say about my data?
when I moved on using bbduk, nothing got trimmed/filtered if I set Q=10, should I increase Q to 20? What is the difference between quality trimming and quality filtering?

Look forward hearing back from you!

Yang

ADD REPLY • link 6.7 years ago by zhangya3 • 0