Question

Minimum length of reads after trimming for Assembly

1

7.8 years ago

Ric ▴ 430

Hello, my Illumina paired-end reads (version 1.9) vary in length between 35 and 151 bp. I noticed that the forward and reverse reads do not have the same length. Here are the FastQC quality plots for R1 and R2.

Is the following Trimmomatic command a good fit for the QC plots above, and can I use the trimmed reads for assembly?

java -jar /programs/trimmomatic/trimmomatic-0.32.jar PE -phred33 paired_end_reads_1.fastq paired_end_reads_2.fastq kept_paired_end_reads_1.fastq kept_paired_end_reads_2.fastq unpaired_1.fastq unpaired_2.fastq SLIDINGWINDOW:4:15 MINLEN:65

Thank you in advance.

Mic

Assembly sequencing qc fastqc Trimmomatic • 6.4k views

ADD COMMENT • link 7.8 years ago by Ric ▴ 430

0

If you want optimal results, I suggest you start over with the raw data. Your reads have probably already been adapter-trimmed by Illumina's software, which tends to be mediocre, and is the reason the reads have different lengths. Since your reads are all 2x151bp, correct adapter-trimming will always leave R1 and R2 exactly the same length. As for quality-trimming, sliding-window-based trimming is also not optimal. There exists an optimal quality-trimming algorithm, which I'll call the "Phred algorithm", and it is implemented in seqtk and BBDuk.
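As a rough illustration of what such an algorithm can look like, here is a minimal Python sketch of one common formulation (often described as a modified Mott algorithm): each base scores the threshold error probability minus its own error probability, and the read is trimmed to the contiguous region with the highest total score. This is an assumption-laden sketch, not the actual seqtk or BBDuk source; the function name `phred_trim` and its `trimq` parameter are hypothetical and only mirror the `trimq=15` idea from the command in this thread.

```python
def phred_trim(quals, trimq=15):
    """Return (start, end) of the region to keep, with end exclusive.

    quals: list of integer Phred quality scores for one read.
    trimq: quality threshold (hypothetical parameter for this sketch).
    """
    cutoff = 10 ** (-trimq / 10)  # error probability at the threshold
    best_sum, best_start, best_end = 0.0, 0, 0
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        # Positive score when the base is better than the threshold.
        run_sum += cutoff - 10 ** (-q / 10)
        if run_sum <= 0:
            # The current run is not worth keeping; restart after this base.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end


# Low-quality ends are removed, the good middle is kept.
print(phred_trim([2, 2, 30, 30, 30, 30, 2, 2]))  # (2, 6)
```

Unlike a fixed sliding window, this scoring lets a single poor base inside a long high-quality stretch survive, because the surrounding good bases outweigh it.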

The minimum length after trimming is entirely at your discretion. I'd recommend setting it at the kmer length you plan to use for assembly.
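The reasoning behind tying the minimum length to the k-mer size can be sketched in a few lines of Python: a read of length L yields L - k + 1 k-mers, so a read shorter than k contributes nothing to a k-mer-based assembler. The helper names `kmers_in_read` and `filter_pairs` are hypothetical, and dropping a whole pair when either mate is too short is a simplifying assumption (real trimmers can also write the surviving mate to an unpaired file).

```python
def kmers_in_read(read_len, k):
    """Number of k-mers a read of the given length can contribute."""
    return max(0, read_len - k + 1)


def filter_pairs(pairs, k):
    """Keep only pairs in which both mates are at least k bases long."""
    return [(r1, r2) for r1, r2 in pairs if len(r1) >= k and len(r2) >= k]


# A 35 bp read still yields a few 31-mers, but none at k = 55.
print(kmers_in_read(35, 31))  # 5
print(kmers_in_read(35, 55))  # 0
```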

If you can obtain the raw reads, I suggest you trim with BBDuk using this command:

bbduk.sh in1=r1.fq in2=r2.fq out1=trimmed1.fq out2=trimmed2.fq ktrim=r k=23 mink=11 hdist=1 tpe tbo ref=adapters.fa qtrim=rl trimq=15

That will do both adapter and quality trimming. "adapters.fa" is included with the BBMap package and contains all public Illumina adapter sequences.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k

0

How do I determine the k-mer value for ABySS or SPAdes?

ADD REPLY • link 7.8 years ago by Ric ▴ 430

0

Adapter trimming is almost always preferred, but don't apply quality trimming for assembly.

ADD REPLY • link 7.8 years ago by lh3 33k

0

I agree in principle, but depending on the genome size, data quality, and assembler, quality trimming can sometimes make the difference between generating an assembly and running out of memory and crashing.

For small genomes that fit in memory with no problem it's true that quality-trimming is unnecessary and can cause inferior assemblies. It depends on how the assembler processes quality scores, though.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k

1

There are a few papers on this topic. I have also tried quality trimming myself. In all these cases, quality trimming hurts de novo assembly. That said, it is in theory possible that some combination of trimmer/assembler may produce better results.

ADD REPLY • link 7.8 years ago by lh3 33k

1

There are also papers that state the opposite. For example (from http://bioinformatics.oxfordjournals.org/content/30/19/2709.full ):

we observed that quality-based trimming of raw data gave ∼15-fold improvements in N50 statistics

It really depends on the data and the assembler. As with everything else assembly-related, it seems the best strategy is "try a bunch of options and see what works best for you".
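When comparing runs this way, N50 is one of the statistics people usually look at. A minimal Python sketch of the standard definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length) might look like this. The function name `n50` is just illustrative, and N50 alone does not tell you whether an assembly is correct.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0


# Total length is 54; the three longest contigs (10 + 9 + 8) reach 27.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```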

ADD REPLY • link 7.8 years ago by igor 13k

0

Thanks. I didn't know this paper. However, "15-fold" looks really suspicious. I wanted to know how this 15-fold figure was derived, but the paper gives me little context. On the trimming strategy, the paper cites a 2012 paper by the same group, which only mentions the CLC suite and FastQC without details. The paper also cites the GAGE paper, but as far as I remember GAGE does not discuss trimming. In addition, the paper does not say which assembler is this sensitive to quality-based trimming, let alone give detailed statistics. I don't know how much I should trust this paper.

ADD REPLY • link 7.8 years ago by lh3 33k

0

I agree it looks questionable. I didn't realize there were papers discussing trimming strategies (I don't have a lot of assembly experience, but the studies I saw generally focus on the assembly tools rather than pre-processing). After seeing your comment, I decided to investigate further and that just happened to be the first hit.

ADD REPLY • link 7.8 years ago by igor 13k

0

I would like to try ABySS and SPAdes to assemble the above reads. Is it a good idea to use FastUniq to remove duplicates before assembly?

ADD REPLY • link 7.8 years ago by Ric ▴ 430

0

Probably not. Duplicate-removal is only useful for amplified libraries, and is mainly for variant-calling when resequencing. If your library was not amplified, do not remove duplicates.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k