Which Phred value to use in trimming
2
0
Entering edit mode
9.3 years ago
Zammy ▴ 20

Hello everyone

I have performed quality analysis of RAW miRNA-seq data using FastQC. As a result, I got the following output and the encoding according to FastQC appears to be illumina v1.5 encoding. I next want to perform adaptor removal and trimming of low quality reads. Based on this, could anyone please tell me which is the Phred score value to use in trimming step in tools such as fastx toolkit or cutadapt or any other better tool?

Regards
Zammy

microrna next-generation-sequencing • 12k views
ADD COMMENT
3
Entering edit mode
9.3 years ago

The best tools for quality-trimming are BBDuk and seqtk, which use the optimal Phred algorithm. Also, they're both fast. But whether you should quality-trim, and what the threshold should be, depends on your intended use of the data and the data quality (which in your case looks quite high). Often a level of ~10 is pretty good for maximizing things like alignment rates. I do not recommend fastx toolkit, particularly if you are using paired reads.

BBDuk can additionally do adapter-trimming at the same time. Unlike quality-trimming, which is sometimes good and sometimes unnecessary, adapter-trimming is almost universally a good idea.

ADD COMMENT
1
Entering edit mode

Thank you for your reply Brian

I have seen another post where you suggested BBDuk. Well it makes sense as you are the developer of it and the comparative analysis you performed between different tools also favors BBDuk :). I don't have access to linux machine right now so I hope it will run on windows 7. Could you please tell me how to run BBDuk in windows as it is stated on the download page that it is platform independent. The intended use of data is to perform differential miRNA expression analysis between two groups. So the next step after adapter removal is to perform mapping to hg19 and miRBase. Its not paired-end data

For the sake of my understanding, could you please explain if I have to define Phred score in trimming step, what would that be according to above figures? This is what I am trying to figure out. Someone suggested me to use 33 but I am not sure how exactly it works as in the logic behind.

Regards

ADD REPLY
1
Entering edit mode

Zammy,

The ASCII offset for modern Illumina reads is 33, meaning a base with Q0 will have ASCII code 33 (!) and are generally Ns; there is no lower quality score (except in some of Illumina's old formats when they used negative values, just to be different). A normal Illumina run currently produces Q0 for N, Q2 for bases that the machine thinks are junk, and 5-41 for real qualities of real bases. Trimming to Q10 means you discard bases that the machine thinks have a greater than 10% chance of being wrong; Q20 means greater than 1%; and Q30 means greater than 0.1%. High levels throw away useful data, and low levels retain more errors. If you are doing something highly tolerant of errors, quality-trimming (beyond, perhaps, Q1) is not needed or necessarily useful. For assembly, trimming to higher values like 10 often is useful.

To run BBDuk in Windows:

Unzip it and untar it to somewhere; let's say C:\bbmap\. You can unzip and untar it with 7-zip, my favorite Windows compression utility.

Then run:

java -ea \
  -Xmx1g \
  -cp C:\bbmap\current\ jgi.BBDukF \
  in=reads.fq out=trimmed.fq \
  ref=C:\bbmap\resources\truseq.fa.gz,C:\bbmap\resources\nextera.fa.gz \
  qtrim=rl \
  trimq=6 \
  ktrim=r \
  k=23 \
  mink=11 \
  hdist=1 \
  tbo \
  tpe

That will do quality and adapter trimming. You can adjust the trimq to a different value depending on the stringency; for mapping, Q6 should be fine. If you have paired end reads in two files, add in2= and out2= with additional file names. BBMap also runs in Windows, by the way, though it needs 24GB for hg19.

Oh, and I certainly encourage you to try other trimmers, and post back here if they did a better job :)

ADD REPLY
0
Entering edit mode

Thanks, I am gonna try it now...My apologies but I am not very expert with command line things. The command you have written above is supposed to be run in cmd or windows command command prompt, correct? Furthermore, is there a manual or -h command to explain the script ? in and out are making sense but others such as ref, hdist etc and how to give input of my adapter sequence.

Regards

ADD REPLY
0
Entering edit mode

bbduk.sh -h will work in Linux, but not in Windows. You can open bbduk.sh (which would be in C:\bbmap\) and view it with a normal editor, though, like Wordpad or Notepad++ (NOT Notepad which cannot handle newlines correctly); it explains the definitions of all the flags in plain text.

hdist=1 means "allow 1 mismatch"; hdist is short for "Hamming distance". ref is the adapter files; the ones I listed will work if you used normal Illumina TrueSeq or Nextera adapters, but if you have custom adapters, then point it to those instead. They can either be in a fasta-formatted file, or just the literal adapter sequence if you use the literal flag, e.g. literal=ACCGTGTCCAGTATTGGTCAGGCCT.

ref means "reference sequence" (in this case, adapters); qtrim=rl means quality-trim the right and left sides; trimq=6 means trim regions with average quality below Q6; ktrim=r means "look for adapter kmers, and when found, trim to the right"; k=23 means "try to match 23-mers between the read and reference"; mink=11 means "use kmers as short as 11 at the end of the read"; tbo and tpe allow better adapter trimming of reads that are paired and will have no effect on single-ended reads.

As for how you run it - you can go to start->run-> and type cmd then hit enter, which will give you a command prompt in windows. In Windows 7 "run" is the little text box at the bottom of the start menu. Then use cd to navigate to the appropriate location; e.g. cd C:\data\ if your reads are there.

Please note that BBTools require Java which is free but not preinstalled on Windows, so if you don't have it, you'll need to download and install from here.

ADD REPLY
0
Entering edit mode

Following is the output after running above command. Trying to figure out if its okay.....

C:\>java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF in=S-01_C.fq out=trimmed.fq
 literal=TCGTATGCCGTCTTCTGCTTG qtrim=rl trimq=6 ktrim=r k=23 mink=11 hdist=0 tbo
 tpe
Executing jgi.BBDukF [in=S-01_C.fq, out=trimmed.fq, literal=TCGTATGCCGTCTTCTGCTT
G, qtrim=rl, trimq=6, ktrim=r, k=23, mink=11, hdist=0, tbo, tpe]

maskMiddle was disabled because useShortKmers=true
Initial:
Memory: free=178m, used=13m

Added 0 kmers; time:    0.008 seconds.
Memory: free=172m, used=19m

Changed from ASCII-33 to ASCII-64 on input quality 95 while prescanning.
Input is being processed as unpaired
Started output streams: 0.024 seconds.
Processing time:                9.609 seconds.

Input:                          10675301 reads          523089749 bases.
QTrimmed:                       256092 reads (2.40%)    2932249 bases (0.56%)
KTrimmed:                       0 reads (0.00%)         0 bases (0.00%)
Trimmed by overlap:             0 reads (0.00%)         0 bases (0.00%)
Result:                         10657189 reads (99.83%)         520157500 bases
(99.44%)

Time:                           9.668 seconds.
Reads Processed:      10675k    1104.23k reads/sec
Bases Processed:        523m    54.11m bases/sec
ADD REPLY
0
Entering edit mode

Since your adapter sequence is only 20bp, you need to set "k=20" or lower - otherwise, there cannot possibly be any kmer matches.

ADD REPLY
0
Entering edit mode

I was wondering is it possible that FastQC incorrectly detects encoding method? and is the sequence of adapter also relates to the encoding version? for example:

The illumina v1.0 small RNA 3p adapter "TCGTATGCCGTCTTCTGCTTG" and the

The illumina v1.5 small RNA 3p adapter "ATCTCGTATGCCGTCTTCTGCTTG"

Now in my case, FastQC tells me that encoding method is 1.5 whereas the adapter sequence in my RAW dataset is of v1.0. I also checked this via manual Find option in Kate text editor that is showing the v1.0 adapter at the end of my reads. However, there are 44717 reads which have 3' sequence matching to v1.5 adapter in the sequence file. Attached is the screenshot of first few lines of RAW file. So what could be the actual encoding method here and again how it relates to the selection of quality value

< image not found >

ADD REPLY
0
Entering edit mode

The encoding method is talking about quality encoding, not adapter sequences.

ADD REPLY
1
Entering edit mode
9.3 years ago
rtliu ★ 2.2k

The recent paper suggests that "a more gentle trimming, specifically of those nucleotides whose PHRED score <2 or <5, is optimal for most studies across a wide variety of metrics."

ADD COMMENT
0
Entering edit mode

As I said before, there is no optimal Q-trim threshold for a wide variety of experiments. That particular paper used very granular metrics - 0,2,5,10,20 - which is rather lazy; I would never pass a paper without data from all cutoffs, at least in the low range. Could there be a value between 5 and 10 that is optimal? From that paper alone, nobody will ever know. And there is a huge difference between Q5 and Q10 when trimming - over a factor of 3 in error rate.

Most importantly - it's trivial to design a custom dataset that will be negatively impacted by quality-trimming, or any other processing. For quality-trimming, all you need to do is have barely sufficient coverage to assemble; in that case, any trimming or filtering will give inferior results. When you have sufficient coverage, quality-trimming is often useful (depending on the experiment). When you do not, it is more likely to be detrimental. The author of that paper explicitly chose insufficient coverage for any normal assembly or analysis - his ranges were from 0.33x to 5x, which are ridiculous. This number is for whole-genome coverage rather than transcriptome coverage (which might be ~20x greater on average), but transcriptome coverage cannot be exactly quantified; anyway, transcript expression is generally follows an exponential distribution. Thus, low-expression transcripts will not be present in a transcriptome assembly; and the lower your coverage, the fewer low-expression transcripts will be correctly assembled. In real life, does anyone use less than 1/3rd of a HiSeq lane to assemble mammalian transcriptomes? I hope not! But if they did, their papers would never pass peer-review if I was a reviewer.

In other words - this paper could not have been designed better to demonstrate that quality-trimming is detrimental. And the conclusion is absolutely correct - when you assemble RNA-seq reads with extremely low coverage, you will typically get more assembled bases with lower quality thresholds. That should not ever be used as evidence that quality-trimming, even aggressive quality-trimming, is generally bad - because it has no relevance to uses such as DNA assembly, which are more common.

In summary - this paper, and its conclusions, are not generally applicable and should not be used outside of discussions regarding transcriptome assembly with ridiculously low amounts of data.

I believe that quality-trimming is very situational; for RNA-seq mapping, for example, BBMap will typically perform better, particularly for quantification of alternate isoforms expression, if no quality-trimming is done. Bases that Illumina calls Q2 are generally over Q12, in my empirical tests (and then Q5 bases are much less accurate than Q2 bases), so trimming them is often detrimental to optimal mapping. Generally, when using an aligner robust to errors, quality-trimming tends to be unwise. But this paper is being cited too often as conclusive evidence that quality-trimming is bad for any purpose. These conclusions are wrong.

ADD REPLY

Login before adding your answer.

Traffic: 2481 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6