Hello,
I am trying to learn what my options are if I would like to trim polyA from my reads. (FastqC repo see previous discussion: https://www.biostars.org/p/302411/#302507) We are quite certain we need to trim adapter, which we would use bbduk, but we also see polyA showing in many reads, and would like to trim that too.
Since polyA shows in different reads with different length, it does not make sense to trim by length. I hope to trim all the way till the nucleotide is not A. Any suggestions?
(Or maybe I should just go ahead with alignment (Tophat) and let the aligner ignore polyA in the reads?)
Thank you very much!
You should know that the old 'Tuxedo' pipeline of Tophat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. (If you can't get access to that publication, let me know and I'll -cough- help you.) There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.
Thank you so much! I was wondering about that recently lol and this confirms it. Appreciate it!
By the way, in your experience, do you think these aligners can handle 10-20bps polyA in the reads?
I usually trim my reads, but I don't know if it's necessary. Haven't evaluated that.
You can use
bbduk.sh
to remove poly-A tails as well. withliteral=AAAAA
and adjusting value ofk=
as needed to something small.Thank you very much for following up with my questions. Can this be done in one command, together with trimming the adaptor? I assume not - because k for adaptor will be around 20, while this one should be something like 2-3 right?
You can use unix pipes for multiple
bbduk.sh
passes. Something like (example for SE reads, adjust for PE reads with in1= in2= etc):Thank you. One more question - For adaptor trimming, is there a particular reason not to just do force-trimming? (Simply trim # number of bps from one end)
You could (
forcetrimright=
orforcetrimleft=
) but the the problem is your adapter contamination may not always be in the same spot (unless you are doing some special library prep and expect it to be so).I thought the adaptor is always showing up at the left of the reads... why they will not be at the same spot please? Sorry for all these questions, but I really want to understand it better. Thank you.
Adapter should always be towards the right end of the read. See this document for clarification on how the libraries are constructed. Since your insert size varies all fragments will not have the adapter in the same location (assuming you have short inserts and have run of real DNA to sequence).