Is it possible/recommended to trim RNA-Seq reads to a specific length
7.1 years ago
komal.rathi ★ 4.1k

Hi everyone,

I have paired-end RNA-sequencing samples where the mates in the two paired files are of unequal lengths.

For example:

Original reads

R1:

@NB501069:25:HY3KCBGXX:1:11101:16049:1105 1:N:0:NGACCA
CTCGTGGGGGGGCCGGGCCACCCCTCCCACGGCGCGACCGCTNNCCN
+
AAAAAEEEEEAEEEEEEEEAEEEEEEEEEEEAAEEE/EEAA/##6E#

R2:

@NB501069:25:HY3KCBGXX:1:11101:16049:1105 2:N:0:NGACCA
CTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNCCGCGCGGCACCCCCCCGTCGCCGGGGCGGGGG
+
AAA############################################################/####/E/E<EA<E/EEEEEEEE/A/E<AEEAEEEE//

Using trim_galore hasn't made any difference:

R1:

@NB501069:25:HY3KCBGXX:1:11101:16049:1105 1:N:0:NGACCA
CTCGTGGGGGGGCCGGGCCACCCCTCCCACGGCGCGACCGC
+
AAAAAEEEEEAEEEEEEEEAEEEEEEEEEEEAAEEE/EEAA

R2:

@NB501069:25:HY3KCBGXX:1:11101:16049:1105 2:N:0:NGACCA
CTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNCCGCGCGGCACCCCCCCGTCGCCGGGGCGGG
+
AAA############################################################/####/E/E<EA<E/EEEEEEEE/A/E<AEEAEEEE

This is another sample:

Original file R1:

@NB501069:23:HYV7KBGXX:1:11101:19650:1064 1:N:0:CTTGTA
CCCAGNCTGGAGTGCAGTGGCATTGTCATAGCTCACTATAACCTCAAATTCCTCAACTCAAATGATCCTCCCACCTCAGCCTCCCAAGTAGCTAGGACTAC
+
AAA6A#EEAEEEAAAAEEEAEEEE/A/E/EE<EEE/EE/EEEEEEEEEEEEAEEEEEEEEEEEE/EEEEE</E<EEEEAEEEE/AEEE<EE/AEEEEEEE<
@NB501069:23:HYV7KBGXX:1:11101:1659:1064 1:N:0:GTTGTA
CAGGGTTGGAAGAGCTGGCCTCGCCTTTCGGCTCCTTTCTCGTCTTGGCCGCGCCGCGGCGTAGGTCCAGCTTGAGCTGCTGGTTCTGCTGGAGCAGGGTG
+
AAAAAEEEEEEAEEEEEEEEEEE<EAEEEEEAE/EEEEAEEAEEEE/EEEEEA//EE<EAEA//EEEAEEE/E<//</A6E<EEE<EE6AAEAE6<AEEE/
@NB501069:23:HYV7KBGXX:1:11101:3487:1064 1:N:0:CTTGTA
AAGAATCAGCAGCCAATCCTCAAAGTTTAAATCATTTAAGGAAATGGGGAAACAAAATTCCAGGTAAATAACAAGACTGAAAAACTAGATTTAAAATAGTG
+
AAAAA6EEEAEEEEEEEEEEEEEEE6EAEEEEEEEEEEEAEEEEAEAA<EEEEEEEEEEEEEEEEEAEEE/EEEEEEEEEAAEEE<AEAEE/EEEEEEEA/
@NB501069:23:HYV7KBGXX:1:11101:12495:1064 1:N:0:CTTGTA
CATTATTTGGAATTCCTGCGACTGTTTCCCTATCAGTATCCTCTGCTGGCCTCTTTACAGTTTTGCATTCTGCTGTGCCATTTGTAGACCGAACGTC
+
AAAAAAEEEAAEEEEEEEEE<EEAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAEEAEE<AAEEEAEE

First few reads after trimming R1:

@NB501069:23:HYV7KBGXX:1:11101:19650:1064 1:N:0:CTTGTA
CCCAGNCTGGAGTGCAGTGGCATTGTCATAGCTCACTATAACCTCAAATTCCTCAACTCAAATGATCCTCCCACCTCAGCCTCCCAAGTAGCTAGGACTAC
+
AAA6A#EEAEEEAAAAEEEAEEEE/A/E/EE<EEE/EE/EEEEEEEEEEEEAEEEEEEEEEEEE/EEEEE</E<EEEEAEEEE/AEEE<EE/AEEEEEEE<
@NB501069:23:HYV7KBGXX:1:11101:3487:1064 1:N:0:CTTGTA
AAGAATCAGCAGCCAATCCTCAAAGTTTAAATCATTTAAGGAAATGGGGAAACAAAATTCCAGGTAAATAACAAGACTGAAAAACTAGATTTAAAATAGT
+
AAAAA6EEEAEEEEEEEEEEEEEEE6EAEEEEEEEEEEEAEEEEAEAA<EEEEEEEEEEEEEEEEEAEEE/EEEEEEEEEAAEEE<AEAEE/EEEEEEEA
@NB501069:23:HYV7KBGXX:1:11101:12495:1064 1:N:0:CTTGTA
CATTATTTGGAATTCCTGCGACTGTTTCCCTATCAGTATCCTCTGCTGGCCTCTTTACAGTTTTGCATTCTGCTGTGCCATTTGTAGACCGAACGTC

Looking at the distribution of read lengths, the majority of the reads are 100 bp long.

My goal is to retrieve fusions from the RNA-Seq data. I am able to run STAR-Fusion on this data despite the unequal mate lengths, but I am unable to run chimeraScan for exactly this reason.

Is it possible to trim the reads with a trimming tool so that the mates end up with equal lengths? More importantly, would that approach be recommended?
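For anyone who wants to check the read-length distribution themselves, here is a minimal Python sketch. It assumes an uncompressed FASTQ with exactly four lines per record; the names `read_fastq` and `length_distribution` are made up for illustration and do not come from any library, and the three toy records stand in for a real file.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or an empty line) stops the iteration
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def length_distribution(handle):
    """Count how many reads there are of each length."""
    return Counter(len(seq) for _, seq, _ in read_fastq(handle))


# Toy input standing in for a real FASTQ file (hypothetical reads).
example = StringIO(
    "@r1\nACGTACGT\n+\nEEEEEEEE\n"
    "@r2\nACGT\n+\nEEEE\n"
    "@r3\nACGTACGT\n+\nEEEEEEEE\n"
)
dist = length_distribution(example)
for length, count in sorted(dist.items()):
    print(length, count)
```

For a real file, replace the `StringIO` object with `open("reads_R1.fastq")` (a placeholder file name) and run it once per mate file to compare R1 against R2.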

Thanks!

trim RNA-Seq chimeraScan STAR-Fusion • 2.0k views

Interesting question. What if, instead of trimming, you padded the shorter reads with Ns?
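To make the padding idea concrete, here is a rough Python sketch. It assumes Phred+33 qualities and pads the 3' end of the shorter mate with `N` bases and `#` qualities (the same quality character that the N calls carry in the reads shown in the question). `pad_pair` is a made-up helper, and whether chimeraScan or any other fusion caller handles padded reads sensibly is not established here.

```python
def pad_pair(r1, r2, pad_base="N", pad_qual="#"):
    """Pad the shorter mate of a pair on its 3' end so both mates have equal length.

    r1 and r2 are (sequence, quality) tuples. The padding characters are
    assumptions: 'N' for the base and '#' (lowest usual Phred+33 score) for quality.
    """
    target = max(len(r1[0]), len(r2[0]))

    def pad(read):
        seq, qual = read
        missing = target - len(seq)
        return seq + pad_base * missing, qual + pad_qual * missing

    return pad(r1), pad(r2)


# Hypothetical pair with mates of 4 and 6 bases.
padded_r1, padded_r2 = pad_pair(("ACGT", "EEEE"), ("ACGTAC", "EEEEEE"))
print(padded_r1)
print(padded_r2)
```

Padding keeps every original base, but it adds positions that carry no information, so an aligner may treat them as mismatches or soft-clip them.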


komal.rathi: Hopefully not all of your R2 reads look like that (I assume these are just the first few reads). What was done to this data to get it into this state (on-sequencer trimming?), and how? Are those Ns the result of adapter masking? If the R1 reads are indeed trimmed, then you may have short inserts in this data.


I have edited my question to reflect that I had trimmed the reads using trim_galore.


"RNA-sequencing samples where the mates in the two paired files are of unequal lengths"

I suspect that this data was pre-trimmed (on the sequencer or in BaseSpace), which is why you have reads of unequal length. If the majority (or all) of your R2 reads consist of more than 50% Ns like that, then this appears to be pretty bad data (unless the bases have been deliberately masked). I am not sure it can be used or trusted to find fusions.
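One way to check how widespread the problem is would be to compute the N fraction per read, along the lines of this Python sketch. The 50% cutoff is taken from the comment above; the function names are made up for illustration, and the toy sequences are not from the actual data.

```python
def n_fraction(seq):
    """Fraction of bases in a read that are N (0.0 for an empty read)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def fraction_of_reads_above(seqs, threshold=0.5):
    """Fraction of reads whose N content is strictly above the threshold."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    flagged = sum(1 for s in seqs if n_fraction(s) > threshold)
    return flagged / len(seqs)


# Hypothetical reads: one mostly N, one clean.
toy_reads = ["NNNA", "ACGT"]
print(n_fraction("ACNN"))
print(fraction_of_reads_above(toy_reads))
```

Run over all R2 sequences, the second number tells you whether the reads shown in the question are typical or just an artifact of the first tile.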


Yeah, I guess this question needs more information than I have provided. I need to talk to the biologists who generated this data. I will clarify some things and add the details to the question.

7.1 years ago
mforde84 ★ 1.4k

fastx_trimmer (part of the FASTX-Toolkit) - http://hannonlab.cshl.edu/fastx_toolkit/

It will get the job done. Why not give it a try? If you want to test it out, run an alignment and a differential expression analysis with both the original and the trimmed reads, and see how well they compare. Also, the 5' end of the read in RNA-Seq is noisier than the 3' end, so if you trim uniformly from the 5' end it may actually improve alignments, but probably not by very much.
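If you would rather see what fixed-window trimming does before running a tool, here is a minimal Python sketch. `trim_read` loosely mirrors the "first base to keep / last base to keep" idea (1-based, inclusive) and is not the tool's actual implementation; `equalize_pair` is an assumed extra step that drops a pair when either mate is shorter than the target, so the two files stay in sync. Both names are made up for illustration.

```python
def trim_read(seq, qual, first=1, last=None):
    """Keep bases first..last (1-based, inclusive) of a read and its qualities."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq) if last is None else last
    return seq[first - 1:end], qual[first - 1:end]


def equalize_pair(r1, r2, length):
    """Trim both mates to `length` bases from the 3' end.

    r1 and r2 are (sequence, quality) tuples. Returns None when either mate
    is shorter than `length`, so that the whole pair can be discarded together.
    """
    if len(r1[0]) < length or len(r2[0]) < length:
        return None
    return trim_read(*r1, last=length), trim_read(*r2, last=length)


# Hypothetical pair with mates of 6 and 8 bases, trimmed to 6.
pair = equalize_pair(("ACGTAC", "EEEEEE"), ("ACGTACGT", "EEEEEEEE"), 6)
print(pair)

# Trimming one base from the 5' end and keeping up to base 5.
print(trim_read("ACGTACGT", "ABCDEFGH", first=2, last=5))
```

Dropping short pairs instead of keeping them is a design choice: it guarantees equal mate lengths at the cost of losing some reads, so it is worth checking how many pairs fall below the chosen length first.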
