Question

Strange mapping on miRNAs from RNA-seq library

0

8.7 years ago

manekineko ▴ 150

Hi,
I'm seeing strange mapping on miRNA precursors from a paired-end (PE) RNA-seq library. I checked the library and almost 100% of the reads are 100 nt, so bowtie/tophat is probably doing some trimming before mapping. (I mapped with bowtie alone, via Galaxy, using the end-to-end fast option; the result is in the picture below.) Strangely, I observe this only on miRNAs. The mapped fragments are 22-24 nt. (PS: this is a standard TruSeq RNA-seq library, 100 PE.) What could be causing this?

miRNA RNA-seq • 2.6k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.7 years ago by manekineko ▴ 150

Ram · Answer 1 · 2015-08-05

0

8.7 years ago

andrew.j.skelton73 6.5k

There seems to be higher coverage over only a fraction of the hairpin sequence, which might imply that what's aligning is a mature 3p or 5p sequence?

ADD COMMENT • link 8.7 years ago by andrew.j.skelton73 6.5k

0

Yes, but the question is how, as there are no small RNAs in the sample library... and no <30 nt reads in the file.

ADD REPLY • link 8.7 years ago by manekineko ▴ 150

0

I assume these are clipped alignments. If you hover over one of these reads, it will display the CIGAR string... I suggest you post that here. But more importantly, I recommend you adapter-trim your reads prior to mapping, so they don't need to be clipped.
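
To make the clipping check concrete, here is a minimal sketch (not from the original thread) of how a CIGAR string can be summarized. The function name `summarize_cigar` and the example strings are hypothetical; the operation codes are the ones defined in the SAM specification.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def summarize_cigar(cigar):
    """Return (aligned_bases, soft_clipped, hard_clipped) for a CIGAR string.

    aligned_bases counts read bases in M, = and X operations only.
    """
    ops = CIGAR_OP.findall(cigar)
    # Reject strings that are not fully covered by valid operations (e.g. "*").
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    aligned = sum(int(n) for n, op in ops if op in "M=X")
    soft = sum(int(n) for n, op in ops if op == "S")
    hard = sum(int(n) for n, op in ops if op == "H")
    return aligned, soft, hard


if __name__ == "__main__":
    # Hypothetical examples: a clipped 101 nt read versus an unclipped 23 nt read.
    for example in ("23M78S", "23M"):
        print(example, summarize_cigar(example))
```

A clipped long read would show a large soft-clip count next to a short aligned stretch, whereas a genuinely short read would show no clipping at all.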

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by Brian Bushnell 20k

0

They are adapter-trimmed:

BBDuk version 35.14
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=33585m, free=32358m, used=1227m

Added 250901 kmers; time:     0.723 seconds.
Memory: max=33585m, free=30431m, used=3154m

Input is being processed as paired
Started output streams:    0.100 seconds.
Processing time:           407.442 seconds.

Input:                      91156618 reads         9206818418 bases.
KTrimmed:                   717213 reads (0.79%)     39491028 bases (0.43%)
Result:                     91043132 reads (99.88%)     9167327390 bases (99.57%)

Time:               408.319 seconds.
Reads Processed:      91156k     223.25k reads/sec
Bases Processed:       9206m     22.55m bases/sec

image: screenshot

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by manekineko ▴ 150

0

According to that, <1% of reads had adapters. Doesn't that seem odd?

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by andrew.j.skelton73 6.5k

0

It doesn't, to me... as long as the average insert size is substantially greater than read length, and particularly if the reads are size-selected to a longer insert size, you can get pretty close to zero percent of reads having adapters. To validate this, an insert-size histogram from mapping or merging is quite useful.
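
As a rough illustration of that validation step, the sketch below estimates what fraction of properly paired reads come from inserts shorter than the read length, using the TLEN column of a SAM file. The function name `fraction_with_adapter` is hypothetical, and for RNA-seq mapped to a genome TLEN also spans introns, so this should be treated as an approximate check rather than an exact insert-size measurement.

```python
def fraction_with_adapter(sam_lines, read_length):
    """Fraction of proper pairs whose |TLEN| is shorter than the read length.

    Such pairs are the ones expected to read through into adapter sequence.
    """
    short = total = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        # Count each pair once: properly paired (0x2), first in pair (0x40),
        # and neither secondary (0x100) nor supplementary (0x800).
        if not flag & 0x2 or not flag & 0x40 or flag & 0x900:
            continue
        total += 1
        if abs(tlen) < read_length:
            short += 1
    return short / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical file name; any SAM file from a paired-end mapping would do.
    with open("mapped.sam") as handle:
        print(f"{100 * fraction_with_adapter(handle, 101):.2f}% of pairs")
```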

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by Brian Bushnell 20k

0

valid point - I agree that a histogram of insert sizes would be useful

ADD REPLY • link 8.7 years ago by andrew.j.skelton73 6.5k

0

RNA was size-selected at 200-400 nt, standard TruSeq protocol.

I used Picard to estimate the mean insert size: ~193, so the distance between the mates is around zero.
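
The "around zero" figure can be reproduced with a small worked calculation. This is a sketch that assumes Picard's value is the full fragment length (from the start of one mate to the end of the other); the 101 nt read length is taken from the BBDuk statistics earlier in the thread, and `inner_distance` is a hypothetical helper name.

```python
def inner_distance(mean_insert, read_length):
    """Gap between the two mates; a negative value means the mates overlap."""
    return mean_insert - 2 * read_length


if __name__ == "__main__":
    # Mean insert size of ~193 with 101 nt reads.
    print(inner_distance(193, 101))
```

With these numbers the result is slightly negative, meaning the mates overlap by a few bases on average.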

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by manekineko ▴ 150

0

It looks to me like these are proper small RNAs, and only a small fraction (<1%) of the reads map to them. Why are you doubting this result? Did you do some chemical process that you expected to remove all of the small RNAs?

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by Brian Bushnell 20k

0

I'm doubting it because there are no small RNA reads after adapter removal, according to the histogram (100% of reads fall into the 100 nt bin):

0    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
10    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
20    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
30    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
40    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
50    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
60    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
70    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
80    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
90    0    0.000%    45578309    100.000%    0    0.000%    4603409209    100.000%
100    45578309    100.000%    45578309    100.000%    4603409209    100.000%    4603409209    100.000%
110    0    0.000%    0    0.000%    0    0.000%    0    0.000%

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by manekineko ▴ 150

1

There's something wrong here... you started with 101bp reads. Adapter removal trimmed 0.79% of them and discarded 0.12%. Therefore, you should still have 0.67% of reads remaining that are shorter than 101bp, but the histogram shows zero. Are you sure you ran it on the trimmed reads?

Also, the CIGAR string of the mapped read is 23M, indicating that there is no clipping and the read is 23 bp long.
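
The arithmetic behind that consistency check can be sketched from the numbers in the BBDuk log above. It assumes, as a simplification, that every discarded read was also a trimmed one; the variable names are illustrative only.

```python
# Figures copied from the BBDuk log earlier in the thread.
input_reads = 91_156_618
input_bases = 9_206_818_418
ktrimmed_reads = 717_213
result_reads = 91_043_132


def pct(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


mean_read_length = input_bases / input_reads   # length of the untrimmed reads
discarded_reads = input_reads - result_reads   # reads removed entirely
# Assumption: all discarded reads were among the trimmed ones, so the rest
# of the trimmed reads should survive as reads shorter than full length.
expected_short = ktrimmed_reads - discarded_reads

if __name__ == "__main__":
    print(f"mean read length: {mean_read_length:.1f} bp")
    print(f"trimmed:          {pct(ktrimmed_reads, input_reads):.2f}%")
    print(f"discarded:        {pct(discarded_reads, input_reads):.2f}%")
    print(f"expected short:   {pct(expected_short, input_reads):.2f}%")
```

Computed from the raw counts, the expected short fraction comes out at about 0.66%; the 0.67% quoted above is the difference of the rounded percentages. Either way it is far from the zero shown in the histogram.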

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by Brian Bushnell 20k

0

Oh, my bad, I mixed up the filenames. So it seems that these are actual small RNAs, probably contamination in the RNA-seq library? Do you think I can use them to look for differentially expressed miRNAs, or are the mapped reads too few? I do get some values from cuffdiff on miRNA genes from this.

Here is the histogram after re-running on the trimmed reads from the R1 mate:

0    0    0.000%    45521566    100.000%    0    0.000%    4584970307    100.000%
10    35053    0.077%    45521566    100.000%    526628    0.011%    4584970307    100.000%
20    43342    0.095%    45486513    99.923%    1037073    0.023%    4584443679    99.989%
30    25747    0.057%    45443171    99.828%    883980    0.019%    4583406606    99.966%
40    18330    0.040%    45417424    99.771%    814247    0.018%    4582522626    99.947%
50    17896    0.039%    45399094    99.731%    977696    0.021%    4581708379    99.929%
60    19892    0.044%    45381198    99.692%    1284650    0.028%    4580730683    99.908%
70    29270    0.064%    45361306    99.648%    2191778    0.048%    4579446033    99.880%
80    83122    0.183%    45332036    99.584%    7113941    0.155%    4577254255    99.832%
90    0    0.000%    45248914    99.401%    0    0.000%    4570140314    99.677%
100    45248914    99.401%    45248914    99.401%    4570140314    99.677%    4570140314    99.677%
110    0    0.000%    0    0.000%    0    0.000%    0    0.000%
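
For reference, the histogram can be summarized with a few lines of Python. This is a sketch that assumes the first column is the start of a 10 nt length bin and the second column is the number of reads in that bin; `reads_in_range` is a hypothetical helper.

```python
# (bin_start, reads) pairs copied from the first two columns of the histogram above.
HISTOGRAM = [
    (0, 0), (10, 35053), (20, 43342), (30, 25747), (40, 18330),
    (50, 17896), (60, 19892), (70, 29270), (80, 83122), (90, 0),
    (100, 45248914), (110, 0),
]


def reads_in_range(hist, lo, hi):
    """Sum reads over bins whose start lies in the half-open range [lo, hi)."""
    return sum(reads for start, reads in hist if lo <= start < hi)


total_reads = reads_in_range(HISTOGRAM, 0, 120)
shorter_reads = reads_in_range(HISTOGRAM, 0, 100)   # trimmed below full length
mirna_sized = reads_in_range(HISTOGRAM, 20, 30)     # bin containing 22-24 nt reads

if __name__ == "__main__":
    print(f"total reads:      {total_reads}")
    print(f"shorter than 100: {shorter_reads} ({100 * shorter_reads / total_reads:.3f}%)")
    print(f"20-29 nt bin:     {mirna_sized} ({100 * mirna_sized / total_reads:.3f}%)")
```

Under those assumptions, about 0.6% of the R1 reads are shorter than full length, and the bin that would hold 22-24 nt reads contains 43342 of them.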

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by manekineko ▴ 150

0

The differential expression program will have a statistical model in it that determines whether the number of reads is sufficient to be significant. However, if you did some selection process that was supposed to remove small RNAs, then you should ignore them, because the selection will bias the results.

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by Brian Bushnell 20k