Question

Get 3'UTR sequence using RNAseq read coverage

0

Entering edit mode

5.1 years ago

joangibert14 • 0

Hi!

I have some bam files aligned to mouse reference genome. I want to analyze some motif enrichment in the 3'UTR of this samples but I realize that some mRNA have longer 3'UTR than the reference genome. I double checked this an is something reported in platelets that is what I'm working.

Is there any way to get the full 3'UTR using, for example coverage and expand the preexistent 3'UTR bed file? I was thinking on a kind of loop that check coverage every 100bp and, if coverage is > threshold, append the sequence... Any other ideas? Actually I am not quite sure that the previous idea would work...

Thanks!

RNA-Seq • 2.7k views

ADD COMMENT • link updated 5.1 years ago by Kristoffer Vitting-Seerup ★ 4.0k • written 5.1 years ago by joangibert14 • 0

1

Entering edit mode

I think what you mean is that the actual 3'UTR is longer than the annotated in the reference file (GTF). Better use a dedicated tool, such as stringtie to perform reference-guided transcriptome assembly and then check if you indeed find transcripts that extend beyond the annotated 3'UTRs. Still, keep in mind that for motif enrichment your query sequences should not be excessively long to avoid motif matches by chance. Homer for example uses 200bp as a default window.

ADD REPLY • link 5.1 years ago by ATpoint 81k

1

Entering edit mode

Or use a tool that does comparative analysis: DREAM will allow you to put in a set of negative sequences and find things that are enriched in your postive set vs the negative set, free of length confounding.

ADD REPLY • link 5.1 years ago by i.sudbery 19k

0

Entering edit mode

I tried stringtie with gencode gtf for mouse. The file I got does not have any line with UTR feature. The gtf contains this information and also when I check the bam files in IGV I see reads in the UTRs. Any idea what is going on?

ADD REPLY • link 5.1 years ago by joangibert14 • 0

1

Entering edit mode

stringtie has no way of distinguishing between coding and non-coding exons. Thus is can identify where the exons are, but not which bit of them are coding/non-coding. One solution would be to select transcripts from your new assembly that are identical to reference transcripts, except for the extended first/last exon, and copy across the CDS annotation from the reference transcript.

ADD REPLY • link 5.1 years ago by i.sudbery 19k

0

Entering edit mode

Uhm, the feature column in the output should. At least is what it claims in the documentation, right?

Anyway, I am not sure what you are referring to. Is something like comparing the two gtf files (the mouse reference and the stringtie output) and get the lines from the second that are not in the first?

Thanks

ADD REPLY • link 5.1 years ago by joangibert14 • 0

1

Entering edit mode

I think you are miss-reading the manual. To me the bit I think you are reffering to is just a cut and paste job of the definition of the GTF format. I've never seen anything come out of stringtie other than transcript and exon lines. How would it know where the CDS/UTRs were?

So, yes I was talking about comparing the GTF files, but you can't do it on a line by line format, you'd need to consider a whole transcript at once, which is multiple lines and compare transcripts by checking their intron chains. You might take a look at cuffcompare which will classify novel transcripts according to how they match known transcripts.

ADD REPLY • link 5.1 years ago by i.sudbery 19k

score 1 · Answer 1 · 2019-03-04

If you have StringTie output IsoformSwitchAnalyzeR can help you out. It can predict ORF (with high accuracy) of the transcripts whereby you also can get the 3'UTRs. From this it is quite easy to extract both the genomic cooridnates - and even easier to extract the corresponding genomic sequence.

Furthermore if you have conditions IsoformSwitchAnalyzeR cab analyse your data for systematic differences in 3'UTR length between conditions - both in individual genes and on a genome wide level.