Question

Identifying isoforms de novo

1

6.8 years ago

jjrin ▴ 40

Hello Biostars. I recently ran RSEM on my RNA-seq data and got unusual isoform results. The IsoPct (isoform percentage) column contains only 100 or 0, meaning that either the gene/isoform was not detected at all or the gene has only one isoform. I find this highly unlikely, and I believe it happens because isoforms are not annotated in my reference genome or in my data. How would I go about finding the isoforms for my genes using FASTA files (which contain individual genes, not divided by chromosome) and BAM files for my various conditions? I do not have isoforms annotated, so this will have to be de novo.
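One way to check the suspicion above is to count how many transcripts each gene has in the RSEM output: if every gene has exactly one transcript in the reference, IsoPct can only be 100 (expressed) or 0 (not expressed). The sketch below assumes the usual tab-separated `*.isoforms.results` layout with `transcript_id`, `gene_id` and `IsoPct` columns; the sample rows and identifiers are made up for illustration.

```python
import csv
import io
from collections import Counter


def transcripts_per_gene(handle):
    """Count how many transcripts each gene has in an RSEM isoforms.results table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["gene_id"] for row in reader)


def multi_isoform_genes(handle):
    """Return the sorted list of genes that have more than one transcript."""
    counts = transcripts_per_gene(handle)
    return sorted(gene for gene, n in counts.items() if n > 1)


# Hypothetical rows: each gene has a single transcript, so IsoPct is 100 or 0.
example = (
    "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
    "tA.1\tgeneA\t1500\t1350.0\t120.0\t10.0\t8.0\t100.00\n"
    "tB.1\tgeneB\t2000\t1850.0\t0.0\t0.0\t0.0\t0.00\n"
)

print(multi_isoform_genes(io.StringIO(example)))

# For a real file (hypothetical name):
# with open("sample.isoforms.results") as fh:
#     print(len(multi_isoform_genes(fh)), "genes with more than one transcript")
```

An empty list means the reference itself defines one transcript per gene, so the quantification step cannot report anything else.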

I have tried several available programs, such as FlipFlop, which requires SAM files; my BAM files are much too big to convert (>10 GB). I have also tried GESS, which requires a FASTA file for each chromosome in the reference genome (I only have a reference containing the individual genes, not divided by chromosome). I used HISAT and HTSeq to produce my BAM files and gene counts.

Much appreciated.

RNA-Seq gene • 2.1k views

ADD COMMENT • link updated 6.8 years ago by Istvan Albert 100k • written 6.8 years ago by jjrin ▴ 40

0

Was this a de novo assembly?

ADD REPLY • link 6.8 years ago by st.ph.n ★ 2.7k

0

Not entirely. We had a reference genome (Taejoon lab) for assembly and for identifying exons, CDS, etc. However, isoforms weren't included, and we would like to find a way to identify them.

ADD REPLY • link 6.8 years ago by jjrin ▴ 40

score 0 · Answer 1 · 2017-06-20

0

6.8 years ago

Istvan Albert 100k

In this case you may not need a full de-novo transcript assembly since it appears that the genome is already known. This makes the process a little easier. Traditionally people used Cufflinks to do this, but the tool has been falling out of favor. For more current methods consult the literature.

See for example the Transcript Discovery section in

A survey of best practices for RNA-seq data analysis, Genome Biol. 2016
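As a rough illustration of what a genome-guided (reference-based) assembly run can look like, the sketch below builds command lines for StringTie, one of the newer alternatives to Cufflinks. It assumes `samtools` and `stringtie` are installed and on the PATH, that the reads were aligned to the genome (HISAT2 has a `--dta` option intended for downstream assemblers), and that the BAM is coordinate-sorted before assembly. All file names are hypothetical. StringTie reads BAM directly, so no SAM conversion is involved.

```python
import shlex


def sort_cmd(bam, sorted_bam, threads=4):
    """Command to coordinate-sort a BAM file with samtools."""
    return ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, bam]


def stringtie_cmd(sorted_bam, out_gtf, guide_gff=None, threads=4):
    """Command to assemble transcripts from one sorted BAM file.

    guide_gff is optional: with it, known gene models guide the assembly;
    without it, transcripts are assembled from the alignments alone.
    """
    cmd = ["stringtie", sorted_bam, "-o", out_gtf, "-p", str(threads)]
    if guide_gff:
        cmd += ["-G", guide_gff]
    return cmd


def merge_cmd(gtfs, merged_gtf, guide_gff=None):
    """Command to merge per-sample assemblies into one transcript set."""
    cmd = ["stringtie", "--merge", "-o", merged_gtf]
    if guide_gff:
        cmd += ["-G", guide_gff]
    return cmd + list(gtfs)


# Hypothetical file names; print the commands rather than running them.
for cmd in (
    sort_cmd("cond1.bam", "cond1.sorted.bam"),
    stringtie_cmd("cond1.sorted.bam", "cond1.gtf", guide_gff="genes.gff"),
    merge_cmd(["cond1.gtf", "cond2.gtf"], "merged.gtf", guide_gff="genes.gff"),
):
    print(shlex.join(cmd))

# To execute one of them: subprocess.run(cmd, check=True)
```

The merged GTF could then serve as the transcript annotation for a quantification tool, so that genes with several assembled transcripts are no longer forced to a single isoform.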

ADD COMMENT • link 6.8 years ago by Istvan Albert 100k

0

Thank you for your reply! I read that section, and it brings up Cufflinks as well as iReckon, SLIDE, and StringTie. However, Montebello seems the most applicable since it couples "isoform discovery and quantification". Do you have any preference for, or experience with, these methods? It also seems that most of these methods require SAM files, which need much more processing on my side and could prove to be a problem.

ADD REPLY • link 6.8 years ago by jjrin ▴ 40

1

Aligning against known transcripts can't tell you much about unknown transcripts.

In general, to discover new isoforms you will need alignments against the genome. If those are not available then you would need to perform a de-novo transcriptome assembly of the reads.
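To make the point about genome alignments concrete: a spliced read aligned to the genome carries `N` operations in its CIGAR string, and each `N` marks an intron. Two different introns that share a start or end coordinate are direct evidence of alternative isoforms, which an alignment against known transcripts cannot show. The following is a minimal sketch that tallies such junctions from SAM-formatted lines; the reads and the chromosome name are invented for illustration.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions(pos, cigar):
    """Yield (intron_start, intron_end), 1-based inclusive, for each N in a CIGAR."""
    ref = pos  # SAM POS is the 1-based leftmost reference position
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            yield (ref, ref + length - 1)
        if op in "MDN=X":  # operations that consume reference bases
            ref += length


def count_junctions(sam_lines):
    """Count reads supporting each (chrom, intron_start, intron_end)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
        if cigar == "*":
            continue  # unaligned read
        for start, end in junctions(pos, cigar):
            counts[(chrom, start, end)] += 1
    return counts


# Hypothetical reads: two introns share the same start but end differently.
example_sam = [
    "@SQ\tSN:chr1\tLN:100000",
    "r1\t0\tchr1\t100\t60\t50M200N50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t100\t60\t50M500N50M\t*\t0\t0\t*\t*",
    "r3\t0\tchr1\t120\t60\t30M200N70M\t*\t0\t0\t*\t*",
]

for junction, n_reads in sorted(count_junctions(example_sam).items()):
    print(junction, n_reads)
```

Real input could be streamed from `samtools view` instead of a list, which avoids writing a large SAM file to disk.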

ADD REPLY • link 6.8 years ago by Istvan Albert 100k

0

What if I still have the whole-genome FASTA file available? GESS doesn't ask for a GTF annotation file, so would it be beneficial to work with the whole-genome data to find isoforms? I'm currently trying SLIDE, which isn't working either: it reports that my references are invalid ('6L').
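One possible cause of an "invalid reference" message like this is that the sequence names in the alignment header do not match the names in the reference FASTA; that is only a guess, not a confirmed diagnosis. A small sketch for comparing the two sets of names follows. The header lines would come from `samtools view -H`, and the names used here are made up.

```python
def fasta_names(lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def sam_header_names(lines):
    """Collect reference names from the SN: field of @SQ header lines."""
    names = set()
    for line in lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def compare_names(fasta, header):
    """Report names that appear on only one side."""
    return {
        "only_in_fasta": sorted(fasta - header),
        "only_in_bam": sorted(header - fasta),
    }


# Hypothetical inputs: the FASTA uses a 'chr' prefix for one name, the BAM does not.
example_fasta = [">chr6L some description", "ACGT", ">chr6S", "ACGT"]
example_header = ["@HD\tVN:1.0\tSO:coordinate", "@SQ\tSN:6L\tLN:1000", "@SQ\tSN:chr6S\tLN:900"]

print(compare_names(fasta_names(example_fasta), sam_header_names(example_header)))
```

If both lists come back empty, the names agree and the problem lies elsewhere.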

ADD REPLY • link 6.8 years ago by jjrin ▴ 40