Hi,
I am using the GENCODE mouse annotation (vM17) to quantify genes from RNA-seq data, and I am also interested in lncRNA quantification. I used two different annotation files: to quantify all transcripts I used the comprehensive annotation file (primary assembly), and to quantify just the lncRNAs I used the GENCODE mouse lncRNA annotation file. Since the comprehensive file should already contain all of the lncRNAs, I compared the FPKM values calculated with the comprehensive and lncRNA annotation files.
What I observe is that the FPKM values are different, although the trend is the same. For example, across three conditions, with the comprehensive annotation file I get A=2, B=4, C=8, whereas with the lncRNA annotation file I get A=6, B=11, C=23 (made-up numbers, for illustration only). I would like to ask the experts whether I should use the FPKM values from the lncRNA annotation or from the comprehensive file.
My assumption is that with the lncRNA annotation, reads that fall in a region overlapping an mRNA are counted towards the lncRNA, because no mRNA annotation is present. With the comprehensive annotation, however, the read is assigned according to where the overlap is more prominent. This is just my thinking.
Please help me understand which I should choose: comprehensive or lncRNA?
Thanks!!
To add to what Grant has said, which is perfectly valid, I have to state that FPKM should not be used anymore. There is a very well cited manuscript ( A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis ) that states the following:
An update (6th October 2018):
You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:
Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units
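To make the units concrete, here is a minimal Python sketch of the textbook FPKM and TPM definitions. It is not the implementation of StringTie or any other tool, and the feature names, lengths and counts are hypothetical toy values. One loud assumption: when no library size is passed, the FPKM denominator is taken to be the fragments assigned to features in the annotation, which is a simplification (tools differ in what they use as the denominator). Under that assumption, the same lncRNA with the same count gets a very different FPKM depending on which annotation is used.

```python
def fpkm(counts, lengths_bp, total_fragments=None):
    """Fragments per kilobase of transcript per million fragments."""
    # ASSUMPTION: if no library size is given, use the fragments assigned to
    # the features in this annotation. Real tools may use a different total.
    if total_fragments is None:
        total_fragments = sum(counts)
    return [c * 1e9 / (length * total_fragments)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = 1e6 / sum(rates)
    return [r * scale for r in rates]


# Hypothetical toy data: one lncRNA (2,000 bp, 100 fragments) and
# one mRNA (1,000 bp, 900 fragments).
comprehensive = fpkm([100, 900], [2000, 1000])  # lncRNA + mRNA annotated
lnc_only = fpkm([100], [2000])                  # only the lncRNA annotated
fixed_total = fpkm([100], [2000], total_fragments=1000)  # same library size

print("lncRNA FPKM, comprehensive annotation:", comprehensive[0])
print("lncRNA FPKM, lncRNA-only annotation:  ", lnc_only[0])
print("lncRNA FPKM, lncRNA-only, fixed total:", fixed_total[0])
print("TPM sum:", sum(tpm([100, 900], [2000, 1000])))
```

In this toy case the lncRNA-only FPKM is inflated purely because the denominator shrank, while the ordering across conditions would be preserved. TPM always sums to one million within a sample, so it too depends on which features are included. Neither unit is a substitute for raw counts plus a proper normalisation when differential expression is the goal.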
Thanks Kevin and Grant.
I am using StringTie to convert the BAMs to GTFs (which contain the FPKM and TPM values). There is also a Python script from the same group that can convert these into read counts, but I guess the algorithm converts FPKM into reads. Do you have experience with that? I just want to make sure that the algorithm doesn't suffer from the same bias.
You have (at least) two trustworthy and reliable ways to convert your BAM files to read counts: