Statistical tests for RPKM comparison
8.9 years ago
rpjl1230 ▴ 20

Hi, I was wondering if anyone could suggest which statistical tests I can use for differential gene expression with RPKM values.

I am stuck with RPKM because DESeq and edgeR are unfortunately not an option for me. I have generated some RNA-Seq data from a dozen tissue samples collected from patients with 2 disease phenotypes (n=6 for each group), and I would like to compare my results to published RNA-Seq data from in vitro systems. Unfortunately, the authors of the published paper only provided a table of RPKM values (and only a single replicate for each of the tested conditions!). They also did not deposit any of the fastq files in any database, so I cannot even redo the alignment.

In this scenario, where I have no raw counts for the published data and uneven group sizes, is there any way that I can still reliably compare my data to the existing data and do differential expression analysis? Any suggestions will be very helpful!

Many thanks!

RPKM differential-gene-expression • 5.9k views

RPKMs are terrible for statistics. Would it be possible to just analyse your samples and compare the resulting fold-changes/DE genes to those from the published study? That might yield nice results.


I thought about doing FC as well but have another dilemma. What would you use as a cut-off? Presumably I will have to use some arbitrary cut-offs, e.g. if >2 log2FC then highly abundant. Is there any way to make it more objective?

Also, to do FC, would you:

  1. Just do a standard FC calculation of my data relative to the published data? Or
  2. Calculate FC from the median RPKM for each sample and then compare the perturbation of my patient data to the in vitro data? I thought about this because the patient samples have pretty high intragroup variance (some samples have much higher/lower median RPKM than others within the same group) and I am not sure how to "normalize" properly.

Thanks again!


I am facing a similar situation, with data on GEO only being available as log2 FPKM. After you converted to TPM, how did you end up testing for differential expression?

Thanks!

8.9 years ago
John 13k

This is a very good article about the pros and cons of doing stats on FPKM, and about how to calculate TPM, which is by all accounts a better statistic when you don't have much else to work with.

https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

There are plenty of others, but that link also links to all the good ones too :)
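For a table that only contains RPKM/FPKM values, the conversion to TPM amounts to rescaling each sample so that its values sum to one million. A minimal sketch follows; the function name, gene names and numbers are made up for illustration, and it assumes the table holds every gene for that sample (if the published table was filtered, the result is only TPM-like).

```python
def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM/FPKM values so that they sum to 1e6."""
    total = sum(rpkm.values())
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}


# Hypothetical single sample: gene name -> RPKM
sample = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
tpm = rpkm_to_tpm(sample)
```

The rescaling is done per sample, never across samples, so each column of a published table would be converted on its own.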

But to answer the higher-level question of not "how can you" but "why should you" do differential analysis, consider the following sources of noise:

  • your technical "hands" versus theirs (experimental noise)
  • in vivo VS in vitro (biological noise - this will most likely be huge)
  • differences in sequencing technology (paired vs single)
  • differences in data quality and total data (read depth, better mapping, etc)
  • all the statistical noise inherent in going in to an analysis 'blind' without looking at a specific region/difference to begin with.

We've all got to prioritise time, and personally, if it were me, I would either do a very rudimentary analysis to get a ballpark figure for overall correlation, or not even bother downloading the external data in the first place and instead focus my efforts on doing the very best/cleanest analysis with what sounds like (n=6) some pretty good data by RNA-Seq standards :)
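One way such a rudimentary check could look is a Spearman rank correlation of log2 fold-changes over the genes both datasets share. This is only a sketch under assumptions: the gene names and values are invented, the pseudocount of 1 is an arbitrary choice, and rank correlation is used here simply because it is less sensitive to the scale differences between the two studies.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def log2_fold_change(a, b, pseudocount=1.0):
    """log2 ratio with a pseudocount (assumed value) to avoid dividing by zero."""
    return math.log2((a + pseudocount) / (b + pseudocount))


# Hypothetical inputs: gene -> (expression in condition 1, expression in condition 2)
own = {"geneA": (50.0, 5.0), "geneB": (8.0, 8.0), "geneC": (2.0, 40.0), "geneD": (12.0, 3.0)}
published = {"geneA": (120.0, 10.0), "geneB": (15.0, 12.0), "geneC": (1.0, 30.0), "geneE": (7.0, 7.0)}

shared = sorted(set(own) & set(published))
own_lfc = [log2_fold_change(*own[g]) for g in shared]
pub_lfc = [log2_fold_change(*published[g]) for g in shared]
rho = spearman(own_lfc, pub_lfc)
```

With real data the shared gene list would be thousands of genes long, and a scatter plot of the two fold-change vectors would likely be more informative than the single number.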


Thanks John! I have already compared my two in vivo phenotypes and the results are indeed quite nice, but we wanted to compare them to in vitro conditions to see which stress factors might be associated with each of the two in vivo conditions, hence the published data come in handy.

You are absolutely correct about the differences in technology, data quality and all the statistical noise - it's almost like comparing apples to pears, but getting a ballpark figure will be good enough for us.

One question regarding TPM. I have been doing some reading (including the Wagner paper), but I don't quite understand what Z (mean read length) stands for. Does it refer to the average length in bp of the reads from the sequencer? Or does it refer to the average number of bp mapped to each gene in a given sample? I would greatly appreciate it if someone could clarify this for me. Thanks again!
