Question

Are all RPKM/FPKMs in RNAseq "equal" ?

1

5.1 years ago

CrazyB ▴ 280

I am seeking answers to my specific question and general references for this question -

My superficial understanding is that RPKM is a unit of RNA-seq output that stands for reads per kilobase of transcript per million mapped reads. The calculation takes into account the library size and mRNA length to provide a generalized unit that approximates the absolute quantity of transcripts in the analyzed samples. As a general rule, the expression of gene X can be compared among different samples for the relative expression in the same study. What is not clear to me is whether the expression can be compared among different studies, assuming the studies were conducted under a similar or identical protocol, i.e. the library prep, sequencing machine, depth, and most/all parameters were pre-specified by the protocol.
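As a rough illustration of the definition described above, the calculation could be sketched in Python as follows. The function name and the example numbers are made up for illustration and do not come from any particular pipeline.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = gene_length_bp / 1_000           # correct for transcript length
    millions = total_mapped_reads / 1_000_000    # correct for library size
    return read_count / kilobases / millions


# Hypothetical numbers: 500 reads on a 2,000 bp transcript,
# in a library with 20 million mapped reads.
value = rpkm(500, 2_000, 20_000_000)
print(value)
```

Note that both corrections use only information from a single sample, which matters for the comparisons asked about below.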

My questions, in the space of cancer transcriptomics, are:

can one RPKM number for gene X in a lung tumor in the TCGA database be compared to another RPKM number for gene X in another lung tumor from a non-TCGA study that followed the TCGA guidelines?

Is re-processing of fastq data from two different studies sufficient (or how sufficient is it) to mitigate the study/batch-effect ?

General references to help me improve my understanding are especially welcome.

RNA-Seq rpkm • 2.0k views

updated 5.1 years ago by Charles Warden 8.2k • written 5.1 years ago by CrazyB ▴ 280

score 7 · Answer 1 · 2019-03-27

I will keep my answer short; you can read more details at the links at the end.

As a general rule, the expression of gene X can be compared among different samples for the relative expression in the same study.

This is incorrect. TPM and RPKM/FPKM are within-sample normalizations: they allow comparison of the expression levels of different genes within the same sample. This is a consequence of RNA-Seq being a relative measurement, not an absolute one.
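A toy example may make the "relative, not absolute" point more concrete. The sketch below assumes a simplified model in which expected read counts are proportional to transcript abundance times transcript length; the gene names, abundances, and the helper function are all hypothetical.

```python
def rpkm_from_abundance(abundance, lengths_bp, depth):
    """Expected RPKM per gene under a toy model (reads ~ molecules x length)."""
    weights = {g: abundance[g] * lengths_bp[g] for g in abundance}
    total = sum(weights.values())
    # Expected reads per gene at the given sequencing depth.
    reads = {g: depth * w / total for g, w in weights.items()}
    return {g: reads[g] / (lengths_bp[g] / 1e3) / (depth / 1e6) for g in reads}


lengths = {"X": 1000, "Y": 1000, "Z": 1000}
# Gene X has the same number of molecules in both samples; only Z changes.
sample_a = rpkm_from_abundance({"X": 10, "Y": 10, "Z": 80}, lengths, 1_000_000)
sample_b = rpkm_from_abundance({"X": 10, "Y": 10, "Z": 380}, lengths, 1_000_000)

print(sample_a["X"], sample_b["X"])
```

In this toy model the RPKM of gene X differs between the two samples even though its absolute abundance is identical, because a change in gene Z alters the share of reads that X receives.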

can one RPKM number for gene X in a lung tumor in the TCGA database be compared to another RPKM number for gene X in another lung tumor from a non-TCGA study that followed the TCGA guidelines?

Is re-processing of fastq data from two different studies sufficient (or how sufficient is it) to mitigate the study/batch-effect ?

No and no, for both questions. As stated above, RPKM is a within-sample normalization. As for batch effect removal, in addition to re-processing the data uniformly, it is necessary that the batch effects and the experimental conditions are not confounded (one is not a linear combination of the other); otherwise you can't untangle their effects.
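One way to check the "linear combination" condition is to test whether a design matrix containing both batch and condition has full column rank. The following is only a minimal sketch using exact arithmetic from the standard library; the function names and the sample labels are hypothetical, and real analyses would normally rely on the checks built into the statistical package being used.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(batches, conditions):
    """Intercept plus one indicator column per non-reference level."""
    batch_levels = sorted(set(batches))[1:]
    cond_levels = sorted(set(conditions))[1:]
    return [
        [1]
        + [int(b == lvl) for lvl in batch_levels]
        + [int(c == lvl) for lvl in cond_levels]
        for b, c in zip(batches, conditions)
    ]


def is_confounded(batches, conditions):
    """True if batch and condition columns are linearly dependent."""
    x = design_matrix(batches, conditions)
    return matrix_rank(x) < len(x[0])


# Each study contains only one condition: effects cannot be separated.
confounded = is_confounded(
    ["studyA", "studyA", "studyB", "studyB"],
    ["tumor", "tumor", "normal", "normal"],
)
# Both conditions appear in both studies: effects can be separated.
balanced = is_confounded(
    ["studyA", "studyA", "studyB", "studyB"],
    ["tumor", "normal", "tumor", "normal"],
)
print(confounded, balanced)
```

A rank-deficient design means no batch correction method can tell a study effect from a biological effect, no matter how uniformly the fastq files were re-processed.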

What the FPKM? A review of RNA-Seq expression units

In RNA-Seq, 2 != 2: Between-sample normalization

score 2 · Answer 2 · 2019-03-27

Your library preparation can definitely affect the FPKM expression values (stranded versus unstranded, polyA versus ribosome-depleted, etc.). Your quantification can also vary (reference set, unique versus multi-mapped allowed counts, etc).

So, I would agree that the values from different studies are not perfectly comparable, but I think there are certain characteristics that are common enough that the comparison might be useful. For example, the FPKM value is calculated independently for each sample, so you don't have to worry about differences due to the overall set of samples you are comparing. By contrast, normalized counts for a sample will change depending upon which sample has the fewest counts, or possibly other factors.
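The difference between a per-sample value and a cohort-dependent one can be sketched with a toy example. The "scale to the smallest library" function below is a deliberately simplified stand-in for count normalization, not the method of any specific tool, and the sample names and numbers are hypothetical.

```python
def rpkm(count, length_bp, library_size):
    """Per-sample value: depends only on this sample's own library size."""
    return count / (length_bp / 1e3) / (library_size / 1e6)


def scale_to_smallest(counts, library_sizes):
    """Toy cohort-level normalization: rescale every sample to the smallest library."""
    target = min(library_sizes.values())
    return {s: counts[s] * target / library_sizes[s] for s in counts}


# Sample S1 is identical in both cohorts; only its companion sample differs.
cohort1_sizes = {"S1": 20_000_000, "S2": 10_000_000}
cohort2_sizes = {"S1": 20_000_000, "S3": 5_000_000}

norm1 = scale_to_smallest({"S1": 400, "S2": 150}, cohort1_sizes)
norm2 = scale_to_smallest({"S1": 400, "S3": 90}, cohort2_sizes)

s1_rpkm = rpkm(400, 2_000, 20_000_000)  # same value whichever cohort S1 is in
print(norm1["S1"], norm2["S1"], s1_rpkm)
```

The normalized count for S1 changes with the cohort, while its RPKM does not. That independence is convenient, but it does not by itself make values comparable across studies.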

In terms of the correction, it is extremely important that the factors you want to adjust are randomized. Otherwise, I would be critical of the results, even if the program ran without reporting an error message. I would also recommend you visualize expression before and after adjustment, to make sure there aren't any weird effects that might be due to over-fitting. You can also try doing something like centered expression among each group you want to adjust (prior to visualization) as a sort of baseline comparison, but then you definitely have to focus on relative (rather than absolute) expression.
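The centering baseline mentioned above could look something like the following minimal sketch, which subtracts each group's mean from its own samples. The function name, the log-scale values, and the group labels are hypothetical.

```python
from statistics import mean


def center_within_groups(values, groups):
    """Subtract each group's mean expression from the samples in that group."""
    group_means = {
        g: mean(v for v, label in zip(values, groups) if label == g)
        for g in set(groups)
    }
    return [v - group_means[g] for v, g in zip(values, groups)]


# Hypothetical log2 expression of one gene in two studies.
expression = [5.0, 7.0, 9.0, 11.0]
study = ["studyA", "studyA", "studyB", "studyB"]

centered = center_within_groups(expression, study)
print(centered)
```

After centering, only the differences within each group remain, which is why the interpretation has to be relative. If the groups differ for a real biological reason, that difference is removed as well.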

Similarly, in terms of the batch effects, you might want to play around with the UCSC Xena Browser. They use normalized counts instead of FPKM values, but I would argue that visualizing things like GAPDH can help show if batch effects are theoretically possible. You definitely need randomization to confirm that it is really a batch effect and not a real difference that could be incorrectly adjusted, such as a gene that really varies between tumor and normal when one study only has tumor samples and the other study only has normal samples.

In terms of a Xena Browser example that you can try out:

1) In the far left column, select "I know the study I want to use" and enter “TCGA TARGET GTEx”

2) In the second column, make sure the "genomic" radio button is selected. Enter a gene like "GAPDH" or "EGFR" and press enter (and then click "Gene Expression" and "Done," for each gene).

3) In the next available column on the right, similarly add all 4 "phenotype" variables that are provided by default (although you can select more, if you want).

4) Now, above the columns, you can enter text (where it says "Find samples..."). Try entering something like "lung" and click the upside-down triangle hamburger icon to select "Filter"

5) Click the icon with the 3 bars in the upper-right corner to plot gene versus phenotype expression (you can also then click the icon with 3 full bars to go back to "column" mode).

Again, you would really want to see if the GTEx normal and TCGA adjacent normal were similar (and the tumor samples can't really be randomized among the GTEx samples), but I believe this gives you something to think about.