Calculate z-score from RPKM values
8.9 years ago
vyom84 ▴ 40

Hello everyone,

The RNA-seq data from TCGA contain z-scores instead of RPKM values. I want to analyze TCGA and GTEx data together, but the problem is that the GTEx data contain RPKM values and counts. There is a tutorial explaining z-score calculation (http://dept.stat.lsa.umich.edu/~kshedden/Python-Workshop/gene_expression_comparison.html), but I am not sure whether it applies to RNA-seq data or whether it is the right tutorial to follow.

Can someone please guide me on how to calculate per-gene z-scores from GTEx RNA-seq data (RPKM and count data)? If possible, please also check whether the code in the link above is correct.

Thanks

Z-score RNA-Seq GTEx TCGA • 14k views
ADD COMMENT
0
Entering edit mode

Where are you downloading your TCGA data from: the TCGA Data Portal or elsewhere? TCGA provides two types of RNA-seq data. The older pipeline is just called RNAseq, while the newer pipeline is referred to as RNAseqV2. Which one are you using? Both versions provide raw counts, plus RPKM for the old pipeline and RSEM scaled estimates for the RNAseqV2 data. These data are not z-scores. Please see this site for more details on RNAseqV2 data: https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2

ADD REPLY
0
Entering edit mode

Thanks for the reply and the link.

I downloaded the data directly from COSMIC. Below is the head of the expression file:

ID_SAMPLE    SAMPLE_NAME        GENE_NAME  REGULATION  Z_SCORE  ID_STUDY
1337808      TCGA-02-2483-01    SFMBT1     over        2.416    329
1337808      TCGA-02-2483-01    SGCE       normal      -0.274   329
1337808      TCGA-02-2483-01    RCAN2      normal      -0.577   329
1337808      TCGA-02-2483-01    KIAA0895   normal      -0.333   329
1337808      TCGA-02-2483-01    LIN7C      normal      0.428    329
1337808      TCGA-02-2483-01    NDUFV2     normal      0.790    329
1337808      TCGA-02-2483-01    AKAP14     normal      -0.275   329
1337808      TCGA-02-2483-01    ST20       normal      0.808    329
1337808      TCGA-02-2483-01    CLCF1      normal      -0.640   329

I think z-scores are easy to understand and explain, so I want to convert the GTEx values (counts or RPKM) into z-scores. Please advise how to do this; I have both RPKM and count data. Below is a sample of the GTEx data:

TargetID             Gene_Symbol          Chr  Coord      GTEX-N7MS-0007-SM-2D7W1  GTEX-N7MS-0008-SM-4E3JI  GTEX-N7MS-0011-R10A-SM-2HMJK  GTEX-N7MS-0011-R11A-SM-2HMJS  GTEX-N7MS-0011-R1a-SM-2HMJG  GTEX-N7MS-0011-R2a-SM-2HML6
ENST00000390859.1    ENSG00000212161.1    1    159821762  0.000000                 0.000000                 0.000000                      0.000000                      0.000000                     0.000000
ENST00000290551.4    ENSG00000159388.5    1    203274619  4356.000000              2685.999512              6367.449707                   9156.000000                   4144.000000                  5143.111328
ENST00000475157.1    ENSG00000159388.5    1    203274619  0.000000                 18.000416                8.550518                      0.000000                      0.000000                     32.888611
ENST00000429660.1    ENSG00000223683.1    1    63727784   0.000000                 0.000000                 0.000000                      0.000000                      0.000000                     0.000000
ENST00000473120.1    ENSG00000092850.7    1    36549676   0.000000                 0.000000                 19.124006                     18.559824                     0.000000                     23.603249
ENST00000207457.3    ENSG00000092850.7    1    36549676   0.000000                 0.000000                 112.875992                    53.440178                     207.776093                   112.396751
ENST00000469024.1    ENSG00000092850.7    1    36549676   0.000000                 0.000000                 0.000000                      0.000000                      52.223907                    0.000000
ENST00000489568.1    ENSG00000162461.7    1    16062900   0.000000                 0.000000                 0.000000                      0.000000                      0.000000                     0.000000
ENST00000294454.5    ENSG00000162461.7    1    16062900   68.000000                132.000000               696.000000                    5004.000000                   512.000000                   312.000000

Thanks

ADD REPLY
0
Entering edit mode

I don't use COSMIC much, so I researched this a little. My first thought would be to calculate the z-scores the way COSMIC does, but the help file they link to for more information is broken. You may want to contact them about this. I then found a Biostars answer by @David Fredman, "TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?", which I think answers your question. It looks like COSMIC uses z-scores so that gene expression is comparable across the various platforms in TCGA; however, it is not clear whether COSMIC calculated them or took them straight from TCGA. I think this is a good way to go, since you are comparing yet another platform. However, note from @David Fredman's response that TCGA uses the distribution of the normal tissues (sometimes!) as the control distribution. Be careful comparing these z-scores to your data if you do not have a comparable control, because you will get false-positive results. My suggestions would be to 1) contact COSMIC to see how they calculated these z-scores, and 2) if that doesn't work out, download the original TCGA count data and calculate the z-scores yourself so that you know what is being used as the control distribution.
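To make the control-distribution point concrete, here is a minimal sketch of scoring samples against a separate reference group rather than against themselves. It is not COSMIC's or TCGA's actual procedure, which is unknown here. The function name, the log2(x + 1) transform, and the example numbers are all assumptions made for illustration.

```python
import math
import statistics

def zscores_vs_reference(test_values, reference_values, pseudocount=1.0):
    """Z-score one gene's test values against a reference (control) group.

    Both inputs are expression values for the same gene. The mean and
    standard deviation come only from the reference group, so the choice
    of reference decides what counts as "over" or "under".
    """
    ref_logged = [math.log2(v + pseudocount) for v in reference_values]
    ref_mean = statistics.mean(ref_logged)
    ref_sd = statistics.stdev(ref_logged)  # sample standard deviation (n - 1)
    if ref_sd == 0:
        raise ValueError("reference values have zero variance")
    return [(math.log2(v + pseudocount) - ref_mean) / ref_sd for v in test_values]

# Made-up numbers for one gene, not taken from TCGA or GTEx.
reference = [3.0, 7.0, 15.0, 31.0]   # e.g. normal tissue
test = [15.0, 255.0]                 # e.g. tumor samples
scores = zscores_vs_reference(test, reference)
```

Swapping in a different reference group changes every score, which is why z-scores computed against different controls should not be compared directly.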

ADD REPLY
0
Entering edit mode

Just out of curiosity, are you trying to compare GTEx and TCGA? If so, why not use only TCGA, as it already has data for matched normal-tumor samples?

ADD REPLY
2
Entering edit mode
TCGA's "tumor adjacent" normals for RNA-seq are not perfectly normal, due to lymphocyte infiltration or other tumor microenvironment or epigenetic effects that need further investigation.
ADD REPLY
0
Entering edit mode

Never gave it a thought. Thanks!

ADD REPLY
3
Entering edit mode
8.9 years ago

You can calculate a z-score by subtracting the mean and dividing by the standard deviation for each gene. You may want to transform your data before doing this (log-transform or other).
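A minimal sketch of that calculation for a single gene, assuming a log2(x + 1) transform (the pseudocount is my choice, to avoid log of zero) and the sample standard deviation. The function name is hypothetical; the example values are one transcript row from the GTEx excerpt posted above.

```python
import math
import statistics

def gene_zscores(values, pseudocount=1.0):
    """Z-score one gene's expression values across samples.

    Values are log2-transformed first, then centered on the gene's mean
    and scaled by the gene's sample standard deviation.
    """
    logged = [math.log2(v + pseudocount) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (n - 1)
    if sd == 0:
        # No variation across samples: every z-score is taken as zero here.
        return [0.0] * len(logged)
    return [(x - mean) / sd for x in logged]

# Row ENST00000290551.4 from the GTEx sample posted earlier in the thread.
example = [4356.0, 2685.999512, 6367.449707, 9156.0, 4144.0, 5143.111328]
z = gene_zscores(example)
```

By construction the resulting scores have mean 0 and standard deviation 1 within this set of samples, so they only describe each sample relative to the others in the same set.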

ADD COMMENT
0
Entering edit mode

It should probably be pointed out that while this will yield z-scores, the scores themselves may or may not be very meaningful. In fact, unless all of the samples have been normalized in a meaningful way ahead of time the resulting z-scores won't be comparable.
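As one illustration of what "normalized ahead of time" could mean for raw counts, the sketch below scales each sample to counts per million (CPM) so that differences in sequencing depth do not leak into the z-scores. This is only the simplest library-size correction, not necessarily the method the comment has in mind. The function name and the counts are made up.

```python
def counts_per_million(counts_by_sample):
    """Scale raw counts to counts per million within each sample.

    counts_by_sample maps sample name -> {gene name -> raw count}.
    """
    cpm = {}
    for sample, gene_counts in counts_by_sample.items():
        library_size = sum(gene_counts.values())
        if library_size == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        cpm[sample] = {
            gene: count / library_size * 1_000_000
            for gene, count in gene_counts.items()
        }
    return cpm

# Made-up counts: sample B was sequenced twice as deeply as sample A,
# but the relative expression of the two genes is identical.
raw = {
    "A": {"gene1": 100, "gene2": 900},
    "B": {"gene1": 200, "gene2": 1800},
}
normalized = counts_per_million(raw)
```

Without this step, gene1 would look twice as high in sample B purely because of depth, and any z-score computed from the raw counts would reflect that artifact.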

ADD REPLY
0
Entering edit mode

I totally agree. While the calculation can be done, the utility of doing so remains unknown.

ADD REPLY
0
Entering edit mode

From reading the literature and the comments, my understanding of the z-score calculation is:

  1. Convert the count/RPKM values of each gene into log values.
  2. Calculate the mean and standard deviation of gene X's log values across the 20 lung tissue samples (suppose I have data for 20 samples).
  3. For the first lung tissue sample: z = (gene X log value - mean of the log values across the 20 lung tissues) / standard deviation of the log values across the 20 lung tissues.
  4. Now I have the z-score for gene X in the first lung tissue sample. Using the same protocol, I can convert the log values of all genes into z-scores.
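The four steps above can be sketched for a whole table as follows. This is only an illustration of the protocol as written, assuming log2(x + 1) for step 1; the function name is hypothetical, and the two rows are taken from the GTEx excerpt earlier in the thread (6 samples rather than 20).

```python
import math
import statistics

def zscore_table(expression, pseudocount=1.0):
    """Apply steps 1-4 to every row of an expression table.

    expression maps row ID -> list of values, one per sample.
    Rows with no variation get None, since the z-score is undefined.
    """
    result = {}
    for row_id, values in expression.items():
        logged = [math.log2(v + pseudocount) for v in values]      # step 1
        mean = statistics.mean(logged)                             # step 2
        sd = statistics.stdev(logged)                              # step 2
        if sd == 0:
            result[row_id] = [None] * len(values)
        else:
            result[row_id] = [(x - mean) / sd for x in logged]     # steps 3-4
    return result

table = {
    "ENST00000294454.5": [68.0, 132.0, 696.0, 5004.0, 512.0, 312.0],
    "ENST00000390859.1": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
z_table = zscore_table(table)
```

Rows that are zero in every sample, which are common in this kind of data, have a standard deviation of zero and cannot be scored, so they are marked as None instead of dividing by zero.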

Is the above protocol correct? Please advise.

Should I calculate the z-scores from read counts or from RPKM values?

Do these z-scores really have any meaning? The z-scores COSMIC provides look like this:

ID_SAMPLE    SAMPLE_NAME        GENE_NAME  REGULATION  Z_SCORE  ID_STUDY
1337808      TCGA-02-2483-01    SFMBT1     over        2.416    329
1337808      TCGA-02-2483-01    SGCE       normal      -0.274   329

If I calculate z-scores using the above approach, will I be able to tell whether a gene is over-regulated or normally regulated?

Please advise how to proceed.

Thanks
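For the over/normal question, a label like COSMIC's REGULATION column could be derived from a z-score with a simple cutoff. The sketch below assumes a threshold of 2 and an "under" label for the negative side. Both are assumptions: the excerpt only shows 2.416 as "over" and values below 1 as "normal", and COSMIC's actual rule is not confirmed in this thread.

```python
def label_regulation(z_score, threshold=2.0):
    """Label a z-score as over, under, or normal using a symmetric cutoff.

    The threshold and the "under" label are assumptions for illustration.
    """
    if z_score > threshold:
        return "over"
    if z_score < -threshold:
        return "under"
    return "normal"

# Z-scores for SFMBT1 and SGCE from the COSMIC excerpt above.
labels = [label_regulation(z) for z in (2.416, -0.274)]
```

With these assumptions the two example rows come out as "over" and "normal", matching the excerpt, but the labels are only as meaningful as the distribution the z-scores were computed against.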

ADD REPLY
