TCGA - Correlation between gene expression and CNV
5.8 years ago
rin ▴ 40

Hi everyone

I am new here and to the bioinformatics world, and I would appreciate your help. I am currently looking into correlating gene expression and CNV data from TCGA, most probably for colorectal or ovarian cancer. After some data exploration, I found that only a small percentage of samples are from normal tissue. That being said, should the identification of DEGs be done only between paired (tumor - normal) samples, even if the statistical power would be low? For the correlation itself, which would be more meaningful: 1. correlating DEGs with amplified/deleted genes, or 2. correlating expression with CNV using all the expression data from tumor samples, without taking differential expression into account?

Thanks for helping!

RNA-Seq R correlation tcga cnv • 2.2k views
5.8 years ago

Yes, the number of tumour-normal pairs in the TCGA RNA-seq data is low. Others have partly circumvented this issue by not doing any direct comparison and instead asking: 'What is highly and lowly expressed in the tumour and normal samples separately?' This is how cBioPortal does it, and the default is Z-score > 2 for highly expressed and Z-score < -2 for lowly expressed. Z-scores should ideally be produced from the logged, normalised counts.

I would take this approach (above) and correlate the highly and lowly expressed genes to the CNVs.

Of course, any logical approach will be fine.

Kevin

Thank you a lot for your comments and help, Kevin!

Hi again!

Looking at it a little more, I have seen that DESeq2 and edgeR model gene expression counts with a negative binomial (NB) distribution, meaning that a Z-score would not be valid (or at least would not have the same interpretation) as it would under a normal distribution. Am I misunderstanding something?

Elaborating a little more to make myself as clear as possible. A possible workflow would be:

  1. Check if raw count data downloaded from TCGA follow a normal distribution.
  2. If not, log2 transform.
  3. Remove genes with low read counts.
  4. Calculate the mean and standard deviation of Gene A across samples, then compute a z-score for Gene A in each sample.
  5. Repeat for all genes.
  6. Select genes with z-score > 2 or < -2.
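
The numbered workflow above can be sketched in base R roughly as follows. This is only an illustration on simulated counts: the matrix `counts`, the filtering rule (at least 10 reads in at least 3 samples) and the pseudocount of 1 are assumptions, and a plain log2 transform stands in for the library-size-aware normalisation that DESeq2 or edgeR would provide.

```r
set.seed(1)

# Hypothetical raw count matrix: 200 genes (rows) x 12 samples (columns)
counts <- matrix(rnbinom(200 * 12, mu = 100, size = 2), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))
# Make the first 20 genes lowly expressed so the filter has something to remove
counts[1:20, ] <- rnbinom(20 * 12, mu = 1, size = 2)

# Step 3: drop genes with low read counts
# (assumed rule: at least 10 reads in at least 3 samples)
keep <- rowSums(counts >= 10) >= 3
filtered <- counts[keep, , drop = FALSE]

# Step 2: log2 transform, with a pseudocount of 1 to avoid log2(0)
logged <- log2(filtered + 1)

# Steps 4-5: per-gene z-score across samples
# (a vector subtracted from a matrix is recycled down the rows, i.e. per gene)
gene_mean <- rowMeans(logged)
gene_sd <- apply(logged, 1, sd)
z <- (logged - gene_mean) / gene_sd

# Step 6: flag gene/sample combinations with z > 2 or z < -2
high <- z > 2
low <- z < -2
```

Filtering is done before the log transform here so that the threshold is applied to raw read counts; the order of steps 2 and 3 in the list does not otherwise matter for this sketch.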

Are there any steps that I am missing or not understanding correctly? In other words, should the proposed normalization techniques, such as those using the median or quantiles, not be considered?

When it comes to the correlation: CNV calling will have to be done by pairwise comparison of normal-tumor samples. Would it still be valid to correlate the CNVs with the genes found by the process above?

Thanks once again!

The idea was to download TCGA RSEM counts, normalise them in DESeq2 / edgeR, produce logged data from this (via regularised log in DESeq2 or logCPM in edgeR), and then transform to the Z-scale. I would then obtain the CN segment data from the Broad Institute's FireBrowse server and, finally, conduct either a correlation or a regression analysis between the RNA-seq genes with |Z| > 2 (or 3) and the CN segments identified. There will obviously be other issues along the way.
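
As a rough sketch of the final correlation step, the base R code below computes a per-gene Spearman correlation between expression and copy number. Everything in it is simulated: the matrices `expr` and `cn` are hypothetical, and it assumes the CN segments have already been summarised to gene-level log2 copy ratios, a step that is not covered in this thread.

```r
set.seed(42)
n_genes <- 50
n_samples <- 40
genes <- paste0("gene", seq_len(n_genes))
samples <- paste0("s", seq_len(n_samples))

# Hypothetical gene-level log2 copy ratios (segments already mapped to genes)
cn <- matrix(rnorm(n_genes * n_samples, sd = 0.5), nrow = n_genes,
             dimnames = list(genes, samples))

# Hypothetical logged expression; the first 10 genes are made dosage-sensitive
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(genes, samples))
expr[1:10, ] <- expr[1:10, ] + 3 * cn[1:10, ]

# Keep only samples present in both matrices, in the same order
shared <- intersect(colnames(expr), colnames(cn))
expr <- expr[, shared]
cn <- cn[, shared]

# Per-gene Spearman correlation between expression and copy number
rho <- sapply(genes, function(g) cor(expr[g, ], cn[g, ], method = "spearman"))
pval <- sapply(genes, function(g)
  cor.test(expr[g, ], cn[g, ], method = "spearman", exact = FALSE)$p.value)

# Adjust for testing many genes (Benjamini-Hochberg)
padj <- p.adjust(pval, method = "BH")
res <- data.frame(gene = genes, rho = rho, padj = padj)
head(res[order(-res$rho), ])
```

Spearman is used here only because it makes no linearity assumption; a Pearson correlation or a per-gene linear regression would follow the same pattern.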

Hi Kevin! Coming back to this (almost ancient now) post for a follow-up question!

I did indeed use DESeq2 with a design of ~tumor + normal. One thing I am quite unsure about is whether I should compute the Z-score from the rlog results as (expression in my samples - mean expression in normal samples) / st. dev. of expression in normal samples.
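
For concreteness, the normal-referenced Z-score described above could be written in base R along these lines. This is a sketch on simulated data: `expr` stands in for the rlog output and `group` is a hypothetical vector of sample labels.

```r
set.seed(7)

# Hypothetical logged expression (e.g. rlog output): 100 genes x 30 samples
expr <- matrix(rnorm(100 * 30, mean = 8), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("s", 1:30)))

# Hypothetical sample labels: 10 normal samples followed by 20 tumour samples
group <- rep(c("normal", "tumour"), times = c(10, 20))

# Mean and standard deviation of each gene in the normal samples only
normal_mean <- rowMeans(expr[, group == "normal"])
normal_sd <- apply(expr[, group == "normal"], 1, sd)

# Z-score of every sample relative to the normal reference
z_vs_normal <- (expr - normal_mean) / normal_sd

# Tumour samples only, for downstream thresholding
z_tumour <- z_vs_normal[, group == "tumour"]
```

With few normal samples, the per-gene standard deviation is estimated from very little data, so these Z-scores can be unstable.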

Am I missing something?

Thank you!

Hey rin, to transform to Z-scores, you just need to do:

t(scale(t(data)))
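
To spell out what that one-liner does: `scale()` centres and scales the columns of a matrix, so with genes in rows the matrix is transposed first and transposed back afterwards. The small check below, on a made-up `data` matrix, compares it with the per-gene calculation written out by hand. Note that this scales each gene against all samples in `data`, not against the normal samples only.

```r
set.seed(3)

# Hypothetical logged expression matrix: genes in rows, samples in columns
data <- matrix(rnorm(5 * 8, mean = 10, sd = 2), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("s", 1:8)))

# scale() works on columns, so transpose to put genes in columns,
# scale, then transpose back to the genes x samples layout
z <- t(scale(t(data)))

# The same calculation written out per gene
z_manual <- (data - rowMeans(data)) / apply(data, 1, sd)

# Compare values only; scale() attaches extra attributes to its result
all.equal(z, z_manual, check.attributes = FALSE)
```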
