Normalizing transcriptome data by tissue type
6.3 years ago

Hi guys,

We are working on a university project in which we want to find genes that discriminate between different cancer types. For that, we are using gene expression data from the TCGA dataset.

A naive approach would be to simply run feature selection on the tumor data of each type. However, we assume that this would identify genes that are relevant not for the tumor but for the cell type of origin. For example, we want to compare thyroid and lung cancer. Using only tumor data, we would expect to find differentially expressed genes that are specific to the original cell type rather than to the tumor. So we want to "normalize" thyroid tumor data against healthy thyroid tissue to first find genes that discriminate thyroid tumor from healthy thyroid, which can then be compared with the equivalently "normalized" genes for lung cancer.
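
One way to read this "normalization" is as a per-tissue tumor-versus-normal contrast. The sketch below is only an illustration under that assumption: the function name `log2_fold_change`, the pseudocount, and the toy numbers are made up for this example, and a real analysis would more likely use a dedicated differential expression tool on the TCGA counts.

```python
import numpy as np

def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Per-gene log2 fold change of tumor vs. normal for ONE tissue.

    tumor, normal: array-likes of shape (genes, samples) holding
    already library-size-normalized expression values (assumption).
    """
    tumor = np.asarray(tumor, dtype=float)
    normal = np.asarray(normal, dtype=float)
    # The pseudocount avoids log2(0) for unexpressed genes.
    return (np.log2(tumor.mean(axis=1) + pseudocount)
            - np.log2(normal.mean(axis=1) + pseudocount))

# Toy data: row 0 is a gene that is high in the tissue regardless of
# disease state, row 1 is a gene that is elevated only in the tumor.
thyroid_tumor = [[1023, 1023], [63, 63]]
thyroid_normal = [[1023, 1023], [7, 7]]
lfc = log2_fold_change(thyroid_tumor, thyroid_normal)
```

In this toy case the tissue-level gene gets a fold change of zero and drops out, while the tumor-elevated gene keeps a positive value. The same contrast computed for lung would give a second vector that can be compared gene by gene.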

We have some ideas about how to do this ourselves, but we suppose that this is not an uncommon task. Has anyone heard of this "normalization" approach, and how is it usually done? We suppose that it needs to be done when clustering cancer types in order to see meaningful differences, but we could not find it in the literature we read.

We hope we have stated our problem comprehensibly; if not, feel free to ask. Thanks for your help!

RNA-Seq Cancer Data Feature Selection TCGA • 2.4k views

Can you block on tissue origin? For example, in the same way that you might incorporate a batch effect (~ batch + group), you would instead incorporate tissue origin (~ tissue + tumour).
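
To make the `~ tissue + tumour` idea concrete, here is a minimal sketch of that additive model fitted by ordinary least squares for a single gene. It assumes log-scale expression and 0/1 indicator coding; the helper `fit_additive_model` and the toy values are hypothetical, and this is not the moderated fit that a differential expression package would perform.

```python
import numpy as np

def fit_additive_model(expr, tissue, tumour):
    """Least-squares fit of expr ~ 1 + tissue + tumour for one gene.

    expr:   log-scale expression per sample
    tissue: 0/1 indicator per sample (e.g. 0 = thyroid, 1 = lung)
    tumour: 0/1 indicator per sample (0 = normal, 1 = tumour)
    Returns [intercept, tissue effect, tumour effect].
    """
    expr = np.asarray(expr, dtype=float)
    design = np.column_stack([np.ones(len(expr)), tissue, tumour])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return coef

# Toy samples: thyroid normal, thyroid tumour, lung normal, lung tumour.
tissue = np.array([0, 0, 1, 1])
tumour = np.array([0, 1, 0, 1])
expr = np.array([5.0, 7.0, 9.0, 11.0])  # tissue adds 4, tumour adds 2
coef = fit_additive_model(expr, tissue, tumour)
```

The tumour coefficient is estimated with the tissue difference already accounted for. Note that this additive form estimates one tumour effect shared by both tissues; to ask whether the tumour effect differs between thyroid and lung, an interaction term (tissue:tumour) would have to be added.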

6.3 years ago

Hey, you could try:

  1. Obtain healthy / normal tissue-specific expression data from one or more online databases (see here: How Can We Get Tissue Specific Genes?; also look up FANTOM5)
  2. Using these databases, determine which genes are specific to your tissues of interest (e.g. thyroid). To do this, convert the downloaded data (preferably normally distributed) to Z-scores per gene across tissues. 'Tissue-specific' genes will then have high Z-scores in the tissues in which they are most highly expressed, whereas ubiquitously expressed genes will have low Z-scores
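
The Z-score step could be sketched roughly as follows. This assumes a genes-by-tissues matrix of log-scale expression from a normal-tissue database; the function names, the threshold of 1.5, and the toy matrix are illustrative choices, not values from any particular database.

```python
import numpy as np

def tissue_zscores(expr):
    """Row-wise Z-scores for a (genes, tissues) expression matrix."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = np.nan  # constant genes get undefined Z-scores
    return (expr - mean) / sd

def tissue_specific_genes(expr, gene_names, tissue_index, threshold=1.5):
    """Names of genes whose Z-score in the chosen tissue is >= threshold."""
    z = tissue_zscores(expr)
    return [name for name, score in zip(gene_names, z[:, tissue_index])
            if score >= threshold]

# Toy matrix, columns = thyroid, lung, liver, brain, heart (log scale).
genes = ["TG", "ACTB"]
expr = [[10.0, 2.0, 2.0, 2.0, 2.0],   # high in thyroid only
        [8.0, 8.1, 7.9, 8.0, 8.0]]    # similar everywhere
thyroid_specific = tissue_specific_genes(expr, genes, tissue_index=0)
```

With only a handful of tissues the largest possible Z-score is small (for n tissues and the sample standard deviation it is (n - 1) / sqrt(n)), so any threshold has to be chosen with the number of tissues in mind.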

After that, when you conduct your differential expression comparison in tumours between thyroid and lung, you can just filter the list of differentially expressed genes using the genes identified by the two-step approach above. This is the simplistic approach.
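
As a sketch of that filtering step, the snippet below reads "filter" as removing tissue-specific genes from the tumour comparison, which is an interpretation rather than something stated explicitly above. The function name and gene lists are placeholders.

```python
def drop_tissue_specific(de_genes, tissue_specific):
    """Keep differentially expressed genes that are NOT flagged as
    tissue-specific, preserving the original order of the DE list."""
    blocked = set(tissue_specific)
    return [gene for gene in de_genes if gene not in blocked]

# Placeholder inputs: a thyroid-vs-lung tumour DE list and the union of
# genes flagged as thyroid- or lung-specific in normal tissue.
de_genes = ["TG", "SFTPC", "TP53"]
flagged = {"TG", "SFTPC"}
remaining = drop_tissue_specific(de_genes, flagged)
```

This hard filter discards any gene that is both tissue-specific and genuinely altered in the tumour, which is one reason it counts as the simplistic option.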

There should be a way, however, to actually 'normalise' your tumour expression data based on the results that you obtain from the tissue-specific databases. For example, one could supply the healthy / normal tissue-specific data as priors to an empirical Bayesian regression model and then adjust the expression data based on these priors.

Also keep in mind that 'thyroid' and 'lung' refer to many different cell- and tissue-types.

Kevin

Hi Kevin, thank you for the very helpful reply! Just two short (and hopefully short to answer) follow-up questions: Do you know of any literature in which these or similar methods were used? And, keeping in mind that there are many different cell and tissue types, would you agree with us that this normalization approach is in general a meaningful thing to do?

Hello, sorry, no published literature behind this. However, it is part of work that is currently under peer review.

Yes, I believe you must correct / adjust for the different tissues in this situation.

Thanks again. Could you maybe drop us a link once you notice that the work has been accepted and published?

Sure, although I post in a lot of threads here, so it will be easy to forget. Maybe you could contact me at a later time via my email on GitHub.

Hi Kevin, regarding "could you maybe drop us a link should you notice the work was approved and published?": could you share the link to the reference mentioned in the above post?

My former colleagues are dreadfully slow with it, and I do not currently know the status. I have since published results from other studies that only began after I left that group.
