Biostar Beta. Not for public use.
How to select genes before log2 ratio on a RNASeq gene expression matrix, based on signal median
2
Entering edit mode
3.0 years ago
fbrundu • 280
European Union

I want to transform a TCGA mRNA expression matrix (in linear data format) to log2-ratios and then run a feature (gene) selection, selecting the 1000 most variant genes (genes with higher standard deviation across samples). The workflow is the following:

1. Select "good" genes before log2ratio (genes each with median signal at least t in p% of samples);
2. On selected genes, run log2ratio, dividing each gene by its median signal and then log2-transforming the result matrix;
3. Select the 1000 most variant genes along all samples.

How do I select t and p?

ADD COMMENTlink
0
Entering edit mode

Hello fbrundu!

It appears that your post has been cross-posted to another site: http://biology.stackexchange.com/questions/29878/how-to-select-genes-before-log2-ratio-on-a-rnaseq-gene-expression-matrix-based

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink
0
Entering edit mode

Hi Devon,

Yes, I am sorry that I annoyed you. But since they are different communities (afaik they are also run by different organizations) and they address slightly different topics, I did not know which was the correct place to post this question to. I think that my question is semantically correct for both communities (even if they can address different types of users), which may have an users intersection (I did not post elsewhere).

Sorry about that.

Francesco

ADD REPLYlink
3
Entering edit mode
17 months ago
National Institutes of Health, Bethesda…

There is not a general solution to select "t" and "p". Such choices are largely arbitrary. Furthermore, for an array platform, if one assumes that "t" has something to do with "expressed", the value for "t" will differ for each probe on the array.

Since you are ultimately going to filter based on variance, I'd suggest starting with your median-centered, log-transformed data and simply choose the top 1000 most variable genes.

ADD COMMENTlink
1
Entering edit mode

The data that OP is referring to is RNAseq; so no probes. Sequencing bias correction can be done for them.

ADD REPLYlink
0
Entering edit mode

Thanks Sean. Regarding your last suggestion, I thought that it could introduce problems, since genes with very low median signal can show a high variance when logratio transformed. What do you think?

Thanks

ADD REPLYlink
0
Entering edit mode

Well log transformation will bring down the variance. Imagine 4 samples [0.5, 2, 8, 32]. Without log transformation the variance is 213.5625 but when you log2 transform the data then the variance reduces to 6.67

In any case if the expression is consistently low then the variance will be low. You should be careful about log transformations especially when doing differential expression studies. I would suggest that you do the log transformation after selecting for median and variance.

ADD REPLYlink
0
Entering edit mode

I followed your first advice using custom thresholds.

Hoping this is useful, the code of the pipeline is available at https://gist.github.com/fbrundu/adf45f7ce817572bba10

I filtered out the genes which were below the overall 5th percentile in more than the 5% of the samples. I think it could be a reasonable threshold. In case tell me.

Thanks

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1