Question

Low-expressed gene filtering, quantile normalization, and log2 transformation: which one goes first?

1

7.4 years ago

ewre ▴ 250

Hi everyone,

I have been working with expression data for about 4 years (both microarray and RNA-seq), but this question still confuses me whenever I preprocess data. 1) My opinion is that we should at least filter out low-expressed genes first. The reason: the aim of the quantile/log2 transform is to give the data a better-behaved distribution, but if quantile/log2 goes first and low-expressed gene filtering follows, we may break that distribution.

2) For log2 transformation and quantile normalization, I really don't know which one goes first.

Thank you in advance for your time and valuable suggestions.
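To make point 2 concrete, here is a minimal sketch comparing the two orders on toy data. It assumes a simple quantile normalization (each sample is mapped onto the mean of the per-sample sorted values, with no special tie handling), a pseudocount of 1, and simulated values; `quantile_normalize` and `expr` are illustrative names, not the API of any particular package.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    Every column is mapped onto one reference distribution: the mean of
    the column-wise sorted values. Ties are broken by position.
    """
    order = np.argsort(x, axis=0)                 # per-sample sort order
    ranks = np.argsort(order, axis=0)             # rank of each gene within its sample
    reference = np.sort(x, axis=0).mean(axis=1)   # mean of sorted values at each rank
    return reference[ranks]

rng = np.random.default_rng(0)
# Hypothetical toy matrix: 1000 genes x 4 samples of positive, skewed values.
expr = rng.lognormal(mean=2.0, sigma=1.5, size=(1000, 4))

log_then_qn = quantile_normalize(np.log2(expr + 1))
qn_then_log = np.log2(quantile_normalize(expr) + 1)

# Both orders keep the same gene ranking within each sample ...
same_ranks = np.array_equal(np.argsort(log_then_qn, axis=0),
                            np.argsort(qn_then_log, axis=0))
# ... but the normalized values are not identical.
max_diff = np.abs(log_then_qn - qn_then_log).max()
print(same_ranks, max_diff)
```

In this sketch the two orders agree on ranks but not on values: the reference distribution is an average taken on the log scale in one case and on the raw scale in the other, and the mean of logs is not the log of the mean.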

log2 quantile data transform preprocessing • 3.4k views

updated 7.4 years ago by Farbod ★ 3.4k • written 7.4 years ago by ewre ▴ 250

0

If you remove low-expressed genes first (across the samples/cohort) and then log-transform FPKM + 1, the results should be fine.
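A minimal sketch of that order (filter, then log-transform), under stated assumptions: the cutoffs `min_fpkm` and `min_samples` are hypothetical and should be chosen for your own data, log base 2 is used because that is what the question asks about, and `filter_then_log2` is just an illustrative helper name.

```python
import numpy as np

def filter_then_log2(fpkm, min_fpkm=1.0, min_samples=2, pseudocount=1.0):
    """Drop low-expressed genes, then return log2(FPKM + pseudocount).

    A gene (row) is kept if it reaches `min_fpkm` in at least
    `min_samples` samples (columns). Both cutoffs are assumptions.
    """
    fpkm = np.asarray(fpkm, dtype=float)
    keep = (fpkm >= min_fpkm).sum(axis=1) >= min_samples
    return np.log2(fpkm[keep] + pseudocount), keep

# Toy genes x samples FPKM matrix (made-up numbers).
fpkm = np.array([
    [0.0, 0.2, 0.0, 0.1],    # low everywhere: dropped
    [5.0, 7.0, 6.0, 3.0],    # expressed in all samples: kept
    [0.5, 1.5, 0.0, 2.0],    # passes the cutoff in 2 samples: kept
    [0.0, 0.0, 15.0, 0.0],   # passes the cutoff in only 1 sample: dropped
])

log_expr, keep = filter_then_log2(fpkm)
print(keep)
print(log_expr)
```

The filter is applied across the whole cohort, so a gene that is high in a single sample can still be dropped; loosen `min_samples` if that is not what you want.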

7.4 years ago by Ron ★ 1.2k

0

So is this question about RNA-seq or microarray?

7.4 years ago by WouterDeCoster 47k

0

Hi WouterDeCoster, I want to keep it general, covering both RNA-seq and microarray.

7.4 years ago by ewre ▴ 250

0

RNA-seq and microarray are both transcriptomics, but that's the end of the similarities. Microarray data are continuous intensities; RNA-seq data are discrete counts (sampled from a negative binomial distribution, i.e. an overdispersed Poisson distribution).

I'll leave microarray analysis for someone else, but for RNA-seq the most accepted approach is to use tools like DESeq2 and edgeR, which model the data assuming this negative binomial distribution. So you don't want to preprocess the data here: to work optimally, the software expects raw, unmanipulated counts.
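To illustrate what "overdispersed Poisson" means here, a small simulation sketch. The mean `mu` and dispersion `alpha` are made-up values, and the parameterization variance = mu + alpha * mu^2 is an assumption chosen for the example; this only shows the shape of the distribution and is not how DESeq2 or edgeR estimate anything.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = 100.0     # hypothetical mean count for one gene
alpha = 0.2    # hypothetical dispersion: variance = mu + alpha * mu**2

# Convert (mu, alpha) to NumPy's (n, p) parameterization,
# where the mean of the distribution is n * (1 - p) / p.
n = 1.0 / alpha
p = n / (n + mu)

nb_counts = rng.negative_binomial(n, p, size=100_000)
pois_counts = rng.poisson(mu, size=100_000)

# Same mean, but the negative binomial variance is far larger than the
# Poisson variance, which stays close to the mean.
print("NB      mean/var:", nb_counts.mean(), nb_counts.var())
print("Poisson mean/var:", pois_counts.mean(), pois_counts.var())
```

Both samples are integer counts with roughly the same mean; only the spread differs, which is the extra variability that count-based models need to see in the raw data.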

7.4 years ago by WouterDeCoster 47k

0

Thank you. Exactly: for raw RNA-seq read counts, I usually use DESeq2 and edgeR for DEG analysis. But sometimes I only have RPKM/FPKM data, and that's where I run into trouble.

7.4 years ago by ewre ▴ 250

score 0 · Answer 1 · 2016-11-09

0

7.4 years ago

Farbod ★ 3.4k

Dear hanguangchun, Hi.

I think removing low-expressed genes first and then applying the log2 transform is the more usual order.

Also, please have a look at "There are too many transcripts! What do I do?"

and the "IsoPct < 1" section of this paper for excluding spurious transcripts.

~ Best

7.4 years ago by Farbod ★ 3.4k

0

Thank you very much for the information, Farbod.

7.4 years ago by ewre ▴ 250