Question

Filtering bad quality probes in Illumina microarray data

3

17 months ago

seta ★ 1.9k

Dear all,

I used this workflow to analyze the Illumina microarray dataset GSE35088. However, I did not obtain any DE genes, whereas GEO2R gave about 1000 DE genes. As far as I can tell, GEO2R does not perform a filtering step. When I filtered the probes (control probes, those with no symbol, and those that failed), only about 800 of the 22523 probes remained. Is this usual, or is something wrong?

Also, I guess the normalization step differs between the two workflows, is that right? If so, would GEO2R not be safe for getting accurate results? Kindly let me know if you have any suggestions or advice.

Thank you.

microarray illumina Kevin-Blighe GEO2R • 1.6k views

score 0 · Answer 1 · 2022-11-12

0

17 months ago

Gordon Smyth ★ 7.1k

This data has a published analysis https://doi.org/10.1371/journal.pone.0040161 that appears to demonstrate good quality data.

By contrast, throwing out almost all the data as you report doing does not sound remotely sensible. How have you convinced yourself that almost all the probes have "failed"?

0

Thank you, Gordon.

Yes, it does not seem reasonable to me either. However, after normalization, I filtered out the unexpressed (failed) probes, i.e. those with detection p-value > 0.05. I kept the probes with detection p-value <= 0.05 in at least 3 arrays, as that is the limma default if I remember correctly.

Your suggestions would be highly appreciated.

ADD REPLY • link 17 months ago by seta ★ 1.9k

1

A probe being unexpressed is not the same as "failed", nor does it mean that the probe is of poor quality. I do not know of any circumstances where an Illumina beadchip probe can be said to have "failed". A probe that reports that the corresponding gene is not expressed is doing its job correctly.

Anyway, I suspect that you may have been tricked by the fact that Illumina sometimes reports detection p-values such that p < 0.05 means expressed and sometimes reports p-values such that p > 0.95 means expressed. In other words, the detection p-values are sometimes 1 minus what you expect them to be. I suspect that these arrays are using the latter version whereas you're assuming the former. If you use the limma functions read.ilmn and neqc to process the arrays, then limma automatically checks which way around the p-values are.

For this dataset, when I check how many probes are significantly above background in at least 3 arrays, I get the following:

> library(limma)
> x <- read.ilmn("GSE35088_non_normalized.txt.gz", probeid="ID_REF")
Reading file GSE35088_non_normalized.txt.gz ... ...
> dim(x)
[1] 22523    24
> y <- neqc(x)
Note: inferring mean and variance of negative control probe intensities from the detection p-values.
> keep <- (rowSums(y$other$Detection > 0.95) >= 3)
> table(keep)
keep
FALSE  TRUE 
10646 11877

Note that probe filtering for this dataset should take into account the design of the experiment, which in this case includes technical replicates and about 8 arrays per experimental condition. Checking detection in >=3 arrays is not a universal recommendation or a default in limma.
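One way to tie the detection filter to the design is to require detection in at least as many arrays as the smallest experimental group, so that a gene expressed in only one condition can still pass. The following is a minimal sketch under assumptions, not a rule stated in this thread: the helper `filter_by_detection`, the group labels, and the simulated detection matrix (3 groups of 8 arrays) are all hypothetical, and the 0.95 cutoff follows the orientation used in the output above. It does not account for technical replicates, which would need separate handling.

```r
# Hypothetical helper: keep probes detected in at least as many arrays as
# the smallest group. `det` is a probes x arrays matrix of detection values
# where values above `cutoff` mean "expressed" (the p > 0.95 convention).
filter_by_detection <- function(det, group, cutoff = 0.95) {
  stopifnot(ncol(det) == length(group))
  min_arrays <- min(table(group))   # smallest group size (assumed rule)
  rowSums(det > cutoff) >= min_arrays
}

# Simulated example (not real data): 24 arrays in 3 groups of 8.
set.seed(1)
group <- factor(rep(c("A", "B", "C"), each = 8))
det <- matrix(runif(500 * 24), nrow = 500)
det[1:100, ] <- 0.99                          # detected in every array
det[101:200, 1:8] <- 0.99                     # detected in group A only
det[101:200, 9:24] <- 0.10
det[201:500, ] <- pmin(det[201:500, ], 0.90)  # never above the cutoff
keep <- filter_by_detection(det, group)
print(table(keep))
```

With limma objects, the equivalent call would pass `y$other$Detection` as `det` and the experimental condition of each array as `group`.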

ADD REPLY • link 17 months ago by Gordon Smyth ★ 7.1k

0

Thank you very much.

So about half of the probes are expressed, which is completely reasonable. Could you please let me know how to tell whether the 0.05 or the 0.95 detection p-value cutoff applies?

ADD REPLY • link 17 months ago by seta ★ 1.9k

1

Just check whether the detection p-values increase or decrease with the intensities for each array. Any quick look at the detection p-values will tell you which of those is true.
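The check described above can be sketched in R as follows. This is only an illustration, assuming a matrix of raw intensities and a matching matrix of detection p-values; the helper name `detection_direction` and the simulated data are hypothetical.

```r
# Hypothetical helper: for each array (column), report whether detection
# p-values fall or rise as intensity rises, using Spearman rank correlation.
detection_direction <- function(E, det) {
  stopifnot(identical(dim(E), dim(det)))
  rho <- vapply(seq_len(ncol(E)), function(j) {
    cor(E[, j], det[, j], method = "spearman", use = "complete.obs")
  }, numeric(1))
  names(rho) <- colnames(E)
  # Negative correlation: small p-values mean expressed (p < 0.05 convention).
  # Positive correlation: large p-values mean expressed (p > 0.95 convention).
  ifelse(rho < 0, "small p = expressed", "large p = expressed")
}

# Simulated example (not real data): two arrays of 1000 probes in which
# the detection values are constructed to increase with intensity.
set.seed(1)
E <- matrix(rexp(2000, rate = 1 / 200), ncol = 2,
            dimnames = list(NULL, c("array1", "array2")))
det <- apply(E, 2, function(v) rank(v) / length(v))
direction <- detection_direction(E, det)
print(direction)
```

For data read with `read.ilmn`, the call would be `detection_direction(x$E, x$other$Detection)`. A simple scatter plot of one array's detection p-values against its intensities shows the same thing by eye.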

ADD REPLY • link 17 months ago by Gordon Smyth ★ 7.1k

0

Many thanks for your support!

One more question: when I do not filter probes on the detection p-value, I get about twice as many DE genes as when I keep only the filtered subset. Is this probably due to more accurate variance estimation when all genes/probes are present? If so, are the DE genes obtained this way reliable? Could you please share your advice on whether to keep all probes or to filter on the detection p-value?

Thank you

ADD REPLY • link 17 months ago by seta ★ 1.9k