Using Biological Replicates To Filter Growth Dependent Probes In Microarray Experiments
12.8 years ago
Sander Timmer ▴ 710

For our current microarray study we have run 5 replicates for each of 2 biological samples to help us filter out growth-dependent probe variance.

So far we can see that quite a few probes have a higher variance within the replicates than between samples, but defining a proper filter cutoff is hard. Are there any standardised methods for doing this? Are there any papers or methods people would suggest that would help us define an optimal filter?

To clarify a bit: the samples we have are treated and non-treated, and are measured using our own tiled-array design. In our GWAS we use the ratio between treated and non-treated as our signal (think of it as eQTLs), which is associated with HapMap genotypes. So in total we have 60 (×2) samples measured and 2×5 (×2) replicated samples measured. We would like to use these replicates to filter out intra-individual probe variation to some extent, to create a clearer overall signal. In general, what I'm asking is whether there is a standard way (if any) of setting a filtering threshold based on these replicates.

To emphasise: the GWAS is a step further downstream and is, in my opinion, not really relevant to this question, which is why I didn't mention it in the first place.

microarray probeset filter replicates • 4.2k views

Could you elaborate on your experiment? What are the variables in your experiment? What is the hypothesis you are testing?


Our experiment consists of microarrays for 60 individuals. For 2 of them there are 5 replicates (5 adjacent days). The reason for this is that these samples are human cell lines, which most likely adds a layer of growth variance. We would like to filter out targeted regions that are influenced more by growth than by actual between-individual variance.


And what is the biological hypothesis that you want to test? How do the 60 arrays from individuals come into it?


We're performing a GWAS on these 60 samples to find heritable chromatin statuses. However, because these are cell lines, we have seen quite a lot of noise for some probes within the replicates. We could simply filter these by removing the top 10% most variable probes from our dataset before summarisation.

I was just wondering if anyone knows of a better/improved/de facto way of doing this. I cannot imagine that we're the first to try something like this to improve microarray results.
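
For reference, the simple "drop the top 10% most variable probes" filter mentioned above could be sketched roughly as follows. This is only an illustration: the dictionary layout, the function name `filter_most_variable` and the example values are made up, and the 10% fraction is the arbitrary choice from the post, not a recommended threshold.

```python
from statistics import variance

def filter_most_variable(replicates, drop_fraction=0.10):
    """Return the probe ids kept after dropping the most variable fraction.

    replicates: dict mapping probe id -> list of replicate measurements
    (hypothetical layout, one value per replicate day).
    """
    # Sample variance of each probe across its replicate measurements.
    probe_var = {probe: variance(values) for probe, values in replicates.items()}
    # Rank probes from most to least variable.
    ranked = sorted(probe_var, key=probe_var.get, reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    dropped = set(ranked[:n_drop])
    return [probe for probe in replicates if probe not in dropped]

# Ten made-up probes; only "p3" fluctuates strongly across the 5 replicate days.
example = {f"p{i}": [1.0, 1.1, 0.9, 1.0, 1.0] for i in range(10)}
example["p3"] = [0.2, 1.9, 0.5, 1.6, 0.8]

kept = filter_most_variable(example)
print(kept)
```

A fixed percentage like this ignores how large the between-individual signal is for each probe, which is the weakness the question is about.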


Wait a sec... What kind of arrays are these? You are doing GWAS, meaning they are SNP arrays? If so, how do you explain the variation in the first place? (And no, if this is about expression arrays, you are not the first to try to check the relation between intra- and inter-individual variation.)

12.8 years ago
Lyco ★ 2.3k

In microarray experiments, there is no clear-cut way of defining cut-offs for anything, other than this simple rule: you should only consider expression differences that are significantly beyond what you would expect from replicates.

You describe cases where you observe 'a higher variance within the replicates than between samples'. Unless there is some major experimental problem, this can only be explained by chance: in the best case, the 'different samples' are not fundamentally different from 'replicates', e.g. when the differences are minor.

You should be aware of the different sources of variations:

  • technical random noise (inter-array differences). This is what you see in purely technical replicates (split an RNA sample into two halves, process them separately and hybridize them on different arrays). This variability is superimposed onto everything that you measure. In my experience (many 1000s of microarrays) this is nowadays not too big a problem and is usually smaller than the other noise sources. However, it sets a technical limit, so you can't expect more accurate results than that.

  • biological variability. Can be a big problem in outbred populations (such as humans). There are plenty of 'expression polymorphisms' caused by inter-individual differences. Even worse, there are differences not only in the baseline expression level (which you could 'normalize out') but also in the response to various stimuli. You would really have to sample many individuals before claiming that an observed difference is a general phenomenon.

  • sampling variability. This is a major issue that is often not sufficiently recognized. It is caused by (hidden) variations in sample preparation. Factors to be controlled are i) time of day, ii) nutrition status, iii) drugs taken, iv) cell composition of biopsy samples, etc. Which of these factors is important depends on your system. Often iv) is of major concern because even minute contamination with other tissue (blood! fat!) can lead to dramatic changes in the expression profile.

  • systematic variations. These can often be avoided but nevertheless might be a big problem. Possible causes are batch effects (of the array, enzymes, buffers) or a change in the microarray operator. Sometimes even a different hybridisation chamber or a different room temperature can make a big difference.

These sources of variations have to be weighed against the expression changes you expect to see. Sometimes the expression changes are so big that you hardly have to worry about variations (e.g. when exposing cells to toxins or LPS or such) but often you have to.

If you expect subtle expression changes (which you probably do, judging by your question) you should pay a lot of attention to your replicates, maybe even increase the number of replicates. When you want to judge whether a given expression difference is meaningful, you have to apply statistical tests (e.g. t-test or ANOVA) to calculate the significance, i.e. the probability of seeing a difference at least this large by chance alone. Very important: when doing statistics on microarray results, don't forget to apply a correction for multiple testing (e.g. Benjamini-Hochberg). By doing this, you will get an 'implicit cut-off' which is directly based on the variability you observed in your replicate samples.
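
As a rough illustration of the multiple-testing step, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. The p-values are invented for the example and are assumed to come from a per-probe test such as a t-test; the function name `benjamini_hochberg` and the 0.05 threshold are illustrative choices, not something prescribed above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Made-up raw p-values, one per probe.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)

# Probes passing a 5% false discovery rate threshold.
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

Note that the third and fourth probes have raw p-values below 0.05 but do not survive the correction, which is the 'implicit cut-off' effect described above.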


Thanks for your answer, very high quality!

We think that because we're working with growing cell cultures there is some major sampling variability. That's why, for 2 samples, we have measurements on 5 adjacent days (these are growing samples). What we would like to do is get rid of the probes that change consistently in these replicates, because this is a growth effect rather than a biologically relevant effect.


So you have two kinds of cell cultures (one treated, one not; or one behaving one way, one another) and you plan to tie the differences to some kind of expression change, excluding those that are just due to the prolonged time in culture? Or do you expect the cultured cells to undergo some 'transformation' that is mirrored by expression changes which are not just due to culturing? In either case, you need one type of culture that just grows and does not undergo the change/treatment. You would then look for expression changes that go beyond the variability in the control culture.


One recommendation: do not expect too much 'consistency' in the expression changes, i.e. don't expect that expression goes up or down continuously, with each time point stronger than the previous one. Consistent trends like this are often killed by random noise, e.g. technical variability. If you have to look for continuous trends, better allow some outliers! Moreover, I wouldn't expect too many continuous changes but rather sudden changes if things change (end of log phase, confluence, senescence, whatever).

12.8 years ago

Assuming they are indeed expression arrays (I don't really understand how that relates to you saying you are doing GWAS), I think you can actually follow a very practical approach. Your "intra-individual" variation (the variation you see in your time series of the same sample) will, for some genes, be large relative to the inter-individual variation. How large it needs to be to become a real problem is largely arbitrary. But if it is too large, that means you cannot meaningfully measure the variation you are really interested in, so you should blacklist those genes. I think that is what you planned to do, right? If so, I think that definitely makes sense.

The reason why this "intra individual" variation is high is actually not that important for your purpose. Some of the errors Lyco mentions occur for all samples, and thus are part of all variations, some are not. You might for instance indeed have a growth effect, and some of the genes might simply be lower expressed and harder to measure leading to a higher technical variation (affecting both, but maybe making the intra-individual variation too large).

I deliberately called it variation. It doesn't really matter whether that is consistent over time. You could just calculate the standard errors for both variations (take the average of the 2 variations of 5 repeats for the intra individual variation, or the highest, depending on how critical you want to be) and then set a cut off point for the ratio between those two.
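
A minimal sketch of that ratio, assuming one list of 5 replicate values per replicated individual and one value per individual for each probe. All names (`standard_error`, `variation_ratio`, `blacklist`), the data layout, the example numbers and the cut-off of 1.0 are made up for illustration; as said above, the cut-off itself is arbitrary.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(values) / sqrt(len(values))

def variation_ratio(replicate_series, individuals, combine=mean):
    """Intra- over inter-individual standard error for a single probe.

    replicate_series: list of replicate lists, one per replicated individual.
    individuals: one measurement per individual.
    combine: mean (more lenient) or max (more critical) over the series.
    """
    intra = combine([standard_error(series) for series in replicate_series])
    inter = standard_error(individuals)
    return intra / inter

def blacklist(probes, cutoff, combine=mean):
    """Return the probes whose variation ratio exceeds the chosen cut-off."""
    return [
        name
        for name, (series, individuals) in probes.items()
        if variation_ratio(series, individuals, combine) > cutoff
    ]

# Made-up data: probe -> (two replicate series, values across individuals).
probes = {
    "stable_probe": (
        [[5.0, 5.1, 4.9, 5.0, 5.0], [6.0, 6.1, 5.9, 6.0, 6.0]],
        [4.0, 5.0, 6.0, 7.0, 8.0, 5.5],
    ),
    "noisy_probe": (
        [[4.0, 6.0, 5.0, 7.0, 3.0], [5.0, 8.0, 4.0, 6.0, 7.0]],
        [5.0, 5.2, 4.9, 5.1, 5.0, 5.3],
    ),
}

flagged = blacklist(probes, cutoff=1.0)
print(flagged)
```

Passing `combine=max` uses the larger of the two replicate-series errors, which corresponds to the more critical option mentioned above.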


Thanks for your input! Until now I was mostly using the SD, but your suggestion of moving to the standard error actually makes more sense! The ratio between the two errors in particular seems to work pretty well for creating blacklists.


Blacklists of genes for microarray analysis are a bad idea. Take it from me.


@Lyco why would that be bad? Some genes are indeed very variable in certain experiments, like the abundant reticulocyte genes in PBMC experiments. I don't think it is a bad idea to leave them out. Although we often ignore the whole pathway in the end, instead of the individual genes.


If a gene has a high within-group variability, it will not be found to be significant in the between-group test. I see no need for a general "good gene"/"bad gene" binning. First, two states might not be a good match for reality. Second, a "bad gene" in one experiment might be a "good gene" in another one.
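
To illustrate the first point with made-up numbers: the sketch below computes a Welch-type t statistic for two hypothetical genes that have the same difference in group means but different within-group spread. The function name `welch_t` and all values are illustrative only.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch t statistic: difference in means over its estimated standard error."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Both genes differ by 1.0 between groups; only the within-group spread differs.
t_tight = welch_t([5.0, 5.1, 4.9, 5.0], [6.0, 6.1, 5.9, 6.0])
t_noisy = welch_t([3.0, 7.0, 4.0, 6.0], [4.0, 8.0, 5.0, 7.0])

print(t_tight, t_noisy)
```

The gene with the large within-group spread ends up with a much smaller absolute t statistic, so it is far less likely to pass a significance threshold without any explicit blacklist.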


Unfortunately, within-group variability is not always randomly distributed over the two groups. In the example of reticulocyte genes in PBMC samples, one of the two groups might just contain more of these cells, and that would seem to cause a systematic between-group effect. In such a case blacklisting might help, and in any case it would not hurt. I think you are right that generalized blacklists will not work, but I don't think that is what Sander wants to do.

