Batch Correction and Batch Size
9.5 years ago

I have pooled 123 samples from two GEO antibody microarray studies that used the same platform. I downloaded the raw .gpr files and opened each one in Excel to get the scan date of each sample (presumably represented by the variable DateTime), which I recorded in another Excel sheet.

My understanding is that if two samples have different scan dates, they are from different batches. If so, then the 123 samples break down into the following batches:

Batch 1: 4 samples
Batch 2: 2 samples
Batch 3: 2 samples
Batch 4: 4 samples
Batch 5: 4 samples
Batch 6: 8 samples
Batch 7: 8 samples
Batch 8: 8 samples
Batch 9: 8 samples
Batch 10: 8 samples
Batch 11: 12 samples
Batch 12: 7 samples
Batch 13: 12 samples
Batch 14: 3 samples
Batch 15: 6 samples
Batch 16: 2 samples
Batch 17: 4 samples
Batch 18: 1 sample
Batch 19: 3 samples
Batch 20: 3 samples
Batch 21: 4 samples
Batch 22: 2 samples
Batch 23: 4 samples
Batch 24: 2 samples
Batch 25: 2 samples

Should I keep the above delineation of batches or should I combine small batches? Any advice?

Also, batches 1-14 were scanned between 11/9/2010 and 12/17/2010, while batches 15-25 were scanned between 3/23/2012 and 4/27/2012.

Thanks,
James

Batch-Adjustment Microarray Batch-Correction • 3.0k views

Make a PCA plot and/or cluster the samples and see how they group. That's usually an effective way to gauge batch effects. Also, have a look at ComBat() in the sva Bioconductor package.
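
As a rough illustration of the PCA suggestion, here is a minimal sketch that computes each sample's score on the first principal component using only the Python standard library (power iteration on the sample-by-sample Gram matrix). The intensity values and batch labels are made up for illustration, and in practice a numerical library's PCA routine plus a scatter plot colored by batch would be the more usual route.

```python
import math

def first_pc_scores(rows, iterations=200):
    """Return each sample's score on the first principal component.

    rows: one list of feature intensities per sample (samples x features).
    """
    n = len(rows)
    n_feat = len(rows[0])
    # Center every feature across samples.
    means = [sum(r[j] for r in rows) / n for j in range(n_feat)]
    centered = [[r[j] - means[j] for j in range(n_feat)] for r in rows]
    # Sample-by-sample Gram matrix; its top eigenvector, scaled by the
    # square root of its eigenvalue, gives the PC1 scores.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    # Power iteration from a fixed, non-constant start vector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n  # no variance at all in the data
        v = [x / norm for x in w]
    eigenvalue = sum(vi * sum(g * x for g, x in zip(row, v))
                     for vi, row in zip(v, gram))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [vi * scale for vi in v]

# Hypothetical intensities: three samples per batch, three features each.
rows = [
    [10.0, 5.0, 1.0], [11.0, 5.5, 1.2], [10.5, 4.8, 0.9],   # batch "A"
    [20.0, 9.0, 1.1], [21.0, 9.5, 1.0], [19.5, 9.2, 1.3],   # batch "B"
]
batches = ["A", "A", "A", "B", "B", "B"]

scores = first_pc_scores(rows)
for batch, score in zip(batches, scores):
    print(batch, round(score, 2))
```

If samples from the same batch land together on PC1 (or PC2) regardless of their biological group, that is a hint of a batch effect. The sign of the scores is arbitrary; only the grouping matters.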


Also, never, ever use Excel for anything bioinformatics.
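
For what it's worth, the DateTime field can be pulled out of the .gpr header without Excel. The sketch below assumes the usual Axon Text File layout (a signature line, then a line whose first number is the count of header records, then that many quoted Key=Value records) and assumes DateTime holds a date and a time separated by a space. The demo header is hypothetical and much shorter than a real one.

```python
import io

def read_gpr_header(lines):
    """Return the Key=Value header records of a GenePix .gpr file as a dict."""
    it = iter(lines)
    next(it)                                # line 1: "ATF<TAB>1.0" signature
    n_records = int(next(it).split()[0])    # line 2: header-record count, column count
    header = {}
    for _ in range(n_records):
        record = next(it).strip().strip('"')
        key, sep, value = record.partition("=")
        if sep:                             # skip records without a Key=Value form
            header[key.strip()] = value.strip()
    return header

# Hypothetical three-record header; real files carry many more records.
demo = io.StringIO(
    'ATF\t1.0\n'
    '3\t5\n'
    '"Type=GenePix Results 3"\n'
    '"DateTime=2010/11/09 14:23:11"\n'
    '"Scanner=Hypothetical scanner"\n'
    '"Block"\t"Column"\t"Row"\t"Name"\t"ID"\n'
)
header = read_gpr_header(demo)
scan_date = header["DateTime"].split()[0]   # assumes "date time" separated by a space
print(scan_date)
```

For real files, the same function can be fed an open file handle for each .gpr file, which gives one scan date per sample with no manual copying.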


Thanks for the response, but that didn't really answer my question, although I probably wasn't very clear. Basically:

  1. Am I correct in organizing the 123 samples into 25 batches the way that I did? Since posting this question, I've realized that each sample's .gpr file has, along with DateTime, a GalFile variable with values such as: GalFile = C:\Users\Genepix\Desktop\ProtoArray\HA20251.gal. The item of interest here is HA20251, which I recalled seeing somewhere in the provided .xls workbook of processed data as a "lot number". Should I consider a batch to be "samples with the same lot number" (i.e. one batch would be all the samples with "HA20251" in their .gal file path), or should I keep my batch definition as "all samples with the same day in their DateTime variable"?

    Essentially, I'm hoping to extract from the provided data files an explicit batch identifier for each sample, to be used in a Target file in order to load the data into the PAA R package and then apply batch adjustment. If I can't get explicit batch identifiers (which I think I can), then I'll need algorithms to "discover" batch effects.

  2. Assuming I was correct in organizing the 123 samples into 25 batches the way that I did, is it problematic to have batches of size 1 and 2? Is there a motivation for combining small batches with a nearby neighbor? For example, suppose 1 sample was scanned on Monday, and 7 samples were scanned on Tuesday, the day after. Would it make more sense to consider them as Batch1 = 1 sample, Batch2 = 7 samples, or to have all 8 samples in one batch?


I answered the question you should have asked, rather than the one you did ask :)

  1. The way you're doing it currently seems correct. Perhaps using HA20251, etc. as the batch identifier instead would work better, but the only way to know would be to contact the people who produced the data (or cluster things as I suggested earlier).
  2. Batches of size 1 end up being useless. A batch of size 2 may be useful, depending on whether the batch members are all from the same treatment group or not (it's better if they're not).
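
To compare the two candidate batch definitions and spot the problem cases mentioned above, something like the following sketch could tabulate batch sizes and flag singletons and batches drawn from a single group. The sample records, the dates, the group labels, and the second lot name (HA20252) are all hypothetical; only the HA20251 path comes from the thread.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

def lot_from_gal(gal_path):
    """Take the lot identifier to be the .gal file name without its extension."""
    return PureWindowsPath(gal_path).stem

def summarize_batches(samples, batch_key):
    """Map each batch to (size, status) under the given batch definition."""
    groups_by_batch = defaultdict(list)
    for sample in samples:
        groups_by_batch[batch_key(sample)].append(sample["group"])
    summary = {}
    for batch, groups in groups_by_batch.items():
        if len(groups) == 1:
            status = "singleton"        # cannot be adjusted on its own
        elif len(set(groups)) == 1:
            status = "single group"     # batch is confounded with group
        else:
            status = "mixed groups"
        summary[batch] = (len(groups), status)
    return summary

GAL_A = r"C:\Users\Genepix\Desktop\ProtoArray\HA20251.gal"
GAL_B = r"C:\Users\Genepix\Desktop\ProtoArray\HA20252.gal"  # hypothetical lot

# Hypothetical sample sheet.
samples = [
    {"id": "s1", "scan_date": "2010/11/09", "gal": GAL_A, "group": "case"},
    {"id": "s2", "scan_date": "2010/11/09", "gal": GAL_A, "group": "control"},
    {"id": "s3", "scan_date": "2010/11/10", "gal": GAL_A, "group": "case"},
    {"id": "s4", "scan_date": "2012/03/23", "gal": GAL_B, "group": "case"},
    {"id": "s5", "scan_date": "2012/03/23", "gal": GAL_B, "group": "case"},
]

by_date = summarize_batches(samples, lambda s: s["scan_date"])
by_lot = summarize_batches(samples, lambda s: lot_from_gal(s["gal"]))
print(by_date)
print(by_lot)
```

Whichever definition leaves fewer singletons and fewer single-group batches is easier to adjust for, though that alone does not show it is the right definition.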

Good to know. That helped a lot, thank you!
