Hi, I did my analysis on GeneChip Mouse Gene ST 2.0 Array, Affymetrix. I compared expression profiles of 4 experimental groups and got some dissapointig and surprising results). At best 2 genes are differentially expressed (limma, adjusted p 0.05 - BH correction) between groups, and none at all between some. Biologically, it doesnt make much sense to get no difference (we are comparing inflammatory cells in joints of arthritic mice and control mice that dont have arthritis), but I guess the changes are not necesarily on transcription level. Now I find myself with all those microarrays and expression values, but the pipeline for analysis sort of stops here since for further pathway/functional analyses are all based on experimental group comparison. Is there a way to find out anything about this cells that we analyzed in general - not in the sense of between group comparison? Like are they activated inflammatory cells or not or something form the expression values? Is it possible to do some enrichement for gene sets that have high expression values even though they are not different between groups? Is it possible to compare the values within one array (gene of interest - housekeeping gene)? From what I know about microarrays, the analysis is based on comparison of experimental groups and expression values by istelf dont mean anything at all, so I realize this questions probably dont make sense but I wanted to double check if there is some way to put this enormous set of expression data into good use.
Thanks
Yes, synovial sample but murine synovia - cells were released from the knee joint by digestion with collagenase and myeloid (CD11b+ Gr-1+) population was sorted by FACS and used for microarray analysis.
I included 14.414 genes in limma analysis. Only main probes were included, I removed those without entrez ID, .. I attached the R code bellow:
I also excluded probes with low variability based on row SDS
I also tried independant hypothesis weighting (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4930141/) instead of prefiltering the dataset with row SDS, but it gave the same result.
Adjusted p values in my most relavant group comparison for top 10 genes (from lowest - highest): 4.60E-06, 4.60E-06, 0.137408915 0.164786542 0.226554087 0.226554087 0.304023382 0.432896157 0.432896157 0.572612477 So 15% FDR would add just one other gene. Mostly it looks like the groups are not different between eachother at all.
Would you recomend another strategy to filter the genes?
n.lu : Please use
ADD COMMENT/ADD REPLY
when responding to existing posts to keep threads logically organized.This should have been a comment against @Kevin's answer.
This is a silly question, but what are the top genes that have passed FDR? They are obviously not what you would expect(?)
There is a big gap between the top 2 genes and the others, so, it does not look correct. How did the nomalised data look in a boxplot and histogram?
We have a knockout for a gene involved in apoptosis, and those mice dont develop experimental arthritis like wild type mice. So those were my 4 groups - wt mice control (no arthirits), wt mice with arthritis, knockout with arthritis, knockout control. wild type control vs arthritis show no differentially expressed genes at all (which is very strange). The comparison that we were most interested in was wild type arthritis vs knockout arthritis - and thats where the two genes mentioned above come up - Mid1 and Erdr1. They have been described in inflammatory conditions, Erdr1 even in murine arthritis, but I still found it wierd that no other genes acting together with these two would be changed. (The knockout gene has adjusted p value 0.13, logFC 1, but is not very highly expressed in those cells). If I try hierarchical clustering / PCA / NMF the samples dont group by experimental groups at all (and not by batch effect either).
When the hybridization of the chips was performed, there was a problem with the scanner, so the chips were stored (properly) from 3-5 days after hybridization and before scanning. My first thought was that that was the problem, but then QC looked fine - I will attach boxplots and histiograms from before and after normalization.
A couple of more things to check:
I didnt adjust for batch effect. PCA and clustering didnt show any gropuping by batch (half of the arrays were scanned on a different day). I also tried the algorithm for discovering unknown batches in SVA package and it foud 0 surrogate variables (although this is probably methodologicaly wrong, since I know my batches).
I am attahing the PCA plot below (group 1-4: WT_CTRL, WT_arthritis, KO_CTRL, KO_arthritis). The second graph is the same, I just marked them by the day they were scanned to check for batch.
Thank you very much for helping me with this!
No problem - interesting data.
What is the explained variance along PC1 and PC2? Outliers can affect the derived stats, too!
Also, what if you tried the contrasts separately for just your samples of interest? Sometimes I worry about complex design formulae when used with a dataset of low numbers of samples.
Here is how I do it with Limma (this for a recent Affymetrix Genechip HuGene ST 2.0 array):
In my data, my groups are defined in
targetinfo$Group
That's more or less the same as yours, but slightly different too.
Hi, thank you very much for sharing this script (and sorry for the late reply). The results are pretty much the same (same 2 genes significant, and than FDR jumps to 13% on the next one, and than to 50%).
For principal components:
Does that give you the info about explained variances?
Hey, yes, that gives the information regarding the proportion of variance. For PC1, it says that it's 16.88%, which is common for this type of data. For example, in my array experiment (3 conditions), and in which I found many genes differentially expressed, my PC1 was 13%.
The only other thing that I can think is to group your samples into 2 groups and just compare those. Your sample numbers right now are very low and the model you have been trying to do is too complex for the low sample numbers.
If that does nothing, then you may have to consider reducing FDR to 20 or 25%, even though that will not return many extra genes. Using nominal P values will probably mean that you will struggle to publish the study and/or you may have to advertise it as purely hypothesis-driven that requires major validation in other datasets.
Apart from that, check again to ensure that nothing went wrong during the wet-lab preparation. However, this can be tricky because what can happen is that you may enter a 'battle' of the wet-lab blaming the analyst, and vice-versa.
Thanks for the explanation.
I tried grouping them based on genotype (knockout vs wild type) and arthritis vs cotrol. First comparison gives 5 genes, the second none. One other thing I did was group them in two groups based on the wey they group in hierarchical clustering. The groups dont realy make sense biologicaly, samples seem like they are randomly mixed. I just wanted to check what makes them cluster this way.. And turns out this gives 800 differentially expressed genes, mostly involved in cell cycle regulation and similar. So maybe these cells really dont differ that much between groups, but have some other differences that are group independant. I guess we will try to confirm the 2 genes that we did find by qrtPCR, and then try to explore their role in our arthritis model if the results are correct.
The only part I was worried about in the wet lab is the delay between hybridization and scaning. Would you say there is a way to see if the fluorescent signal was poor?
Thank you very much for your help
Yes, there are some diagnostic methods, but I rarely have to use them because everything usually goes smoothly! The last time that I used these was on a very old Affymetrix chip (but they may still work!). I think that you may need to load the
affy
andgcrma
packages in order to use these.Chip images
The first basic thing to check is an image of each chip. Take a look here: Microarray image explanation Thi scan be done with:
If there is degradation of probes due to the assumed wet-lab issue, then you should see very faint signals of light on some chips compared to others.
Compare perfect and mismatch probes
(may not function at all with the latest chips, because they no longer use mismatch probes directly!)
Use arrayQualityMetrics
This outputs a whole bunch of stuff into a new directory and adds a index.html file that summarises everything.
Thanks, I'll check those as well. Thank you very much for your suggestions and comments!!!
I now just realised that the other thread on chip images was with you, too! Your chip images appeared to be fine, but maybe the arrayQualityMetrics will reveal something.
Unfortunately, as we move forward with NGS and other technologies, I find that microarrays are less and less common.
Hi! Thank you for sharing the information.
Yes, it is very worrying that wild type control vs arthritis showed nothing. I would expect the greatest differences here, particularly in the synovium.
Which model formula did you use for Limma? Sometimes, that can be the issue and result in very low numbers of statistically significantly expressed genes.
If something occurred with the scanner (or before that point), then there is not much that we can do as analysts...
Limma: