2.9 years ago

hcv

For an assignment, I am analysing data from accession GSE54536. However, when considering the adjusted p-value, no DEGs are found. For the next assignment, however, I need a list of DEGs from this dataset. Should I not look at the adjusted p-value but just the p-value? Thanks in advance for any help.

Could you expand on the reasoning that led you to log-transform the dataset, please? That is, what was it about the "distribution of the data in the Value Distribution section" that led you to log-transform?

There are a couple of flags in there for me: the box-plot given within the Value Distribution indicates lower quartiles around zero for most of the samples in the version I'm looking at; it's less than zero for a couple of the samples.

Unless you made it yourself, you can rarely be sure where the whiskers end on a box plot, but the lower whisker will lie somewhere within the range of the data and it is negative for every one of those samples.

If there's a negative value in a dataset, what will happen to it upon log-transformation?

@OP: what was your cutoff? Regarding data transformation and the resulting analysis, I suggest you read the methods in the paper (https://www.ncbi.nlm.nih.gov/pubmed/24804238) and try to reproduce the authors' results. This would help you put your analysis in perspective.

Would following the authors' protocols for working with the raw data be of much value if the data were transformed before upload to GEO? I'd strongly suggest that OP read some of the metadata for the GSMxxxx files first to work out what manipulations were performed before upload.

The only metadata in the files is 'normalized signal intensity' which is clear from the box plot since values are comparable. Upon log-transformation of the data, NaNs are produced, but also the data is normally distributed, which is needed for the statistical tests performed in limma. My guess is I should focus on the p-value rather than the adjusted p-value since the adjustment of the p-value for DEG selection is not specified in the M&M of the paper.

Why are you log-transforming normalized signal intensities? Is that really necessary?

What is your cutoff adj. p-value to define a DEG?

You have to be careful with your wording here:

When you say that the data is normally distributed following log-transformation, do you mean the 'distribution of intensities across all probes for a given sample' or the 'distribution of intensities for a given probe across all samples' was normally distributed?

(I suspect you've looked at the former, which is kind of irrelevant to a statistical model that applies across-samples)

Fundamentally, limma doesn't require that the data is normally distributed across all the samples. It is an assumption of the model that there is some normally-distributed noise around the fitted values. That is, in an A-vs-B experiment like this, the values for a given probe should be (approx) normally distributed within group A, and also within group B (they don't need to look normally distributed in the amalgamation of groups A and B). So if you want to check whether your data are ok to be put into limma, you should look at a few different probes and for each of those probes do a box-plot, split by the experimental arms.
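If you'd rather look at numbers than a plot, the per-probe check described above can be sketched as follows. This is only a minimal illustration in Python using the standard library; the function name `summarise_probe_by_group`, the probe values and the group labels are all made up for the example and are not taken from GSE54536.

```python
import statistics

def summarise_probe_by_group(values, groups):
    """Box-plot style summary of ONE probe's values, split by experimental arm."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)

    summary = {}
    for group, vals in by_group.items():
        # Quartiles computed within the group, not across the pooled samples.
        q1, median, q3 = statistics.quantiles(vals, n=4, method="inclusive")
        summary[group] = {
            "min": min(vals), "q1": q1, "median": median,
            "q3": q3, "max": max(vals),
        }
    return summary

# Hypothetical values for a single probe across eight samples (log2 scale assumed).
probe_values = [7.1, 7.4, 6.9, 7.2, 8.8, 9.1, 8.6, 9.0]
sample_groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

probe_summary = summarise_probe_by_group(probe_values, sample_groups)
for group, stats in probe_summary.items():
    print(group, stats)
```

Repeat this for a handful of probes; what matters is whether the values look roughly symmetric within each arm, not whether the pooled values do.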

In theory, the reason for working with logged data is that fold-changes (multiplicative differences between groups) in the original space correspond to additive differences between groups in the logged data, and linear models estimate additive differences between groups. So if you apply linear models to data that has been logged twice, the coefficients you estimate no longer correspond to a fold-change in the original space. That is why you HAVE to know whether the data has already been transformed.
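A small worked example may make this concrete. The two intensities below are invented numbers, chosen only so that the arithmetic is easy to follow; the sketch is plain Python and has nothing to do with limma's internals.

```python
import math

# Hypothetical raw (unlogged) intensities for one probe in two groups.
group_a_raw = 200.0
group_b_raw = 800.0

# Multiplicative difference in the original space.
fold_change = group_b_raw / group_a_raw  # 4.0

# After one log2 transformation the same difference is additive,
# and exponentiating it recovers the fold-change.
log2_difference = math.log2(group_b_raw) - math.log2(group_a_raw)
recovered_fold_change = 2 ** log2_difference

# If the data are (mistakenly) logged a second time, the difference between
# groups no longer maps back to the original fold-change.
double_log_difference = (
    math.log2(math.log2(group_b_raw)) - math.log2(math.log2(group_a_raw))
)
wrongly_recovered = 2 ** double_log_difference

print(fold_change, recovered_fold_change, wrongly_recovered)
```

The second "recovered" value is nowhere near 4, which is the problem with fitting a linear model to data that has been logged twice.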

Practically, in this dataset - if you're transforming your data into a bunch of missing values, you're probably applying the wrong transformation.

As a rule of thumb, if you look at a microarray dataset and it contains negative values, or the maximum value is much lower than 10000, or the difference between the max and min values is less than a thousand, it's probably been log-transformed already.
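That rule of thumb can be written down as a quick check, roughly like the sketch below. The function names and the example values are hypothetical, and treating "much lower than 10000" as a hard `< 10000` cutoff is my simplification, so read the result as a hint rather than a verdict. The helper `log2_like_r` only mimics what R's `log2` returns for negative and zero inputs, since Python's `math.log2` raises an error instead.

```python
import math

def looks_log_transformed(values, max_cutoff=10000, range_cutoff=1000):
    """Heuristic only: True if any of the three warning signs is present."""
    finite = [v for v in values if not math.isnan(v)]
    has_negative = min(finite) < 0
    low_maximum = max(finite) < max_cutoff
    narrow_range = (max(finite) - min(finite)) < range_cutoff
    return has_negative or low_maximum or narrow_range

def log2_like_r(value):
    """Mimic R's log2: NaN for negatives, -Inf for zero."""
    if value < 0:
        return math.nan
    if value == 0:
        return -math.inf
    return math.log2(value)

# Hypothetical expression values, for illustration only.
already_logged = [-0.3, 2.1, 7.5, 11.2, 13.9]
raw_like = [35.0, 420.0, 5200.0, 48000.0]

logged_flag = looks_log_transformed(already_logged)
raw_flag = looks_log_transformed(raw_like)

# Logging the first vector again turns its negative value into NaN.
n_nan = sum(math.isnan(log2_like_r(v)) for v in already_logged)
print(logged_flag, raw_flag, n_nan)
```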

Good luck with your work. But as another rule of thumb: check the published paper after you've analysed the dataset (correctly) yourself. You'd be surprised how many papers could have been a lot better (as an aside, I have no experience or connection with this dataset or paper, which may well be perfectly good).

A third rule of thumb: all cut-offs are arbitrary. Any downstream analysis you do could be critically dependent upon an arbitrarily set significance threshold at an early step in your analysis, so if you do plan to do GO/KEGG/genefriends/IPA type stuff, cut the data at multiple thresholds and check that your downstream analysis is robust to your arbitrary cutpoints.
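Cutting at several thresholds could look something like the sketch below. The gene names and adjusted p-values are invented for illustration, and `genes_passing` is a hypothetical helper; in practice the adjusted p-values would come from your own limma results table.

```python
def genes_passing(adjusted_p_by_gene, cutoffs):
    """For each cutoff, list the genes whose adjusted p-value is below it."""
    return {
        cutoff: sorted(
            gene for gene, p in adjusted_p_by_gene.items() if p < cutoff
        )
        for cutoff in cutoffs
    }

# Hypothetical adjusted p-values, not real results from any dataset.
adjusted_p_by_gene = {
    "geneA": 0.004,
    "geneB": 0.03,
    "geneC": 0.08,
    "geneD": 0.2,
    "geneE": 0.6,
}

gene_lists = genes_passing(adjusted_p_by_gene, cutoffs=[0.01, 0.05, 0.1, 0.25])
for cutoff, genes in gene_lists.items():
    print(cutoff, len(genes), genes)
```

You would then run the enrichment step on each list and see whether the same terms keep coming up as the cutoff moves.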

Thank you very much for your elaborate reply, it's helping me tremendously to understand precisely what I am and what I should be doing.