Question

A Universal Format For Expression Data?

1

12.8 years ago

Zjk ▴ 10

I read Timtico's answer here.

And I wonder: why is there no universal format for expression data?

For every sample:

geneID  mRNA quantity
gene0   N0
gene1   N1
gene2   N2
...

is what we really want to know.

If I understand correctly, the mapping coverage in RNA-Seq is the Ni. What about microarray data? Can we convert it into this format?
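As a minimal sketch of the per-sample table proposed above, assuming a tab-separated file with one header row (the function name `read_expression_table` and the example values are hypothetical, not part of any existing standard):

```python
import csv
import io


def read_expression_table(handle):
    """Parse a two-column, tab-separated table into {gene_id: quantity}."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the "geneID / mRNA quantity" header row
    table = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        gene_id, quantity = row[0], float(row[1])
        table[gene_id] = quantity
    return table


# Made-up example values standing in for N0, N1, N2.
example = "geneID\tmRNA quantity\ngene0\t12\ngene1\t0\ngene2\t345\n"
sample = read_expression_table(io.StringIO(example))
print(sample)
```

Reading the table is the easy part; the sketch says nothing about what the quantities mean or whether they are comparable between samples.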

gene • 2.6k views

ADD COMMENT • link updated 12.8 years ago by Chris Evelo 10k • written 12.8 years ago by Zjk ▴ 10

1

You have given the answer yourself in your example. You ask for a universal format, which would have to cover all possible use cases. Your example, by contrast, is heavily reduced and simplistic. Think about the myriad interesting questions your format doesn't cover (exons, introns, arbitrary regions, intergenic regions, other molecules such as small RNA, protein expression, discrete counts vs. ratios, the need for sample annotation, ...) and try to add these to your format.

ADD REPLY • link 12.8 years ago by Michael 54k

0

If you think about it, even if the Ns represent coverage or counts, you still need normalization or other procedures before you can compare the values between samples. It's all about statistics!
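To illustrate the point, here is a minimal sketch assuming simple counts-per-million scaling, which is only one of many possible normalization procedures (the function name and the example counts are hypothetical):

```python
def counts_per_million(counts):
    """Scale raw per-gene counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Two made-up samples with the same composition but a tenfold
# difference in sequencing depth.
sample_a = {"gene0": 10, "gene1": 90}
sample_b = {"gene0": 100, "gene1": 900}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
print(cpm_a == cpm_b)
```

The raw counts differ by a factor of ten, yet after scaling the two samples are identical, so comparing the unnormalized Ns directly would be misleading.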

ADD REPLY • link 12.8 years ago by Vitis ★ 2.5k

score 11 · Answer 1 · 2011-07-20

11

12.8 years ago

biobot 0.0.77.a.1099 6.2k

By coincidence, today's XKCD explains why there is no universal format for expression data.

Devising even simple data formats is a complex task and the more general the application, the harder it gets to elegantly incorporate everyone's use-cases.

Most expression microarray experiments measure relative intensities between different conditions/treatments (which we interpret as being related to relative RNA levels), so I would not expect to know an "mRNA quantity" unless I had used spike-in controls. Even then, a microarray experiment is not the way to quantitate RNA.

ADD COMMENT • link 12.8 years ago by biobot 0.0.77.a.1099 6.2k

2

My computer doesn't prefer XML. Just a matter of proper education.

ADD REPLY • link 12.8 years ago by Lyco ★ 2.3k

1

The problem is that biologists want text files but computers prefer XML files.

ADD REPLY • link 12.8 years ago by Pasta ★ 1.3k

score 5 · Answer 2 · 2011-07-20

Well, the XKCD answer mentioned by Keith is good to keep in mind, but of course it is not a reason to give up all efforts to create standards or to improve data and software interoperability.

Keith is right though that microarray data do not easily give absolute expression values and it is almost impossible to compare absolute expression values across experiments and across technologies. Microarray data are indeed more suited for comparisons between conditions than for absolute expression values.

But then you can of course ask zjk's question again. Can't you easily create a standardized format, or at least standardized content, for microarray expression data comparison? In that case you would need to specify which conditions (and thus which groups of arrays) you want to compare. That calls for another type of standardization: the experiment description files. In fact there are interesting efforts in that direction, like the now almost traditional MAGE-TAB format or the newer ISA-Tab, which also covers other types of omics.

The standardized result should be based on standardized data treatment for QC, normalization and such, and should then give you a p-value for the statistics (an ANOVA if you have more than one comparison is a good start) and the various fold changes for the comparisons. This, plus the probeset identifiers (which can easily be resolved to genes, but which keep the raw information), was precisely what we stored as a network-wide standard in NuGO. The same type of information is also provided by Gene Expression Atlas. Since that is a repository of selected good-quality expression studies (from ArrayExpress), with data treatment and result storage in a standardized way, I would suggest using the Gene Expression Atlas data format as the de facto standard for microarray comparison data, unless you have a really good reason to do otherwise. In fact that is what we plan to do in the dbNP project.
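The kind of record described above (probeset identifier, comparison, fold change, p-value) could be sketched roughly as follows. This is only an illustration: the field names, the `ComparisonResult` class and the helper functions are hypothetical and are not the actual NuGO or Gene Expression Atlas format, and the p-value is a placeholder rather than the output of a real test.

```python
import csv
import io
import math
from dataclasses import astuple, dataclass, fields


@dataclass
class ComparisonResult:
    probeset_id: str         # raw probeset identifier, resolvable to a gene later
    comparison: str          # which groups of arrays were compared
    log2_fold_change: float
    p_value: float           # would come from the statistical test (e.g. ANOVA)


def log2_fold_change(treated, control):
    """Log2 ratio of mean normalized intensities (treated over control)."""
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(control) / len(control)
    return math.log2(mean_treated / mean_control)


def write_results(results, handle):
    """Write the results as a tab-separated table with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([field.name for field in fields(ComparisonResult)])
    for result in results:
        writer.writerow(astuple(result))


# Made-up intensities; the p-value of 0.01 is a placeholder, not computed here.
lfc = log2_fold_change([400.0, 400.0], [100.0, 100.0])
result = ComparisonResult("probeset_0001", "treated_vs_control", lfc, 0.01)
buffer = io.StringIO()
write_results([result], buffer)
print(buffer.getvalue())
```

Keeping the probeset identifier rather than a gene name follows the reasoning in the answer: the mapping to genes can be redone later, while the raw information is preserved.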