Question

Forum:ExpressionSet in the age of tidyverse

12

Entering edit mode

5.9 years ago

igor 13k

The ExpressionSet class is supposed to be a standardized data structures to represent genomic data. It was introduced many years ago and was an improvement over the standard R data frames and base R data manipulation methods. However, is it still as big of an improvement today? Ever since I discovered dplyr and the rest of tidyverse, I have been much happier with my R experience. If I encounter an ExpressionSet object, I often end up having to extract the relevant parts, manipulating them externally, and then importing them back in. That workflow seems convoluted and feels like completely negating the whole purpose of the ExpressionSet in the first place. Am I doing it wrong? Are there benefits of ExpressionSet that I am not thinking of?

expressionset R bioconductor • 3.3k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 5.9 years ago by igor 13k

3

Entering edit mode

Benefits other than the fact that a LOT of packages are meant to deal with an ExpressionSet? The only real benefit other than that is memory, since it's going to be (slightly) more memory efficient than the equivalent tidy representation (due to the phenoData being duplicated, though it's presumably a factor so you're just duplicating an integer index).

As a side note, R (like Fortran and matlab) uses column-major ordering, meaning that values in a column of a matrix are next to each other in memory. So it's not like all the values needed for a computation are conveniently layed out next to each other in memory.

ADD REPLY • link 5.9 years ago by Devon Ryan 104k

1

Entering edit mode

I agree a lot of packages use ExpressionSet, but many don't. So you still will need to convert to or from ExpressionSet eventually.

That's interesting about the column-major ordering. That may explain why many base R approaches are column-centric and all genomic matrices have to be transposed (prcomp is probably the most famous example).

ADD REPLY • link 5.9 years ago by igor 13k

2

Entering edit mode

what kinds of manipulations do you find yourself doing to them?

I can somewhat echo your sentiment, but in my case most of the manipulations are due because I want to make use of ggplot, which requires the long, skinny format rather than the matrix format which ExpressionSet has mastered.

I feel the tidyverse has trained people really well to embrace the long format so people (including myself) tend to forget about many awesome, matrix-based functions (e.g., rowMeans and the like). To me tidyverse and ExpressionSet represent two different mindsets, and I agree that the tidyverse should probably come up with a solution for metadata storage, too.

ADD REPLY • link 5.9 years ago by Friederike 8.9k

0

Entering edit mode

I am not sure if any specific manipulations are especially common. Some easy examples: filter(gene_name %in% some_genes) or select(-contains("WT")). Yes, you can rewrite them in base R format, but that feels more awkward to me. Obviously, your ggplot example is even more valid.

Yes, rowMeans is great. But if your data is already in the long format, you can pipe to summarize and then apply any function.

Similarly, if you data is in the long format, you can add metadata in additional columns. It's arguably not very elegant, but it works.

ADD REPLY • link 5.9 years ago by igor 13k

0

Entering edit mode

"if you data is in the long format, you can add metadata in additional columns. It's arguably not very elegant, but it works." One long format tibble cannot include everything from an ExpressionSet object. One solution is to get a table from ExpressionSet, do sth. like stat, plotting, then write all revised information into the called table in an ExpressionSet.

ADD REPLY • link 5.9 years ago by Zhilong Jia ★ 2.2k

0

Entering edit mode

it's not just that storing the metadata in skinny format isn't elegant, it's more space-consuming as the others have alluded to, too. My approach tends to be similar to data bases, using data.tables and so on: I have a small metainfo data.table, similar to what would be stored in phenoData and merge the corresponding information when needed. That is nowhere near the neat structure of ExpressionSet though. For me, storing data in long format often feels like a waste of memory for those reasons, which is why I try to find the optimal solution for every calculation and try to generate the skinny data.tables only on the fly for piping them into ggplot. That being said, I'm definitely guilty of ending up with way too many skinny data.tables at the end of a day.

ADD REPLY • link 5.9 years ago by Friederike 8.9k

1

Entering edit mode

Possibly relevant projects:

ADD REPLY • link 4.5 years ago by mikemc ▴ 10

score 2 · Answer 1 · 2018-05-11

I think this is a really good call.

If you look at how tidygraph is organised, there are some ideas in there that could scale to the manipulation of other multi-table datastructures.

The only real issues are that there are multiple tables that are relevant (when plotting, manipulating etc), that none of these tables (assayData, phenoData and featureData) are tidy, and that naively combining these tables would make a single, giant, but tidy, dataframe.

That is, I could write you a function this morning that takes this:

{assayData = [G x N matrix]; phenoData = [N X P data-frame]; featureData = [G X M data-frame]; each component indexed by rownames and colnames}

into this:

eset_tibble = [(GN) x (P + M + 3) tbl-dataframe]

Like this: eset_tibble <- as_data_frame(eset)

Where in addition to the columns in phenoData and featureData, the latter has a column to indicate each of feature_id, sample_id (ie, the dimnames from the original ExpressionSet) and the expression value for feature f in sample s. That would be unwieldy to work with, would have tons of duplicated information, and would not be useable in bioconductor tools that expect an ExpressionSet. So what value would it have? It woulld be more compatible with ggplot2, dplyr and tidyr; and so you open up a lot of easy plotting and summarisation tools.

That's not a good enough trade-off for me. But if the expressionSet is filtered / subset before converting it into a tidy dataframe then the ggplot-able benefits start to outweigh the possibility of crashing R every time you attempt this.

Given a bit more time, filter and select etc could be written to work with ExpressionSet's directly: I can see a value in using tidyverse functions on bioconductor objects even if the latter don't conform to tidy* standards.

Take filter though. How would it work in this setting? If it's written to manipulate ExpressionSets, then it should return an ExpressionSet. But should it modify G (the number of features) or should it modify N (the number of samples). In a typical use case I might want to extract the data for some genes, and make a boxplot with-respect to some phenotypic factor. In another, I might want to extract a subset of samples and compute a summary over all the features for those samples. in another I might want to filter on both the samples and the features. So filter should be able to modify both G and N (and should modify all tables within an eset to respect the changes). Similarly, select should be able to modify both the columns of featureData and the columns of phenoData according to user requirements.

tidygraph deals with a similar issue. It seems to store graphs as both a nodes data-frame and an edges data-frame. In a workflow, you may want to manipulate both the nodes data-frame (add node-level statistics to it, say) and the edges data-frame (to add or remove edges, or something). To do that you use the same verbs as if you were working with a single data-frame in dplyr. But, tidygraph adds the function activate so that you can tell it whether to manipulate the edges data-frame or the nodes data-frame.

I can see how this approach would be pretty valuable for manipulating ExpressionSets. So the hypothetical user interface would look something like:

my_eset %>%
# Extract a sub-expressionSet related to gene - this should reduce the rows in both assayData and featureData
    activate(features) %>%
    filter(feature_id == "ENSG0000001234567") %>%
# Extract some independent variables for the samples - this should reduce the columns in phenoData
    activate(samples) %>%
    select(my_important_factor, the_covariate_that_makes_it_meaningless) %>%
# Convert the ExpressionSet to a tibble
    as_data_frame() %>%
# Plot it out
    ggplot(aes(x = my_important_factor, y = expression_level)) +
    geom_boxplot() +
    facet_wrap(~ the_covariate_that_makes_it_meaningless)

The above should give the same results as the memory killing (but again, hypothetical):

my_eset %>% 
    as_data_frame() %>%
    filter(feature_id =  "ENSG0000001234567") %>%
    select(feature_id, sample_id, my_important_factor, the_covariate_that_makes_it_meaningless) %>%
    ggplot(aes(x = my_important_factor, y = expression_level)) +
    geom_boxplot() +
    facet_wrap(~ the_covariate_that_makes_it_meaningless)