I'm trying to do a single-sample pathway enrichment analysis with Kallisto/Sleuth. I have 3 control samples and 3 mutated samples. I have good reason to believe that each mutated sample has a large number of genes/pathways that are differentially expressed in that sample only, and that these mask a core set of genes or pathways that are differentially regulated in all 3. I'm interested in both the common set of pathways and the sample-specific ones, so simply comparing 3 control vs 3 mutated samples won't do.
I was thinking of comparing the 3 control samples to the mutated samples one by one, to define the differentially expressed genes specific to each mutated sample. I estimated transcript-level expression with Kallisto, then used Sleuth to aggregate the data at the gene level and run the usual differential expression analysis with 3 controls vs 1 mutated sample, once for each mutated sample. I now have 3 lists of differentially expressed genes. So far so good (even though the results might not be super reliable).
However, I would really like to do a pathway-level analysis with Sleuth instead of the gene-level analysis. As Sleuth works with transcript-level data, I had to supply a transcript -> gene table so that it could aggregate transcript-level data into gene-level data. I can generate a transcript -> pathway table in the same way, for example with MSigDB/Reactome sets. However, many genes are part of several pathways, so the same transcript appears in the table more than once, and Sleuth fails at the aggregation step:
reading in kallisto results
dropping unused factor levels
88212 targets passed the filter
merging in metadata
aggregating by column: pathway
15688 genes passed the filter
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 15599004 rows; more than 4701355 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
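To make the problem concrete, here is a toy sketch in plain Python (not Sleuth's own code, and not R) of how I understand the failure. All transcript, gene, pathway and sample names are made up, and the long-format abundance table is only my assumption about what the join operates on. The point is that once a transcript maps to several pathways, the transcript ID is no longer a unique key, and the join produces more rows than nrow(x)+nrow(i), which is the condition named in the error message.

```python
from collections import Counter

# Hypothetical toy annotation; all names are invented for illustration.
transcript_to_gene = {
    "tx1": "geneA",
    "tx2": "geneA",
    "tx3": "geneB",
}
gene_to_pathways = {
    "geneA": ["pathway1", "pathway2"],  # geneA belongs to two pathways
    "geneB": ["pathway2"],
}


def build_transcript_pathway_table(tx2gene, gene2pw):
    """Return one (transcript, pathway) row per pathway membership."""
    rows = []
    for tx, gene in sorted(tx2gene.items()):
        for pathway in gene2pw.get(gene, []):
            rows.append((tx, pathway))
    return rows


def join_row_count(left_keys, right_keys):
    """Number of rows an inner join on the key would produce."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return sum(n * right[key] for key, n in left.items())


t2p = build_transcript_pathway_table(transcript_to_gene, gene_to_pathways)

# Transcripts that occur more than once in the mapping table.
dup = {tx: n for tx, n in Counter(tx for tx, _ in t2p).items() if n > 1}

# Assumed long-format abundance table: one row per transcript per sample.
samples = ["ctrl1", "ctrl2", "ctrl3", "mut1"]
abundance_keys = [tx for tx in transcript_to_gene for _ in samples]

n_joined = join_row_count(abundance_keys, [tx for tx, _ in t2p])
print(len(t2p), dup, n_joined, len(abundance_keys) + len(t2p))
```

In this toy case the mapping table has duplicated keys for tx1 and tx2, and the joined table is larger than the two inputs combined, so the same check would trip. With real MSigDB/Reactome sets the duplication is of course much heavier.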
I'm trying to figure out what to do with this, and I would appreciate any feedback or comments.
- Is it a reasonable approach at all to compare the 3 control replicates to single mutated samples?
- How would you do the aggregation when genes/transcripts belong to multiple pathways?