Question

kallisto/sleuth for single sample pathway enrichment

6.2 years ago

endre.sebestyen ▴ 10

Hi,

I'm trying to do a single-sample pathway enrichment analysis with Kallisto/Sleuth. I have 3 control samples and 3 mutated samples. I have good reasons to believe that a large number of genes/pathways are differentially expressed in each mutated sample individually, masking a core set of genes or pathways that are differentially regulated in all 3. I'm interested in both the common set of pathways and the sample-specific ones, so simply comparing 3 control vs 3 mutated won't do it.

I was thinking about comparing the 3 control samples to the mutated samples one by one, to define differentially expressed genes specific to each mutated sample. I estimated transcript-level expression with Kallisto, and used Sleuth to aggregate data at the gene level and do the usual differential expression with 3 controls vs 1 mutated sample, once per mutated sample. I now have 3 lists of differentially expressed genes. So far so good (even though the results might not be super reliable).

However, I would really like to do a pathway level analysis with Sleuth instead of the gene level analysis. As Sleuth is working with transcript level data, I had to supply a transcript -> gene table, so it could aggregate transcript level data into gene level data. I can generate a transcript -> pathway table, for example with MSigDB/Reactome sets. However, many genes are part of several pathways, and Sleuth fails at the aggregation step.

reading in kallisto results
dropping unused factor levels
....
normalizing est_counts
88212 targets passed the filter
normalizing tpm
merging in metadata
aggregating by column: pathway
15688 genes passed the filter
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  :
Join results in 15599004 rows; more than 4701355 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
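For context, the message above is the guard against a many-to-many join: the joined table would have more rows than nrow(x)+nrow(i) because each transcript key matches several pathway rows. A minimal sketch of that row-count arithmetic, using made-up toy tables (the sample, transcript and pathway names are hypothetical, and this is not sleuth's internal code):

```python
from collections import Counter

# Hypothetical toy data: one row per (sample, transcript) observation.
observations = [
    ("ctrl1", "tx1"), ("ctrl1", "tx2"),
    ("mut1", "tx1"), ("mut1", "tx2"),
]

# Hypothetical transcript -> pathway table; tx1 belongs to four pathways.
transcript_to_pathway = [
    ("tx1", "pathway_A"), ("tx1", "pathway_B"),
    ("tx1", "pathway_C"), ("tx1", "pathway_D"),
    ("tx2", "pathway_A"),
]

def joined_row_count(observations, mapping):
    """Number of rows an inner join on transcript ID would produce."""
    memberships = Counter(tx for tx, _ in mapping)
    # Every observation is repeated once per pathway its transcript maps to.
    return sum(memberships[tx] for _, tx in observations)

rows = joined_row_count(observations, transcript_to_pathway)
limit = len(observations) + len(transcript_to_pathway)
print(f"joined rows: {rows}, nrow(x)+nrow(i): {limit}")
```

With real annotation, where genes sit in dozens of pathways, the joined table grows far past that limit, which is what the error reports.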

I'm trying to figure out what to do with this, and I would appreciate any feedback or comments.

Is it a reasonable approach at all to compare the 3 control replicates to single mutated samples?
How would you do the aggregation where genes/transcripts belong to multiple pathways?

Thanks,

Endre

RNA-Seq Sleuth Kallisto Pathway enrichment • 3.0k views

updated 6.2 years ago by Rik Verdonck ▴ 50 • written 6.2 years ago by endre.sebestyen ▴ 10

You should not aggregate from the gene level to the pathway level: that does not work, since a gene can be part of many pathways. Instead, you should use gene-set analysis tools. The easiest to use in R is probably gProfileR.
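To illustrate the idea behind gene-set analysis (this is not gProfileR's API, just a sketch of a plain over-representation test), each pathway can be tested separately with a hypergeometric test, so a gene that sits in many pathways is not a problem. The gene names, the background and the function names below are hypothetical:

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N genes, K in pathway, n drawn)."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def pathway_enrichment_p(de_genes, pathway_genes, background):
    """One-sided over-representation p-value for a single pathway."""
    background = set(background)
    de = set(de_genes) & background            # DE genes that were testable
    pathway = set(pathway_genes) & background  # pathway genes that were testable
    overlap = len(de & pathway)
    return hypergeom_sf(overlap, len(background), len(pathway), len(de))

# Hypothetical example: 20 tested genes, a 5-gene pathway, 4 DE genes.
background = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]
de_genes = ["g0", "g1", "g2", "g10"]

p_value = pathway_enrichment_p(de_genes, pathway, background)
print(f"enrichment p-value: {p_value:.4f}")
```

The p-values from all tested pathways would still need multiple-testing correction, and the background should be the genes that passed the expression filter rather than the whole genome.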

6.2 years ago by Kristoffer Vitting-Seerup ★ 4.0k

Yes, that's one of the questions: how to aggregate when a gene belongs to many pathways? :) Lior Pachter tweeted a while ago that they had used kallisto/sleuth for pathway-level analysis. Later they posted a preprint in which they aggregated transcript-level data to do GO enrichment analysis, using sleuth p-values, Lancaster p-value aggregation and BH correction. This is similar to what I want to do, but not exactly the same, and it motivated me to think about pathway aggregation.
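A rough sketch of that kind of p-value aggregation, under stated assumptions: it uses Fisher's method, which is the unweighted special case of Lancaster's method, because that has a closed form in plain Python. It is not the preprint's implementation. The gene sets and p-values are made up, and all p-values are assumed to be strictly greater than 0:

```python
from math import exp, log

def fisher_combine(p_values):
    """Combine p-values with Fisher's method (chi-square, 2k degrees of freedom)."""
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # X/2, where X = -2 * sum(log p)
    # Closed-form chi-square survival function for even degrees of freedom.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, exp(-half_stat) * total)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-gene p-values; geneB is a member of both sets.
gene_p = {"geneA": 0.01, "geneB": 0.04, "geneC": 0.60, "geneD": 0.30}
gene_sets = {"set1": ["geneA", "geneB"], "set2": ["geneB", "geneC", "geneD"]}

set_names = list(gene_sets)
set_p = [fisher_combine([gene_p[g] for g in gene_sets[s]]) for s in set_names]
set_q = benjamini_hochberg(set_p)
for name, p, q in zip(set_names, set_p, set_q):
    print(name, round(p, 4), round(q, 4))
```

Because each set is combined on its own, shared membership needs no join or aggregation table; the overlap only means that the set-level tests are not independent of each other.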

6.2 years ago by endre.sebestyen ▴ 10

score 0 · Answer 1 · 2018-02-22

Hi Endre,

As an answer to your first question: it depends on what you want to know. If your individual mutated samples are of interest to you, then you should certainly compare them to your best guess of a normal expression pattern (i.e. some kind of mean or median of your controls).
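As a minimal sketch of "compare to the median of your controls", one could compute a per-gene log2 fold change of a single mutated sample against the control median. The function name, the pseudocount of 1 and the expression values are assumptions for illustration only:

```python
from math import log2
from statistics import median

def log2fc_vs_control_median(control_values, sample_value, pseudocount=1.0):
    """log2 fold change of one sample against the median of the controls."""
    baseline = median(control_values)
    # The pseudocount keeps the ratio finite when a value is zero.
    return log2((sample_value + pseudocount) / (baseline + pseudocount))

# Hypothetical expression values (e.g. TPM) for two genes.
controls = {"geneA": [10.0, 12.0, 11.0], "geneB": [0.0, 0.0, 1.0]}
mutated_1 = {"geneA": 47.0, "geneB": 0.0}

fold_changes = {
    gene: log2fc_vs_control_median(controls[gene], mutated_1[gene])
    for gene in controls
}
print(fold_changes)
```

This only gives a descriptive effect size: with a single mutated sample there is no within-group variance estimate, so it cannot say on its own whether a change is a chance event.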

If, however, they are not individually interesting, then I don't see why you would individually compare them to the control situation. If, for example, pathway x is "up" in mutated sample 1, but not in the other 2, what would you infer from that? How does it generalize? How do you know it's not a chance event?

What you could perhaps do is use measures of stability or variability in an attempt to demonstrate that, in mutated samples, the variance in expression increases at the pathway level. I don't immediately see how this would work in practice, though. I guess you could start with a principal component analysis on something like eigengenes (check WGCNA) and see if your control samples consistently end up closer together than your mutated ones.

Potential suggestion for your second question: not sure how to proceed with sleuth, but have a look at this. There are ways to check for enrichment of certain pathways (or other groupings of genes) based on statistics like fold change, or p-value, even if the pathways fall in hierarchies and memberships are fuzzy.

Best, Rik