Question

Analysis Without Replicates

1

Entering edit mode

5.4 years ago

stevenmarkgalway ▴ 20

I am carrying out an RNA-seq analysis on mesenchymal stem cells retrieved from disease and control state. However, the exp is without replicates (bad idea I know). The analysis is exploratory, to identify potential factors that can be validated in the lab. So far, I performed transcript quantification using salmon and summarised output to gene level using tximport. So now I have a counts matrix. Just to clarify my only real approach at this point is to rank genes based on a transformed or regularized LogFC?

RNA-Seq Without Replicates • 3.3k views

ADD COMMENT • link 5.4 years ago by stevenmarkgalway ▴ 20

score 2 · Answer 1 · 2018-12-09

2

Entering edit mode

5.4 years ago

Kevin Blighe 87k

EdgeR claims to be able to derive p-values for single-replicate experiments. Obviously, 3 replicates is ideal; better with 4 or 5.

Things to try:

look at max and min expressed transcripts by normalised counts
look at fold-change ratios between the replicates (as you mentioned)

I trust that at least these have been processed in the same batch and in exactly the same way. The last thing that you want is a batch effect with a n=2!

Kevin

ADD COMMENT • link 3.6 years ago by Kevin Blighe 87k

1

Entering edit mode

Thanks for your feedback guys, much appreciated. Yeah, I have absolutely no replicates and yes they have been processed in the same batch. I did try duplicating the data for DESeq2 analysis, but no joy. I think going of estimated p-values for experiment without replicates is prob not a great idea anyway. So I am looking at the fold change for indication of significant difference. I found that when I rank based on log2fold change I have quite large log2fold change values up to as far as +/- 8 and 9, with many highly ranked genes having lower counts compared to lower ranked genes. The ranking seems look a lot better when I apply DESeq's rlog for a regularised logfold change. From here I am applying a 0.6 cut off to obtain genes that have a normal log fold change of 1.5 or greater. From this I have ~200 up-regulated genes. I plan to use signalP to pull out secreted proteins that may be validated in the lab. Also ill be using IPA for pathway analysis, causal network analysis and building context around genes of interest! I am guessing this is the best I can do with what I have.

ADD REPLY • link 5.4 years ago by stevenmarkgalway ▴ 20

2

Entering edit mode

You choose 0.6 presumably because log2(1.5) = 0.5849625 ? Did you try to apply the fold change shrinkage method of DESeq2, too? In any case, it seems like you know what you are doing.

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

0

Entering edit mode

That's exactly right Kevin, it's the Log2(1.5). After reading around I gathered some ideas for what I might be able to do in this situation. I guess being new to this kind of analysis there is a degree of uncertainty on my behalf. I just wanted to see what people thought. Thanks a lot for your input. I'll be sure to check out DESeq2s LFC shrinkage method.

ADD REPLY • link 5.4 years ago by stevenmarkgalway ▴ 20

1

Entering edit mode

In case the shrinkage does not give satisfying results and you have to use unshrunken FCs, I would suggest to use a decent read cutoff to avoid high fold changes due to small counts ("mean-variance relationship").

ADD REPLY • link 5.4 years ago by ATpoint 81k

0

Entering edit mode

It seems I can't run the LFC shrinkage without first running the DESeq function, which in-turn requires replicates - "Treating samples as replicates was deprecated in v1.20" ~ output from DESeq2. So I will stick with the regularised log fold change. I was considering a high read count cutoff, but I absolutely have no idea how high or low I should draw the cutoff. I really hate asking for advice on choosing arbitrary cutoffs. However, I was thinking of summing up the rows across the two samples and applying a cutoff of 50. So genes that have a summed read count greater than 50 across the two samples would be retained for downstream analysis. I seen this approach in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4934518/ . Would a summed read count cutoff of 50 be "decent"? I really am in the dark with this one.

ADD REPLY • link 5.4 years ago by stevenmarkgalway ▴ 20

1

Entering edit mode

There is no standard cut-off at all. All that you need to ensure is that you remain transparent about it in your methods. Could be 20, 30, 50, ..., 5, ..., 100, ...

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

1

Entering edit mode

Sounds like a good plan, especially with the log2FC shrinkage - just remember log2FC != pvalues

ADD REPLY • link 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

Older DESeq2 version were able to do it. Unfortunately they removed that ability from recent versions. Still as Kevin said those aren't recommended. You could try duplicating the data and then get some results, but still it will be bogus.

ADD REPLY • link 5.4 years ago by piyushjo ▴ 700

3

Entering edit mode

Don’t ever duplicate results to get pseudoreplicates. See miRNA-seq DE analysis with duplicate replicates. Am i correct?

ADD REPLY • link 5.4 years ago by ATpoint 81k

1

Entering edit mode

Didn't mean to suggest that as the correct method, but rather a method to get top genes to validate using qPCR. That data is not publishable with pseudoreplicates.

ADD REPLY • link 5.4 years ago by piyushjo ▴ 700

2

Entering edit mode

Still, I do not think that creating pseudoreplicates gives you anything meaningful because you artificially create data without variance and therefore excessive false-positives and probably also false fold-changes. For qPCR candidates, I would use those that have reasonable fold log2FCs and proper mean expression as discussed above.

ADD REPLY • link 5.4 years ago by ATpoint 81k

0

Entering edit mode

OK something that I am trying to get my head around. Given this scenario: I have a read count for GeneA for both control and case samples (no replicates). Gene A(control) has a read count =2, GeneA(case) has a read count=1000. That is approximately a 9 fold increase for GeneA(case) at the log2scale (which is massive). So when you say "I would suggest to use a decent read cutoff to avoid high fold changes due to small counts" you are not referring to the above scenario. However, your are referring to a scenario where Gene B(control) has a read count =2, GeneB(case) has a read count=30, which returns a relatively high Log2fold change ~4, based on low counts?

ADD REPLY • link 5.4 years ago by stevenmarkgalway ▴ 20

1

Entering edit mode

Exactly. The situation with 2 vs 1000 is probably/maybe a true biological effect, but 2 vs. 30 is rather the usual noise that comes from low abundances.

ADD REPLY • link 5.4 years ago by ATpoint 81k