Help with RNA-seq analysis
5.3 years ago
mlai2567 • 0

Hello all,

I'm relatively new to bioinformatics, and am trained primarily as a molecular biologist, so please bear with me if my questions seem quite rudimentary.

I've received a dataset of RNA-sequence data that has already been converted from its native form to an Excel spreadsheet. It contains data from 2 conditions, one experimental and one control. Each is done in replicate, and both the raw values and the log transformed values are present. I'm trying to utilize this data to compile a list of differentially expressed genes, so my questions are as follows:

1) Is there a way in Excel to isolate the differentially expressed genes from the large dataset that I have now? Can Excel statistically test the difference in expression of each gene, so that I keep only the genes with p<0.05?

2) From this dataset, is there a way in Excel to isolate the up-regulated and the down-regulated genes separately?

Again, I apologize if these questions are basic, and would greatly appreciate any assistance from the community.

RNA-Seq rna-seq sequencing • 4.0k views
ADD COMMENT
5

Channeling Pierre Lindenbaum's spirit,

via @tim_yates

ADD REPLY
0

Although I understand R is probably a much more efficient method to do the tasks I've described, my experience with R and programming in general is quite limited. In the long term, I'm aiming to become proficient in the language, and would thus in the interim appreciate any suggestions for me to conduct the analyses on Excel.

ADD REPLY
5

Please don't. You cannot reproduce anything you do on Excel, and we cannot give you specific instructions to be implemented on Excel. What you have is a matrix of numbers, most probably, and you cannot run statistical analyses on Excel. Import the data into R and people here will be able to help you much better. Plus, all of your analyses can be recorded, reproduced and debugged. Excel is the worst possible way to do an analysis of this magnitude.

ADD REPLY
3

Sorry to say, but don't even try in Excel. RNA-seq analysis is quite simple from the user's perspective because excellent standard software provides the necessary statistical framework, but that software is implemented in R. If you lack the knowledge, I recommend either working your way into it, e.g. by following the DESeq2 guide and searching the web for R help, or collaborating with an experienced R bioinformatician. Putting together custom solutions in Excel is not recommended, especially if you are not an expert in statistics.

ADD REPLY
0

I see. Thank you for the replies @ATpoint and @RamRS. I am currently looking into DESeq2. How would you guys suggest that I proceed from here? Will I have to train myself in R or should I jump to looking at the DESeq tutorial? My PI has asked for data analysis relatively soon.

ADD REPLY
1

You should learn R as you're working on DESeq2. Once you learn how to read data into a data.frame in R and then subset it by picking rows or columns, you can do anything in R that you can in Excel. DESeq2 might have its own objects, so the tutorial should walk you through that.
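
To make "read a table, then subset it by picking rows" concrete, here is a minimal sketch. It is written in Python with only the standard library purely for illustration; the same filter is a row subset on an R data.frame. The table contents and the column names (`gene`, `log2fc`, `padj`) are made up and are not the output format of any particular tool.

```python
import csv
import io

# Hypothetical results table; gene names, values and column names are assumptions.
RESULTS_TSV = """gene\tlog2fc\tpadj
geneA\t2.1\t0.001
geneB\t-1.7\t0.020
geneC\t0.3\t0.600
geneD\t-0.9\t0.049
"""


def split_by_direction(handle, alpha=0.05):
    """Return (up, down) gene lists for rows with adjusted p-value below alpha."""
    up, down = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["padj"]) >= alpha:
            continue  # not significant at the chosen cutoff
        fold_change = float(row["log2fc"])
        if fold_change > 0:
            up.append(row["gene"])    # higher in the experimental condition
        elif fold_change < 0:
            down.append(row["gene"])  # lower in the experimental condition
    return up, down


up, down = split_by_direction(io.StringIO(RESULTS_TSV))
print("up-regulated:", up)
print("down-regulated:", down)
```

For a real file, replace the `io.StringIO(...)` argument with an opened file handle. The cutoff of 0.05 simply mirrors the threshold mentioned in the question.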

Most of bioinformatics is done in R or Python; you'll almost never need MATLAB.

ADD REPLY
0

Additionally, would I be able to perform the analyses in MATLAB, or is R preferred?

ADD REPLY
0

Have you tried using SeqGeq?

ADD REPLY
0

No, I haven't. Is it an open source software?

ADD REPLY
0

Doesn't look open source or even free.

ADD REPLY
0

mlai2567, try BRB-ArrayTools to analyze RNA-seq data in Excel. Link to the BRB-ArrayTools documentation: https://brb.nci.nih.gov/BRB-ArrayTools/Documentation.html

ADD REPLY
1
5.3 years ago

Is there a way in Excel where I can isolate the differentially expressed genes from the large dataset that I have now? and only isolate the ones which have a difference in expression of p<0.05?

You can do grossly obvious things like t-tests and ratios of means, but Excel does not understand the nature of the sources of error in an RNA-Seq experiment, so it can't produce a good p-value for those differences.

It's not at all clear to me that logging the raw counts is an adequate library normalization. Yet another reason to find software that is designed to work with RNAseq.
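
A small sketch of why a plain log is not a library normalization. The counts below are made up: the treated sample has the same composition as the control but was sequenced twice as deep. Counts-per-million (CPM) is used here only as one simple library-size scaling, not as what DESeq2 does internally (DESeq2 estimates size factors with a median-of-ratios approach). The pseudocount of 1 is also an assumption.

```python
import math

# Made-up raw counts: same composition, but treated_1 has twice the depth.
raw = {
    "control_1": {"geneA": 100, "geneB": 300, "geneC": 600},
    "treated_1": {"geneA": 200, "geneB": 600, "geneC": 1200},
}


def log2_cpm(counts, pseudocount=1.0):
    """Scale counts to counts-per-million of the library, then log2-transform."""
    library_size = sum(counts.values())
    return {
        gene: math.log2(count / library_size * 1e6 + pseudocount)
        for gene, count in counts.items()
    }


plain_log = {s: {g: math.log2(c) for g, c in genes.items()} for s, genes in raw.items()}
scaled_log = {s: log2_cpm(genes) for s, genes in raw.items()}

for gene in raw["control_1"]:
    plain_diff = plain_log["treated_1"][gene] - plain_log["control_1"][gene]
    scaled_diff = scaled_log["treated_1"][gene] - scaled_log["control_1"][gene]
    print(gene, "plain log2 difference:", round(plain_diff, 3),
          "log2 CPM difference:", round(scaled_diff, 3))
```

With the plain log, every gene looks doubled although nothing changed except sequencing depth; after scaling by library size the apparent difference disappears.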

With an experimental set-up as simple as yours, you could try importing your data into Galaxy, and use DESeq there.

ADD COMMENT
1

I definitely agree that there is value in learning to code in R in the long term.

However, I have some disagreements about dismissing the most direct calculations (such as doing a t-test on log-transformed values):

1) I think the accuracy of the RNA-Seq methods (particularly with a limited number of replicates) has some limitations. So, especially for groups with duplicates or triplicates, I think it is very important to test different methods for each project.

2) I think having independently calculated log-transformed expression values is important to assess the p-value calculations. For example, I have seen situations where one method can miss a gene that is clearly differentially expressed (but in a way that varies between projects - so, you can't pick one "best" method for all projects), and I would be concerned that you may not notice this if you use the normalization from the program (if that caused the weird result). Also, I like to be able to visualize expression in samples not used for differential expression (like "validation" for a limited number of samples), and I think having something calculated outside the differential expression program can be helpful for that.

That said, I think the t-test will be less sensitive than edgeR / DESeq2 / limma-voom. So, if you don't have a clear expression change, the lack of difference with the standard log-transformed expression may be a false negative (so, in that situation, I would agree with the concern about the t-test on log-transformed values). However, on the flip side, if you want to decrease sensitivity, it may be worth considering a more standard test on log-transformed expression.
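
For reference, the "standard test on log-transformed expression" amounts to something like the sketch below for a single gene. The values are made up, three replicates per group is an assumption, and only the log2 fold change and the Welch t statistic are computed. Turning the statistic into a p-value needs the t distribution from a statistics library (for example `scipy.stats.ttest_ind` with `equal_var=False`), which is deliberately left out to keep this standard-library only.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic for two groups of log-scale values (unequal variances)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up log2 expression values for one gene.
control = [5.0, 5.2, 4.8]
treated = [7.0, 7.3, 6.7]

# On the log2 scale, a difference of means is the log2 fold change.
log2_fold_change = mean(treated) - mean(control)
t_statistic = welch_t(treated, control)

print("log2 fold change:", round(log2_fold_change, 3))
print("Welch t statistic:", round(t_statistic, 3))
```

With only two or three replicates the per-gene variance estimate is very noisy, which is one reason this test tends to be less sensitive than methods that share information across genes.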

Also, "thank you" for the Galaxy suggestion: I am currently trying to figure out what to suggest to people to give them more autonomy in analysis when they don't have much coding experience (while also allowing them to test different open-source programs for each project).

ADD REPLY
0

Thank you for the replies! I'm currently trying to use the Galaxy platform to perform my DESeq analysis. However, I am unsure how to convert my Excel spreadsheet into a suitable file for Galaxy analysis, and how to set up the correct parameters. Would someone kindly be able to guide me through this process?

ADD REPLY
2

This is getting to simple file format territory now. Please use Google to look for terms like "Excel to tab-separated plain text" or "Excel to comma separated plain text", or "Excel to " < whatever format Galaxy wants >.

If things are not convertible straight away (for example, if you need a bed file), look up the format specifications for the particular format and you'll see that it is a delimited file of some sort with each field containing a specific data item, and you should be able to do that yourself.
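
As a rough sketch of such a conversion, assuming the sheet was first saved from Excel as CSV: the code below keeps the gene identifier and the raw count columns and writes them tab-separated. The column names are hypothetical, and which columns and layout Galaxy actually expects should be checked against the tool's own help page.

```python
import csv
import io


def csv_to_tsv(src, dst, keep_columns):
    """Copy the chosen columns from a CSV handle to a tab-separated handle."""
    reader = csv.DictReader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(keep_columns)  # header row
    for row in reader:
        writer.writerow([row[column] for column in keep_columns])


# Made-up export with raw counts plus one log-transformed column to drop.
csv_text = (
    "gene,ctrl_1,ctrl_2,trt_1,trt_2,log_ctrl_1\n"
    "geneA,10,12,30,28,3.32\n"
    "geneB,5,7,6,5,2.32\n"
)

out = io.StringIO()
csv_to_tsv(io.StringIO(csv_text), out, ["gene", "ctrl_1", "ctrl_2", "trt_1", "trt_2"])
tsv_text = out.getvalue()
print(tsv_text)
```

For real files, open the input and output with `open(path, newline="")` instead of `io.StringIO`. The log-transformed column is dropped here because count-based tools such as DESeq2 expect raw counts.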

Please try to get ahold of a bioinformatician around you.

ADD REPLY
2

mlai2567 : You can check Galaxy RNAseq analysis training available at:

Or this video tutorial

ADD REPLY
0
5.3 years ago
mlai2567 • 0

I've come across the following web-based application which is fitting my purposes nicely: https://gallery.shinyapps.io/DEApp/

A question to you guys. What are the primary differences between edgeR, limma-voom, and DESeq? Is one more widely used than the others? For publication, would reviewers prefer any one method?

ADD COMMENT
1

My recommendation would be to test each method (edgeR / limma-voom / DESeq2) for each project.

If the gene counts are similar, it might not matter too much which method you use. However, I think you'll find differences a non-trivial percent of the time. For per-project method evaluations, I think the size of the gene lists and/or visualization with an independently calculated expression value can help you determine what is most appropriate for your particular project.
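
One simple way to look at the size and overlap of the gene lists is with plain sets, sketched below. The gene names and the lists themselves are made up; in practice each set would hold the genes that one method called significant at your chosen cutoff.

```python
# Hypothetical significant-gene lists, one per method.
calls = {
    "edgeR": {"geneA", "geneB", "geneC"},
    "DESeq2": {"geneA", "geneB", "geneD"},
    "limma-voom": {"geneA", "geneC"},
}

# Genes called by every method, and genes called by at least one.
shared = set.intersection(*calls.values())
any_method = set.union(*calls.values())

# Genes called by exactly one method.
unique = {
    method: genes - set.union(*(g for other, g in calls.items() if other != method))
    for method, genes in calls.items()
}

for method, genes in calls.items():
    print(method, "list size:", len(genes), "unique:", sorted(unique[method]))
print("called by all methods:", sorted(shared))
print("called by any method:", sorted(any_method))
```

The genes that only one method reports are the ones worth plotting with an independently calculated expression value before deciding which result to trust.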

Your downstream results should also inform decisions about what upstream processing steps are used for publications (but that depends upon biological knowledge about your area of interest, and is therefore something that is more difficult for me to give guidance about).

ADD REPLY
0

Have you tried searching online? edgeR vs DESeq2 or edgeR vs limma voom would give you starting points to read up on.

ADD REPLY
0

From what I've gathered at this point, it seems to me that each platform performs essentially the same task using different statistical assumptions?

Given that each method will probably produce slightly different results, is it necessary for publication to run all of the methods and rationalize their differences, or would one method be sufficient?

ADD REPLY
1

Well, you need to invest time in understanding your data and your experimental setup, then in looking into the differences among the tools, and then in making a choice that you can justify in a publication.

ADD REPLY
