Question

RNASeq TCGA data

1


6.6 years ago

elizabethR ▴ 70

Hey

I have been analysing some of the RNASeq data from TCGA. I have been using total RNA files from RNASeqV2 on the advice of my supervisor. However there are also other types of RNASeq, including exons, splice junctions and isoforms. I've not been able to find somewhere that explains what you can do with this data and how you can use it. Can anyone enlighten me? My supervisor is reluctant to discuss it, saying I don't need it but I'd just really like to understand what the entire dataset consists of! Thanks in advance :)

RNA-Seq TCGA • 2.5k views


0


Thank you. To date I have been using the raw (aligned) counts from the rsem.genes (non-normalised) RNASeqV2 total RNA data files. I had assembled them from cancer and normal tissue samples to do DE analysis using edgeR... is that OK?

Reply • 6.6 years ago by elizabethR ▴ 70

1


Updated 9th March, 2019

If you start with non-normalised RSEM counts, then I would import these to DESeq2 via tximport. There is information on this process in the DESeq2 vignette.

If you want to start with the normalised RSEM counts, then may I direct you to this published manuscript in BMC Genomics: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5461714/

In their methods, they have the following paragraph:

Next, public data for lung squamous cell carcinoma (LUSC RNA-seq v2 dataset) from The Cancer Genome Atlas (TCGA) Research Network (http://cancergenome.nih.gov) was used as a reference dataset. This dataset, generated by highly standardized procedures with thorough quality control, contains RNA-seq data from 51 normal and 502 tumour samples. Due to the high number of samples, around 70% of all genes were found significantly differentially expressed (FDR < 0.01) after limma analysis with voom correction [38]. Therefore, only the top significant genes were considered as reference. Here, limma was used instead of edgeR as raw counts were not accessible via the TCGA repository at the time.

So, for RNA-seq v2 / RSEM, they say that they preferred limma because they did not have access to the raw counts. Like DESeq2, edgeR requires raw counts as input, not counts that have already been normalised by some other method, such as those in the *.rsem.genes.normalized_results file.

In conclusion:

If you start with the normalised RSEM counts, use limma to perform the differential expression analysis directly on these, and refer to that BMC Genomics manuscript to support your choice.
If you start with the non-normalised RSEM counts, import these to DESeq2 via tximport.
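
For the non-normalised route, the per-sample files first have to be combined into a single gene-by-sample matrix, whichever package is used afterwards. In R, tximport takes care of this; the Python sketch below only illustrates the underlying step and is not a replacement for it. It assumes that each file is tab-separated with a header containing `gene_id` and `raw_count` columns, so check this against your own files. The function names and the toy data are hypothetical.

```python
import csv
import io


def read_counts(handle, id_col="gene_id", count_col="raw_count"):
    """Return {gene_id: estimated count} from one tab-separated per-sample file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[count_col]) for row in reader}


def build_matrix(samples):
    """Combine per-sample counts into a gene-by-sample matrix.

    samples maps a sample name to an open file handle.
    Only genes present in every sample are kept.
    """
    per_sample = {name: read_counts(handle) for name, handle in samples.items()}
    names = list(per_sample)
    shared = set.intersection(*(set(counts) for counts in per_sample.values()))
    genes = sorted(shared)
    # RSEM estimates can be fractional; integer-count tools need them rounded.
    matrix = [[round(per_sample[name][gene]) for name in names] for gene in genes]
    return genes, names, matrix


# Toy data standing in for two per-sample files (values are made up).
tumour = io.StringIO(
    "gene_id\traw_count\tscaled_estimate\ttranscript_id\n"
    "GENE_A\t10.4\t0.1\ttx1\n"
    "GENE_B\t250.0\t0.9\ttx2\n"
)
normal = io.StringIO(
    "gene_id\traw_count\tscaled_estimate\ttranscript_id\n"
    "GENE_A\t3.7\t0.2\ttx1\n"
    "GENE_B\t0.0\t0.0\ttx2\n"
    "GENE_C\t5.0\t0.8\ttx3\n"
)

genes, names, matrix = build_matrix({"tumour_1": tumour, "normal_1": normal})
for gene, row in zip(genes, matrix):
    print(gene, row)
```

With real data, the sample names would normally come from the TCGA barcodes, which also tell you which samples are tumour and which are normal.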

There is also a good article here, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728800/, but they don't appear to mention RNA-seq v2.

Reply • 5.1 years ago by Kevin Blighe 87k

score 0 · Answer 1 · 2017-09-16

Hey Elizabeth,

Yes, RNA-seq V2 produces a few different files - these relate to the RSEM method. See here for further information: https://wiki.nci.nih.gov/display/tcga/rnaseq+version+2

Unless you are aiming to do exon- or transcript isoform-specific analyses, I would just use the file:

*.rsem.genes.normalized_results

This contains normalised expression over genes. Thus, technically, you should be able to start conducting statistical tests and applying other downstream functions on them right away.
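
As a rough illustration of what a first downstream step might look like: normalised counts are usually strongly right-skewed, so a common choice (my assumption here, not something the TCGA files prescribe) is to log2-transform them with a pseudocount of 1 before comparing groups. The sketch below is Python with made-up values, and the function names are hypothetical.

```python
import math


def log2_plus_one(values):
    """log2(x + 1) transform; the pseudocount keeps zero counts defined."""
    return [math.log2(v + 1.0) for v in values]


def mean(values):
    return sum(values) / len(values)


def log2_fold_change(tumour, normal):
    """Difference of mean log2(x + 1) expression for one gene, tumour minus normal."""
    return mean(log2_plus_one(tumour)) - mean(log2_plus_one(normal))


# Made-up normalised counts for a single gene.
tumour_values = [7.0, 7.0]
normal_values = [1.0, 1.0]
print(log2_fold_change(tumour_values, normal_values))
```

A fold change on its own says nothing about significance; for that, the transformed values would go into a proper test or into limma.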

The open-access data for the TCGA also has RNA-seq (non-RSEM / 'version 1') for many projects, and the raw HTSeq counts should be available. These can readily be input into DESeq2 or edgeR.
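
If you go down the HTSeq route, one thing to watch is the summary rows at the bottom of each counts file. The Python sketch below assumes the usual htseq-count layout: two tab-separated columns, no header, and summary rows whose names start with `__` (for example `__no_feature`). Those rows are not genes and should be dropped before differential expression analysis. The function name and the toy data are hypothetical.

```python
import io


def read_htseq_counts(handle):
    """Return {feature_id: count}, skipping htseq-count summary rows."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        feature, value = line.split("\t")
        if feature.startswith("__"):
            # Summary rows such as __no_feature or __ambiguous are not genes.
            continue
        counts[feature] = int(value)
    return counts


# Toy data standing in for one counts file (values are made up).
example = io.StringIO(
    "GENE_A\t12\n"
    "GENE_B\t0\n"
    "__no_feature\t340\n"
    "__ambiguous\t5\n"
)
counts = read_htseq_counts(example)
print(counts)
```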