Question

How to find a particular GEO dataset or cohort in cBioPortal and TCGA

6.2 years ago

CuriusScientist ▴ 50

I am very new to the field of cancer genomics, so this might be a very naive query.

I am interested in finding the dataset from the study "Whole-Genome Gene Expression Profiling of Formalin-Fixed, Paraffin-Embedded Tissue Samples". The data were submitted to GEO (accession number GSE17599). However, I was not able to find this dataset on cBioPortal. I thought cBioPortal might not yet have added the data from TCGA, but I could not find it on TCGA either.

Is there any way to find data by GEO accession number in cBioPortal, TCGA, GDAC Firehose, etc.? Are there any other resources where I can find such datasets?

An R package or script that does this automatically would be very helpful.

Thanks in advance

TCGA cBioportal GEO R GDAC • 2.9k views

ADD COMMENT • link 6.2 years ago by CuriusScientist ▴ 50

Perhaps you can better define what type of data you are looking for, and why you want to use cBioPortal: to analyze cancer data, to visualize it, or to download it? Note that cBioPortal does not necessarily overlap with GEO. In any case, you can also find and analyze cancer data here: https://xenabrowser.net/datapages/

ADD REPLY • link 6.2 years ago by roy.granit ▴ 880

I can see that the link you mentioned lists the cohort GDC TCGA Prostate Cancer (PRAD) (11 datasets) and the cohort TCGA Prostate Cancer (PRAD) (22 datasets).

What I want to know is how I can find out which datasets make up those 11 and 22 datasets. If I am not wrong, some of them, if not all, must be part of GEO.

Moreover, why is there a difference between the number of datasets in GDC TCGA Prostate Cancer (PRAD) and TCGA Prostate Cancer (PRAD)?

ADD REPLY • link 6.2 years ago by CuriusScientist ▴ 50

"what I want to know is how can I find what datasets are part of those 11 and 22 datasets? If I am not wrong some of them, if not all, must be part of GEO."

TCGA data is not in GEO.

"why is there difference between number of datasets in GDC TCGA Prostate Cancer (PRAD) and cohort: TCGA Prostate Cancer (PRAD)?"

The number of datasets is different because the two cohorts are pulled from different sources. More importantly, the number of samples is basically the same: for RNA-seq, one has 550 samples and the other has 551. One sample was probably removed at some point, potentially due to QC issues.
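If you want to see which sample accounts for the difference, a set comparison of the two sample-ID lists is enough. Below is a minimal Python sketch. The IDs are made-up placeholders, not real TCGA barcodes, and the `normalize` step (truncating to the first 15 characters) is only an assumption about how the two sources might format their barcodes; check the actual IDs before relying on it.

```python
def normalize(sample_id: str) -> str:
    # Assumption: the two sources differ only by a trailing vial letter,
    # so the first 15 characters identify the sample in both.
    return sample_id.strip().upper()[:15]


def compare_sample_ids(ids_a, ids_b):
    """Return the samples shared by both lists and those unique to each."""
    a = {normalize(s) for s in ids_a}
    b = {normalize(s) for s in ids_b}
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Placeholder IDs for illustration only.
cohort_a = ["TCGA-AA-0001-01A", "TCGA-AA-0002-01A"]
cohort_b = ["TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0003-01"]

diff = compare_sample_ids(cohort_a, cohort_b)
print("Only in cohort B:", diff["only_b"])
```

In practice you would fill the two lists from the header row of the expression matrices downloaded from each cohort.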

ADD REPLY • link 6.2 years ago by igor 13k

That clears some of my doubts

ADD REPLY • link 6.2 years ago by CuriusScientist ▴ 50

Many thanks to @igor for clarifying my doubts :)

How do you think I should proceed with the following task? I need to:

find mutations in public databases,
compare them with in-house experimental results, and
perform the analysis.

Can I combine data/results from cBioPortal with datasets that are not part of cBioPortal, such as GSE17599?

ADD REPLY • link 6.2 years ago by CuriusScientist ▴ 50

Note that GSE17599 holds gene expression data, not sequencing data, so it cannot be used to predict mutations.
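Since the series lives in GEO rather than in cBioPortal, the simplest route is to fetch it from GEO by accession. In R, the Bioconductor package GEOquery does this with `getGEO("GSE17599")`. If you prefer a script, here is a rough Python sketch that only builds the download URL from the accession. It assumes the usual GEO FTP layout and the single-platform file name; series with several platforms use a different matrix file name, so treat the result as a starting point and check the series directory if the download fails.

```python
import re

GEO_FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def geo_series_dir(accession: str) -> str:
    """Build the GEO FTP directory URL for a series accession such as GSE17599."""
    match = re.fullmatch(r"GSE([1-9]\d*)", accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = match.group(1)
    # GEO groups series into buckets of 1000:
    # GSE17599 -> GSE17nnn, GSE123 -> GSEnnn
    bucket = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_FTP_ROOT}/{bucket}/GSE{digits}"


def geo_series_matrix_url(accession: str) -> str:
    """URL of the series matrix file, assuming a single-platform series."""
    directory = geo_series_dir(accession)
    name = directory.rsplit("/", 1)[1]
    return f"{directory}/matrix/{name}_series_matrix.txt.gz"


print(geo_series_matrix_url("GSE17599"))
```

The printed URL can then be passed to `urllib.request.urlretrieve` or any download tool.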

If you are looking for mutations in cancer, this is a good place to start: http://cancerhotspots.org/?ref=labworm#/download. Alternatively, you can download MAF files from TCGA or other studies.

Next, I assume you will want to predict mutations in your own data, so you might want to check out this post

Basically, you cannot load your own data into cBioPortal, but there are some functions that you can run on your own data - see here
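Once you have a MAF file, comparing it with in-house calls is mostly a matter of set operations. Here is a minimal Python sketch, assuming the MAF has the `Hugo_Symbol` and `HGVSp_Short` columns (GDC-style MAFs do, but other flavors may name the protein-change column differently). The rows and the in-house list below are illustrative placeholders, not real results.

```python
import csv
import io


def read_maf_mutations(handle):
    """Collect (gene, protein change) pairs from a MAF-formatted text stream."""
    # MAF files may start with '#' comment lines before the header row.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        (row["Hugo_Symbol"], row["HGVSp_Short"])
        for row in reader
        if row.get("HGVSp_Short")  # skip rows without a protein change
    }


# Tiny illustrative MAF; in practice use open("your_file.maf") instead.
EXAMPLE_MAF = (
    "#version 2.4\n"
    "Hugo_Symbol\tHGVSp_Short\tTumor_Sample_Barcode\n"
    "TP53\tp.R175H\tSAMPLE-1\n"
    "KRAS\tp.G12D\tSAMPLE-2\n"
    "TP53\tp.R175H\tSAMPLE-3\n"
)

public = read_maf_mutations(io.StringIO(EXAMPLE_MAF))
in_house = {("KRAS", "p.G12D"), ("BRAF", "p.V600E")}  # hypothetical in-house calls

shared = public & in_house          # seen in both
only_in_house = in_house - public   # not found in the public MAF

print("Shared:", sorted(shared))
print("Only in-house:", sorted(only_in_house))
```

Matching on gene plus protein change is a simplification; matching on genomic coordinates and alleles is stricter but requires both datasets to use the same genome build.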

ADD REPLY • link 6.2 years ago by roy.granit ▴ 880