Matched tumor/normal RNASeq expression counts in TCGA
2.4 years ago
ariel ▴ 250

I would like to download matched tumor/normal expression data from TCGA. I would also like sample data such as age, sex, race, etc.

I looked at the files

Mutations: mc3.v0.2.8.PUBLIC.maf.gz
RNA: EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv

found here: https://gdc.cancer.gov/about-data/publications/pancanatlas

The RNA file is a matrix of expression data with rows->genes, and cols->barcodes. The maf file has the mutation data (kinda like vcf). It has columns for normal barcodes and tumor barcodes.

I was hoping I could match those to the columns in the expression data. However, each of the two barcode columns in the maf file contains roughly the same number of barcodes as the expression data has columns, and there is no intersection between them.
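One likely reason for the empty intersection: a full TCGA aliquot barcode encodes the analyte (D for DNA, R for RNA), the plate, and the center, so the DNA aliquots listed in a maf file never equal the RNA aliquots used as expression column names, even for the same sample. A minimal sketch of comparing on the shared prefix instead is below. The barcodes are made up for illustration, and the helper names `participant` and `sample` are hypothetical.

```python
# TCGA barcode layout: TCGA-TSS-Participant-SampleVial-PortionAnalyte-Plate-Center
# First 12 characters identify the participant, first 15 the sample (with type code).

def participant(barcode: str) -> str:
    """Participant-level prefix, e.g. 'TCGA-AA-0001'."""
    return barcode[:12]

def sample(barcode: str) -> str:
    """Sample-level prefix including the two-digit sample type, e.g. 'TCGA-AA-0001-01'."""
    return barcode[:15]

# Made-up barcodes for illustration only (not taken from the real files).
maf_tumor_barcodes = [
    "TCGA-AA-0001-01A-01D-0001-01",  # DNA aliquot (analyte D)
    "TCGA-AA-0002-01A-01D-0001-01",
]
expr_columns = [
    "TCGA-AA-0001-01A-01R-0002-07",  # RNA aliquot (analyte R)
    "TCGA-AA-0003-01A-01R-0002-07",
]

# Comparing full aliquot barcodes finds nothing in common.
full_overlap = set(maf_tumor_barcodes) & set(expr_columns)

# Comparing at the sample level recovers the shared sample.
sample_overlap = {sample(b) for b in maf_tumor_barcodes} & {sample(b) for b in expr_columns}

print(full_overlap)
print(sample_overlap)
```

Truncating to 12 characters with `participant` matches at the patient level instead, which is the more useful choice when the maf normal is blood-derived and the expression sample is tissue.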

Moreover, when I jointly count case barcode and sample type, there is only one sample type per person. So no tumor/normal pairs.

In the GDC data access portal, I tried creating a manifest like this

https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.access%22%2C%22value%22%3A%5B%22open%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20Counts%22%2C%22STAR%20-%20Counts%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Gene%20Expression%20Quantification%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D

But it doesn't have a project selector, and I end up with >20k files, which is more than it will put in a manifest.
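The filter in that URL is URL-encoded JSON, so one way to cut the file count is to add a project clause to the JSON and re-encode it. The sketch below assumes that `cases.project.project_id` is the field the portal uses for projects and uses `TCGA-BRCA` only as an example project; the helper names are hypothetical, and the portal may not accept this URL format indefinitely.

```python
import json
from urllib.parse import quote, unquote

def in_clause(field, values):
    """One 'in' clause in the same shape as the clauses in the portal URL."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}

def build_filters(project_id):
    """Same filters as the manifest URL above, plus a project restriction."""
    return {
        "op": "and",
        "content": [
            # Assumed field name for the project selector.
            in_clause("cases.project.project_id", [project_id]),
            in_clause("files.access", ["open"]),
            in_clause("files.analysis.workflow_type", ["HTSeq - Counts", "STAR - Counts"]),
            in_clause("files.data_type", ["Gene Expression Quantification"]),
            in_clause("files.experimental_strategy", ["RNA-Seq"]),
        ],
    }

def repository_url(filters):
    """Percent-encode the compact JSON and append it to the repository URL."""
    encoded = quote(json.dumps(filters, separators=(",", ":")), safe="")
    return "https://portal.gdc.cancer.gov/repository?filters=" + encoded

url = repository_url(build_filters("TCGA-BRCA"))  # example project only
print(url)
```

Generating one URL per project keeps each manifest well under the file limit, at the cost of downloading several manifests.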

This gives >15k cases https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20Counts%22%2C%22STAR%20-%20Counts%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Gene%20Expression%20Quantification%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D&searchTableTab=cases

I've also looked at the ISB-CGC BigQuery tables.

https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery.html

I found the expression data and case data. These are pretty much the same as the pancan download files. Again, there is only one "SampleType" for each "ParticipantBarcode" or "CaseBarcode", so are there any tumor/normal pairs at all?

I'm stumped.

tumor TCGA RNASeq expression normal • 912 views

I am having a similar issue. I am searching for a master file that links the aliquot barcodes (the column names in "EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv") to sample-level data such as "type" (primary tumor vs. adjacent normal).
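Part of that link is already in the barcode: the two digits after the participant ID are the sample type code, where 01 to 09 are tumor types and 10 to 19 are normal types (for example 01 is Primary Solid Tumor and 11 is Solid Tissue Normal). A rough sketch of decoding the column names and listing participants that have both a tumor and a normal column is below. The barcodes are made up, the label table is deliberately partial, and `decode` and `matched_pairs` are hypothetical helper names.

```python
from collections import defaultdict

# Partial table of TCGA sample type codes; extend as needed.
SAMPLE_TYPE_LABELS = {
    "01": "Primary Solid Tumor",
    "06": "Metastatic",
    "10": "Blood Derived Normal",
    "11": "Solid Tissue Normal",
}

def decode(barcode):
    """Split an aliquot barcode into participant, sample type code, label and group."""
    fields = barcode.split("-")
    code = fields[3][:2]  # "01A" -> "01"
    number = int(code)
    group = "tumor" if number < 10 else "normal" if number < 20 else "control"
    return {
        "participant": "-".join(fields[:3]),
        "code": code,
        "label": SAMPLE_TYPE_LABELS.get(code, "other"),
        "group": group,
    }

def matched_pairs(barcodes):
    """Return participants that have at least one tumor and one normal barcode."""
    grouped = defaultdict(lambda: defaultdict(list))
    for barcode in barcodes:
        info = decode(barcode)
        grouped[info["participant"]][info["group"]].append(barcode)
    return {
        person: dict(groups)
        for person, groups in grouped.items()
        if "tumor" in groups and "normal" in groups
    }

# Made-up column names for illustration only.
columns = [
    "TCGA-AA-0001-01A-01R-0002-07",
    "TCGA-AA-0001-11A-01R-0002-07",
    "TCGA-AA-0003-01A-01R-0002-07",
]
pairs = matched_pairs(columns)
print(pairs)
```

Running `matched_pairs` on the header row of the expression matrix would show directly how many participants have both sample types in that file.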

2.2 years ago
Zhenyu Zhang ★ 1.2k

With very few exceptions (less than 1%), TCGA cases do not have normal RNA-Seq.
