TCGA CNV SNP 6.0 tumor data files: more than one set for the same submitter ID
2.3 years ago
ncafung • 20

I am conducting CNV analysis on TCGA Level 3 SNP 6.0 data. For a few of the downloaded tumor samples, I found more than one seg file associated with the same TCGA submitter ID. For example:

TCGA-44-2656-01A CUTCH_p_TCGAb_355_37_52_NSP_GenomeWideSNP_6_H10_1376764.nocnv_grch38.seg
TCGA-44-2656-01A EGGAR_p_TCGAb33and37_SNP_N_GenomeWideSNP_6_H04_585228.nocnv_grch38.seg
TCGA-44-2656-01A HILLY_p_TCGA_b90_wRedos_SNP_N_GenomeWideSNP_6_A04_748062.nocnv_grch38.seg

The first file has 415 rows, while the second and third files have 197 and 271 rows, respectively. All three files report segment mean values for chromosomes 1 to 22 and X.

In this situation, what factors should I consider when selecting one of the three files for my downstream analysis?
If I decide to combine the seg data from the three files instead, should I merge the chromosome regions that overlap fully or partially among them?

The UUIDs of the three files above, in the same order, are: 89327245-3da1-4a96-bee3-5b84ae43401a, f19650bb-8ead-490b-9f91-d7c4b06bfe6b, 0e6071db-c44c-4958-ab95-087d44620893

Is there a way to find out more about the three samples behind these UUIDs to help with the file selection?

14 months ago
National Institutes of Health, Bethesda…

Yes, digging through these can be challenging and I do not have a direct answer for you without looking at other data. You might try this code to help determine what is "under the hood" for these file IDs. In R:

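# These packages need to be installed first
library(GenomicDataCommons)
library(listviewer)
library(magrittr)  # provides %>% in case it is not already attached
# Query the GDC files endpoint for the three file UUIDs,
# requesting every available metadata field
res = files() %>% 
  filter(~ file_id %in% c("89327245-3da1-4a96-bee3-5b84ae43401a", 
    "f19650bb-8ead-490b-9f91-d7c4b06bfe6b",
    "0e6071db-c44c-4958-ab95-087d44620893")) %>% 
  select(available_fields('files')) %>% 
  results_all()
# Open an interactive viewer on the nested result
jsonedit(res)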

This will open a viewer where you can interactively investigate the (very deeply) nested details of each file's metadata. In RStudio, this looks like:

[Screenshot: the jsonedit viewer in RStudio]

You'll notice that these files are all derived from primary tumor, but from different aliquots. The histopath description for each associated slide implies that the sample is 85% tumor. As with any copy number data, you might need to go back further in the pipeline to define quality metrics that could help.
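
As a rough starting point before going further back in the pipeline, the seg files themselves can be summarized side by side. The sketch below is only an illustration: it uses small made-up tables rather than the real files, assumes the seg files carry Chromosome, Start, End, Num_Probes and Segment_Mean columns, and uses a hypothetical helper, summarize_seg, with an arbitrary |Segment_Mean| cutoff of 0.2 for calling a segment altered. None of this comes from the GDC pipeline itself.

```r
# Toy stand-ins for two seg files. Real files could be loaded with
# read.delim("<file>.nocnv_grch38.seg"); the column names are an assumption.
seg_a <- data.frame(
  Chromosome   = c("1", "1", "2"),
  Start        = c(1, 1000001, 1),
  End          = c(1000000, 3000000, 2000000),
  Num_Probes   = c(500, 900, 1100),
  Segment_Mean = c(0.02, 0.65, -0.40)
)
seg_b <- data.frame(
  Chromosome   = c("1", "2"),
  Start        = c(1, 1),
  End          = c(3000000, 2000000),
  Num_Probes   = c(1400, 1100),
  Segment_Mean = c(0.05, -0.10)
)

# Hypothetical helper: basic per-file summaries for comparing aliquots.
# `threshold` is an arbitrary cutoff, not a recommended value.
summarize_seg <- function(seg, threshold = 0.2) {
  len     <- seg$End - seg$Start + 1          # segment length in bases
  altered <- abs(seg$Segment_Mean) > threshold
  data.frame(
    n_segments       = nrow(seg),
    total_probes     = sum(seg$Num_Probes),
    fraction_altered = sum(len[altered]) / sum(len),
    weighted_mean    = sum(seg$Segment_Mean * len) / sum(len)
  )
}

# One summary row per file, with the list names used as row names
summary_table <- do.call(
  rbind,
  lapply(list(file_a = seg_a, file_b = seg_b), summarize_seg)
)
print(summary_table)
```

Large differences between aliquots in segment count or in the fraction of the genome called altered would be a reason to look more closely at the underlying arrays before picking one file or merging them.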


Thanks! I got the same output as you showed above. As a newbie, I would like to share that you will need the following libraries before you can execute the R script above.

library(devtools)

Then install and load GenomeInfoDb:

source('https://bioconductor.org/biocLite.R')
biocLite("GenomeInfoDb")
library(GenomeInfoDb)

biocLite('Bioconductor/GenomicDataCommons')

# If it was not installed the first time, force a reinstall

biocLite('Bioconductor/GenomicDataCommons', force = TRUE)

# Check whether it installed properly

GenomicDataCommons::status()

library(GenomicDataCommons)

Then load the following two libraries:

library(listviewer)
library(magrittr)

Then execute the R script provided by Sean above.

You can find more up-to-date information on GenomicDataCommons at the URL below:

https://github.com/Bioconductor/GenomicDataCommons


Thanks for the great details. One minor adjustment: GenomicDataCommons is now in Bioconductor, so biocLite('GenomicDataCommons') is the preferred way to install it.
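
For readers on newer R versions (3.5 and later), biocLite has since been superseded by the BiocManager package, so the biocLite commands in this thread may no longer work as written. A minimal sketch of the equivalent installation follows; missing_packages and install_missing are hypothetical helper names introduced here for illustration, while BiocManager::install is the real installer and handles both Bioconductor and CRAN packages.

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install whatever is missing via BiocManager
install_missing <- function(pkgs) {
  needed <- missing_packages(pkgs)
  if (length(needed) == 0) return(invisible(character(0)))
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(needed)
  invisible(needed)
}

# Usage (needs network access, so it is left commented out):
# install_missing(c("GenomicDataCommons", "listviewer", "magrittr"))
```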


One more detail...

You will need R version 3.4 or above for GenomicDataCommons to install properly.

I had version 3.3.3 originally, and the initial installation attempt failed.
