Question

Problem matching file names to patient IDs in TCGA

2


5.9 years ago

nazaninhoseinkhan ▴ 520

Hi all,

I have downloaded all the CNV files for one cancer type from the GDC portal.

I also have the clinical data for all patients, but I cannot map the file names to the submitter IDs.

A file name is something like "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt", while a submitter ID is something like "TCGA-DJ-A2QA".

Can anybody guide me on how to match these two names?

Thank you in advance

Nazanin

TCGA CNV problem in matching • 5.6k views

ADD COMMENT • link updated 3.2 years ago by HB ▴ 30 • written 5.9 years ago by nazaninhoseinkhan ▴ 520

0


Could you give us an example, for one patient, of what you have downloaded, with links and/or pictures, please?

ADD REPLY • link 5.9 years ago by Bastien Hervé 5.3k

0


I have downloaded all the CNV files using the TCGA2bed software.

These are some of the CNV files that were downloaded: "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt",

"AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A02_735476.hg18.seg.txt"

I want to map these files to the clinical file that I downloaded previously.

The clinical file contains only the submitter ID and the patient ID.

ADD REPLY • link 5.9 years ago by nazaninhoseinkhan ▴ 520

0


Sample names might be inside the text files. Did you check the headers of the files?

ADD REPLY • link 5.9 years ago by cpad0112 21k

0


Hi,

No, the header just includes the results, something like this:

"Sample Chromosome Start End Num_Probes Segment_Mean
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406 1 51598 9250000 4679 0.0076
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406 1 9250070 9324990 55 0.5138"

ADD REPLY • link 5.9 years ago by nazaninhoseinkhan ▴ 520

0


Do you have the annotations.txt file that comes with the CNV files?

In this file you will find the entity_id, which can also be found in the clinical files.

ADD REPLY • link 5.9 years ago by Bastien Hervé 5.3k

0


Hi, yes, I have also downloaded the annotation file. However, it does not include the names of the CNV files, so I cannot use it for matching. The following is the header of the annotation file:

"category   classification  entity_type created_datetime    annotation_id   case_submitter_id   project/project_id  entity_submitter_id id
Alternate sample pipeline   Notification    case    2012-11-13T00:00:00 29ba39af-b266-547a-b2c9-7795eba2e202    TCGA-AB-2822    TCGA-LAML   TCGA-AB-2822    29ba39af-b266-547a-b2c9-7795eba2e202
History of unacceptable prior treatment related to a prior/other malignancy Notification    case    2014-06-16T00:00:00 3d086829-de62-5d08-b848-ce0724188ff0    TCGA-AG-A014    TCGA-READ   TCGA-AG-A014    3d086829-de62-5d08-b848-ce0724188ff0
Center QC failed    CenterNotification  aliquot 2012-07-20T00:00:00 5cf05f41-ce70-58a3-8ecb-6bfaf6264437    TCGA-13-0913    TCGA-OV TCGA-13-0913-02A-01R-1564-13    5cf05f41-ce70-58a3-8ecb-6bfaf6264437
History of unacceptable prior treatment related to a prior/other malignancy Notification    case    2014-06-16T00:00:00 c53f22b1-677b-5528-a438-39d5390e2c68    TCGA-21-1077    TCGA-LUSC   TCGA-21-1077    c53f22b1-677b-5528-a438-39d5390e2c68
"

ADD REPLY • link updated 5.9 years ago by GenoMax 141k • written 5.9 years ago by nazaninhoseinkhan ▴ 520

0


Along with your AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt you have an annotation file in which you can find an entity_id. I think in this case it is 29ba39af-b266-547a-b2c9-7795eba2e202, which should correspond to the case_id in your clinical file.

This still needs to be checked.

ADD REPLY • link 5.9 years ago by Bastien Hervé 5.3k

0


The problem is that I have downloaded the CNV files for 507 patients with TCGA2bed. I know that I can find the patient or submitter ID via GDC, but I cannot do this manually for all 507 cases, so I am looking for a way to find the corresponding patient or submitter IDs automatically.

In other words, I want to find the patient or submitter ID based on "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt". The annotation file has no column that contains any part of this "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt" name.

ADD REPLY • link 5.9 years ago by nazaninhoseinkhan ▴ 520

0


Which commands did you use?

ADD REPLY • link 5.9 years ago by Bastien Hervé 5.3k

0


TCGA2bed is a graphical tool in which you can select between annotation and experiment. After selecting the tumor type, you have to select the type of data: CNV, RNASeq, ...

ADD REPLY • link 5.9 years ago by nazaninhoseinkhan ▴ 520

0


As I don't know this API and it's not open source, I can't really help you more. Your CNV files contain sample names, so you can try to get a list of them.

I also found this R package (https://cran.r-project.org/web/packages/TCGAretriever/TCGAretriever.pdf), which I think lets you query the TCGA database with your list of sample names.

Or you can try to contact the authors of this publication (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1419-5)
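
For the first suggestion (getting a list of the sample names from the CNV files), a minimal sketch in R could look like the following. It assumes the files are tab-delimited with the header quoted earlier in the thread; the helper name collect_sample_names and the demo directory are made up for illustration.

```r
# Hypothetical helper: collect the unique values of the "Sample" column
# from every *.seg.txt file found in one directory.
collect_sample_names <- function(seg_dir) {
  seg_files <- list.files(seg_dir, pattern = "\\.seg\\.txt$", full.names = TRUE)
  samples <- lapply(seg_files, function(f) {
    # Assumption: tab-delimited text with a header line
    seg <- read.delim(f, stringsAsFactors = FALSE)
    unique(seg$Sample)
  })
  data.frame(
    file_name = rep(basename(seg_files), lengths(samples)),
    sample = unlist(samples),
    stringsAsFactors = FALSE
  )
}

# Self-contained demo: write the two rows quoted in the thread to a temp file
demo_dir <- file.path(tempdir(), "seg_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_name <- "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406"
demo_rows <- c("1\t51598\t9250000\t4679\t0.0076",
               "1\t9250070\t9324990\t55\t0.5138")
writeLines(
  c("Sample\tChromosome\tStart\tEnd\tNum_Probes\tSegment_Mean",
    paste(demo_name, demo_rows, sep = "\t")),
  file.path(demo_dir, paste0(demo_name, ".hg18.seg.txt"))
)

sample_table <- collect_sample_names(demo_dir)
print(sample_table)
```

The resulting table keeps the file name next to each sample name, so it can be used as a lookup list for whichever query tool is tried next.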

ADD REPLY • link 5.9 years ago by Bastien Hervé 5.3k

0


Hi,

I'm getting an error:

 Error in UseMethod("filter") : 
  no applicable method for 'filter' applied to an object of class "c('gdc_files', 'GDCQuery', 'list')"

Can anyone assist please?

ADD REPLY • link 3.2 years ago by HB ▴ 30

0


Please show the actual code that produced the error.

ADD REPLY • link 3.2 years ago by Kevin Blighe 87k

score 4 · Answer 1 · 2018-06-05

4


5.9 years ago

Kevin Blighe 87k

Edit: the original function was written by Bioinfo (via Sean Davis' blog) for translating file UUIDs into TCGA barcodes (see C: Sample names for TCGA data from GDC-legacy archive). The function below translates file names into TCGA barcodes.

A manual lookup of 507 samples is not that bad, if the desire is really there to get the work done. I have done manual lookups of >1000 TCGA samples back when there were no automated services.

The one solution that I thought would work was this function:

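library(GenomicDataCommons)
library(magrittr)

# Translate GDC file names into TCGA sample barcodes
TCGAtranslateID = function(file_names, legacy = TRUE) {
  # Query the GDC 'files' endpoint for records matching the given file names.
  # filter() and select() are namespace-qualified so that dplyr, if it is
  # loaded, cannot mask the GenomicDataCommons versions.
  info = files(legacy = legacy) %>%
    GenomicDataCommons::filter( ~ file_name %in% file_names) %>%
    GenomicDataCommons::select('cases.samples.submitter_id') %>%
    results_all()

  # Extract the sample barcode(s) from the nested 'cases' field of each record
  id_list = lapply(info$cases, function(a) {
    a[[1]][[1]][[1]]
  })

  # A file may map to several barcodes, so its ID is repeated once per barcode
  barcodes_per_file = sapply(id_list, length)

  # Note: row.names = file_names assumes one result row per input name,
  # returned in the same order as the input
  return(
    data.frame(
      file_id = rep(ids(info), barcodes_per_file),
      submitter_id = unlist(id_list),
      row.names = file_names))
}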

TCGAtranslateID('AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt')

Output

                                 file_id                               submitter_id
AMAZE_p_TCGASNP_b86_87...seg.txt 6352ceaf-99f4-4b74-94a2-dc5e405543f0  TCGA-BJ-A0Z9-01A
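
The barcode returned here (TCGA-BJ-A0Z9-01A) is at sample level, while the clinical file lists patient-level submitter IDs such as TCGA-DJ-A2QA. In TCGA barcodes the first 12 characters identify the patient, so one possible way to join the two is sketched below. Both data frames are toy stand-ins, and the column names case_submitter_id and in_clinical are assumptions, not something the function returns.

```r
# Toy translation result, shaped like the output shown above
translated <- data.frame(
  file_id = "6352ceaf-99f4-4b74-94a2-dc5e405543f0",
  submitter_id = "TCGA-BJ-A0Z9-01A",
  stringsAsFactors = FALSE
)

# Toy clinical table; real clinical files may name the ID column differently
clinical <- data.frame(
  case_submitter_id = c("TCGA-BJ-A0Z9", "TCGA-DJ-A2QA"),
  in_clinical = TRUE,
  stringsAsFactors = FALSE
)

# Keep the first 12 characters of the sample barcode (the patient-level part)
translated$case_submitter_id <- substr(translated$submitter_id, 1, 12)

# Left join: every translated file keeps its row, matched or not
merged <- merge(translated, clinical, by = "case_submitter_id", all.x = TRUE)
print(merged)
```

Using all.x = TRUE keeps files whose patient is absent from the clinical table, which makes unmatched cases easy to spot as NA values.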

ADD COMMENT • link 4.1 years ago by Kevin Blighe 87k

0


Hi Kevin,

Thank you so much. It worked.

I'll never forget your help.

ADD REPLY • link 5.9 years ago by nazaninhoseinkhan ▴ 520

0


No problem. This function should also accept an entire vector of filenames, like:

c("filename1", "filename2", "filename3", "filename4",...)

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0


Hi Kevin, I ran the code successfully.

However, I have run into another problem. I have 1026 file names, but only 1020 IDs were found.

Moreover, I did not get the file names (AMAZE_p_TCGASNP_b86_87...seg.txt) in the results, so I cannot map them back to my original input. The results include file_id (fda02baa-b6ba-47cd-88d2-20bd14a193a4), submitter_ID (fda02baa-b6ba-47cd-88d2-20bd14a193a4) and a third column (TCGA-BJ-A0Z9-01A).

Do I have to include "file_name" in "return(data.frame(file_id=rep(ids(info),barcodes_per_file), submitter_id=unlist(id_list), row.names=file_names))"?
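
One way to see which input names came back without an ID is to compare the input vector with the file names in the result. The sketch below assumes the result table carries a file_name column, which the function as posted does not request from the GDC; all names and barcodes in it are placeholders.

```r
# Placeholder input vector and result table (not real TCGA identifiers)
file_names <- c("file_A.seg.txt", "file_B.seg.txt", "file_C.seg.txt")
result <- data.frame(
  file_name = c("file_C.seg.txt", "file_A.seg.txt"),
  submitter_id = c("TCGA-XX-0003-01A", "TCGA-XX-0001-01A"),
  stringsAsFactors = FALSE
)

# Input names for which no record was returned
missing_names <- setdiff(file_names, result$file_name)
print(missing_names)

# Re-order the result to follow the input; unmatched names become NA rows
aligned <- result[match(file_names, result$file_name), ]
aligned$file_name <- file_names
rownames(aligned) <- NULL
print(aligned)
```

Matching on an explicit file_name column avoids relying on the order in which records are returned.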