Tutorial: TCGA UUIDs to TCGA barcode (Sample ID) in R
7.5 years ago

Update, 12th May 2018

Since this post was made, a rapid way to interrogate the GDC data for the purposes of converting UUIDs to TCGA Barcodes was found using R Programming Language. See the thread here: Sample names for TCGA data from GDC-legacy archive

Kevin,
Moderator.

For those not familiar with the command line or with the JSON query language, here is a fairly simple way to map UUIDs to TCGA barcodes using R and a canned command in the terminal.

The first part is in R

1) Extract the file IDs from your manifest file (the one you get from the GDC when you download your data)

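setwd("C:/Here/your/manifest/directory")

manifest = "gdc_manifest_20160921_171519.txt"  # manifest file name
x = read.table(manifest, header = TRUE, stringsAsFactors = FALSE)
manifest_length = nrow(x)                      # number of files to query
id = toString(sprintf('"%s"', x$id))           # comma-separated, double-quoted file IDs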
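
For readers who would rather do step 1 outside R, here is a minimal Python sketch of the same idea. It assumes the manifest is tab-separated and has an `id` column (which is what `x$id` above relies on). The function name `read_manifest_ids` and the example rows are made up for illustration.

```python
import csv
import io


def read_manifest_ids(handle):
    """Return the values of the 'id' column from a tab-separated manifest."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["id"] for row in reader if row.get("id")]


# In-memory stand-in for a manifest; the IDs and other columns are placeholders.
example_manifest = io.StringIO(
    "id\tfilename\tmd5\tsize\tstate\n"
    "uuid-0001\tfile_a.txt\tabc\t10\tlive\n"
    "uuid-0002\tfile_b.txt\tdef\t20\tlive\n"
)
file_ids = read_manifest_ids(example_manifest)
print(file_ids)
```

With a real manifest, pass an open file handle (for example `open("gdc_manifest_20160921_171519.txt", newline="")`) instead of the in-memory example.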

2) Create Payload.txt with the commands needed

These commands are adapted from the GDC API documentation: https://gdc-docs.nci.nih.gov/API/Users_Guide/Search_and_Retrieval/

Part1= '{"filters":{"op":"in","content":{"field":"files.file_id","value":[ '
Part2= '] }},"format":"TSV","fields":"file_id,file_name,cases.submitter_id,cases.case_id,data_category,data_type,cases.samples.tumor_descriptor,cases.samples.tissue_type,cases.samples.sample_type,cases.samples.submitter_id,cases.samples.sample_id,cases.samples.portions.analytes.aliquots.aliquot_id,cases.samples.portions.analytes.aliquots.submitter_id","size":'
Part3= paste0('"', manifest_length, '"}')  # size in double quotes so the payload is valid JSON
Sentence= paste0(Part1, id, Part2, Part3)  # assemble the full JSON query
writeLines(Sentence, "Payload.txt")
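
Assembling JSON by pasting strings together is easy to get wrong, so here is a hedged Python sketch that builds the same payload as a dictionary and lets the `json` module handle the quoting. The field list is copied from `Part2` above; the function name `build_payload` and the example IDs are assumptions for illustration only.

```python
import json

# Field list copied from Part2 of the R code.
FIELDS = [
    "file_id", "file_name", "cases.submitter_id", "cases.case_id",
    "data_category", "data_type", "cases.samples.tumor_descriptor",
    "cases.samples.tissue_type", "cases.samples.sample_type",
    "cases.samples.submitter_id", "cases.samples.sample_id",
    "cases.samples.portions.analytes.aliquots.aliquot_id",
    "cases.samples.portions.analytes.aliquots.submitter_id",
]


def build_payload(file_ids):
    """Build the query as a dict; size is set to the number of IDs."""
    ids = list(file_ids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": ids},
        },
        "format": "TSV",
        "fields": ",".join(FIELDS),
        "size": str(len(ids)),
    }


# Placeholder IDs; in practice these come from the manifest.
payload_text = json.dumps(build_payload(["uuid-0001", "uuid-0002"]))
print(payload_text)
```

Writing `payload_text` to `Payload.txt` gives a file that can be passed to curl in the same way as the one produced by the R code.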

The second part is in the command line (CMD or terminal)

cd C:/Here/your/manifest/directory
curl --request POST --header "Content-Type: application/json" --data @Payload.txt "https://api.gdc.cancer.gov/files" > File_metadata.txt

Now you should have a file called File_metadata.txt in your working folder with all the data you need.

If you get a message:

'curl' is not recognized as an operable program or batch file.

you need to install cURL on your computer (if you don't know how to do it, follow this link)
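
Once File_metadata.txt exists, the UUID-to-barcode lookup is a matter of reading two columns from a tab-separated file. The sketch below is one way to do that in Python. The column names in the returned file are not shown in this post, so they are passed in as arguments; `file_id` and `barcode` in the example, along with the function name and the sample row, are hypothetical. Check the header line of your own file for the real names.

```python
import csv
import io


def map_ids_to_barcodes(handle, id_column, barcode_column):
    """Return a dict mapping each file UUID to its barcode."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: row[barcode_column] for row in reader}


# In-memory stand-in for File_metadata.txt; header names and values are made up.
example_tsv = io.StringIO(
    "file_id\tbarcode\n"
    "uuid-0001\tTCGA-XX-0001-01A\n"
    "uuid-0002\tTCGA-XX-0002-01A\n"
)
lookup = map_ids_to_barcodes(example_tsv, "file_id", "barcode")
print(lookup["uuid-0001"])
```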


Hi, I tried this method to retrieve the sample IDs, but it failed. The error in File_metadata.txt is: { "message": "400 Bad Request: The browser (or proxy) sent a request that this server could not understand." }

How can I fix it? Thanks.


Can you post your code?


Another answer is here: A: Sample names for TCGA data from GDC-legacy archive. Also check the blog of Seán Davis.


Thank you. It worked for my prostate cancer RNA-seq data.


Thanks for the post, it was very useful.


Thanks for this convenient solution!

I observed two small issues in my implementation.

1) The current URL for this search should be "https://api.gdc.cancer.gov/files" rather than "https://gdc-api.nci.nih.gov/files". With the latter I got "curl: (6) Could not resolve host: gdc-api.nci.nih.gov; Unknown error".

2) In the R script, the line

Part3= paste(shQuote(manifest_length),"}",sep="")

wraps the size in single quotes, which caused an error in my search. After the following modification, it worked for me.

Part3= paste0("\"",manifest_length, "\"", "}") #just change single quote ' to double quote "

Thank you very much!
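
The quoting issue in point 2 can be checked without contacting the server: a value wrapped in single quotes is not valid JSON, while the double-quoted form is. The short Python sketch below only tests whether each fragment parses. It does not show how the GDC server itself reacts, and the helper name `is_valid_json` is made up for illustration.

```python
import json

# The tail of the payload as produced with single quotes vs. double quotes.
single_quoted = '{"size":' + "'3'" + "}"
double_quoted = '{"size":"3"}'


def is_valid_json(text):
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(single_quoted))
print(is_valid_json(double_quoted))
```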

7.3 years ago
Chun-Jie Liu ▴ 280

GDC provides an API that can be queried from the command line with curl or HTTPie to retrieve information by UUID.

I wrote a simple Python script for mapping UUIDs to TCGA barcodes (submitter IDs). Just give it the manifest file downloaded from the GDC Data Portal. By default it queries the latest version, and legacy archive UUIDs cannot be converted through the latest version. You can change the endpoint to your version yourself.

files_endpt = "https://gdc-api.nci.nih.gov/<version>/legacy/<endpoint>"


Hi, I get the following error when I use it:

Traceback (most recent call last):
  File "m2s.py", line 76, in <module>
    main()
  File "m2s.py", line 73, in main
    run(args.manifest)
  File "m2s.py", line 69, in run
    gdcAPI(file_ids, manifest)
  File "m2s.py", line 62, in gdcAPI
    response = requests.post(files_endpt, json = params)
  File "//anaconda/lib/python2.7/site-packages/requests/api.py", line 88, in post
    return request('post', url, data=data, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'json'

Thanks,


Please try Python 3 and install the requests module.

Or you can use the R version.
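
The traceback above shows requests being loaded from a python2.7 site-packages directory, and as far as I know the `json=` keyword is only accepted by requests 2.4.2 and later, so an old installation is a likely cause. One way to sidestep the dependency is to build the POST with the standard library. The sketch below is not the m2s.py script; it only prepares the request without sending it, and the function name `build_request` and the example parameters are assumptions for illustration.

```python
import json
import urllib.request

# Endpoint reported as current elsewhere in this thread.
FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_request(params):
    """Serialize params to JSON and wrap them in a POST request object."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        FILES_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder query; the file ID is not a real UUID.
params = {
    "filters": {
        "op": "in",
        "content": {"field": "files.file_id", "value": ["uuid-0001"]},
    },
    "format": "TSV",
    "fields": "file_id,cases.samples.submitter_id",
    "size": "1",
}
request = build_request(params)
print(request.get_method(), request.full_url)
```

To actually send it, pass the object to `urllib.request.urlopen(request)` and read the response body.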


I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar.


Hi, I tried with python3 but I am getting the same error:

Traceback (most recent call last):
  File "m2s.py", line 76, in <module>
    main()
  File "m2s.py", line 73, in main
    run(args.manifest)
  File "m2s.py", line 69, in run
    gdcAPI(file_ids, manifest)
  File "m2s.py", line 62, in gdcAPI
    response = requests.post(files_endpt, json = params)
  File "//anaconda/lib/python2.7/site-packages/requests/api.py", line 88, in post
    return request('post', url, data=data, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'json'

Any help is much appreciated.

Thanks

5.1 years ago
david.peeney ▴ 30

I had a hard time getting a number of these answers to work. Eventually I used the following sequence to convert legacy UUIDs to full barcodes and harmonized UUIDs:

First, download the JSON manifest files for your selected study and file types from the GDC legacy archive (NOT the GDC portal).

Then, use the following R script:

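library(dplyr)     # bind_rows()
library(jsonlite)  # fromJSON()
# JSON file downloaded from the GDC legacy archive
legacy = fromJSON(txt = "~/Downloads/metadata.cart.2019-03-07 (1).json")
legfnames = legacy[["file_id"]]             # legacy file UUIDs
entities = legacy[["associated_entities"]]  # one table of entities per file
# Stack the per-file entity tables into a single data frame
IDconversion = bind_rows(entities, .id = "column_label")
# Assumes one associated entity per file, so the rows line up with the file IDs
IDconversion['legacy file names'] = legfnames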
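
The same flattening can be sketched in Python with only the standard library. The keys `file_id` and `associated_entities` come from the R script above; the inner keys `entity_id` and `entity_submitter_id` are my assumption about what each entity record contains, and the function name and example record are invented for illustration. Inspect your own JSON file to confirm the key names.

```python
import json


def entities_to_rows(metadata):
    """Return one row per (file, associated entity) pair."""
    rows = []
    for record in metadata:
        for entity in record.get("associated_entities", []):
            rows.append({
                "file_id": record["file_id"],
                "entity_id": entity.get("entity_id"),            # assumed key
                "barcode": entity.get("entity_submitter_id"),    # assumed key
            })
    return rows


# Stand-in for the downloaded JSON; all values are placeholders.
example_metadata = json.loads("""
[
  {"file_id": "uuid-0001",
   "associated_entities": [
     {"entity_id": "ent-1", "entity_submitter_id": "TCGA-XX-0001-01A"}
   ]}
]
""")
rows = entities_to_rows(example_metadata)
print(rows)
```

Unlike the last line of the R script, this keeps the file ID attached to every entity row, so it also works when a file has more than one associated entity.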

Thanks David. Using the JSON file from Legacy Archive is indeed one tried and tested solution that I have also used.

4 weeks ago
aUser ▴ 30

Great solutions on this page. The old API address has been replaced with a new one, so you need to use the new link for the commands to work:

Old link: https://gdc-api.nci.nih.gov/files

New link: https://api.gdc.cancer.gov/files
