I'm completely new to the gdc.cancer.gov portal and I need some help (I saw there are similar questions on this forum).
What I need to do is to download Gene Expression quantification data (using HTSeq-FPKM-UQ) for breast cancer and use these data to classify cancer subtypes (luminal A, B, HER2-like, basal-like).
To retrieve the labels I basically have 2 options (feel free to add more):
1) Get the sample ID in the 'old' TCGA-barcode format (e.g. "TCGA-AR-A1AL-01") and use a dictionary, which I downloaded from an old article using the same data, that maps each barcode directly to a subtype.
The problem here is that I have no idea how to get the TCGA-barcode format, and it looks like the old API for doing that does not work anymore.
2) Also download the clinical data and check the fields linked to ER, PgR and HER2 to assign labels manually.
However, once I download the EXP data, I basically lose all metadata and I don't know how to join the two files (EXP, clinical) in order to assign labels.
I know there must be a way of using the API to do what I need.
If you are familiar with R, then things are really easy. There is a Bioconductor package called TCGAbiolinks. For your case, examples 3 and 4 are useful. Make sure to use legacy = T, since I am not sure whether the subtypes exist for the updated data.
If you are uncomfortable with R, then you have to download the two datasets (clinical and mRNA) separately, making sure that the metadata file for each is downloaded along with it. Then a script has to be written to match the barcodes from both. A barcode typically looks like this: TCGA-G4-6317-02A-11D-2064-05. If you split it on '-', the first three fields (TCGA-G4-6317) identify the patient, and the fourth field (02A here) identifies the sample taken from that patient. Extract the barcodes from both datasets and compare the first three fields. This lets you map each expression file to the corresponding clinical record.
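To make the matching step concrete, here is a minimal Python sketch. It assumes you have already built a mapping from expression file name to barcode out of the metadata file, and a mapping from patient ID to subtype out of the article's dictionary or the clinical data. The file names, the helper names (patient_id, label_files) and the subtype label are made up for illustration; only the two barcodes come from this thread.

```python
def patient_id(barcode):
    """Return the first three '-'-delimited fields, e.g. 'TCGA-G4-6317'."""
    return "-".join(barcode.split("-")[:3])


def label_files(file_to_barcode, patient_to_subtype):
    """Attach a subtype to every expression file whose patient has a label."""
    labels = {}
    for file_name, barcode in file_to_barcode.items():
        subtype = patient_to_subtype.get(patient_id(barcode))
        if subtype is not None:  # skip files with no known label
            labels[file_name] = subtype
    return labels


# Hypothetical inputs: the file names and the subtype label are placeholders.
file_to_barcode = {
    "expr_file_1.txt": "TCGA-AR-A1AL-01",
    "expr_file_2.txt": "TCGA-G4-6317-02A-11D-2064-05",
}
patient_to_subtype = {"TCGA-AR-A1AL": "placeholder_subtype"}

labels = label_files(file_to_barcode, patient_to_subtype)
print(labels)
```

Keep in mind that matching on the patient alone ignores the fourth field. If a patient has both a tumor and a normal sample, you probably want to filter on that field first (in TCGA barcodes, sample codes 01 to 09 are tumor and 10 to 19 are normal) so that normal tissue does not inherit the tumor subtype.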
Hope this helps.