Question about clinical data discrepancies between cBioPortal and GDC data
2.4 years ago
Vasu ▴ 540

I recently downloaded the TCGA colorectal clinical data from the GDC portal. From this I got the following files.

nationwidechildrens.org_clinical_patient_coad.txt



I combined both files, which gives data for a total of 628 patients. Among them I see:

563 - Alive


For example

times   bcr_patient_barcode   patient.vital_status
154         TCGA-3L-AA1B              Alive
1200        TCGA-5M-AATE              Alive
648         TCGA-A6-2671              Alive


All 628 patients have a value for Days_to_Last_followup.

I then checked the cBioPortal TCGA Provisional colorectal clinical data. Here the patient_vital_status counts are different:

502 - Alive
8 - NA


In this dataset, almost 60 patients have NA for Days_to_Last_followup. I'm interested in doing survival analysis, and I'm now confused about which dataset to use.

For example

times   bcr_patient_barcode   patient.vital_status
154         TCGA-3L-AA1B              Alive
1200        TCGA-5M-AATE              Alive


So, going by the data above, GDC and cBioPortal show different information.

It looks like the cBioPortal clinical data is the more up-to-date one, as it shows more patients as Dead. But why do some patients in the cBioPortal clinical data not have Days_to_Last_followup? Which of the two is the right one for the analysis?

Thank you.

survival gdc cbioportal tcga clinicaldata • 838 views
1

The GDC should be the most up to date, as it is the primary source of TCGA data. cBioPortal is a third-party resource (developed at MSKCC) that is not part of the NIH. The issue may be that the clinical data are referencing different samples / aliquots. cBioPortal may also have imputed missing values that they encountered in the original data that they pulled from the GDC.

I would always go by the data at the GDC because it is the primary source. It is common to find discrepancies between the GDC and third-party websites. You will be fine as long as you quote the exact source and version of your data. If no version is available, then date-stamp it in your methods.
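As a rough illustration of the date-stamping idea, here is a minimal Python sketch that records a checksum, size, and date for a downloaded file so the exact input can be quoted in the methods. The function names `provenance_record` and `provenance_for_path` and the field names are hypothetical, chosen only for this example; they are not part of any GDC or cBioPortal tooling.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def provenance_record(name, data, source):
    """Describe a downloaded file by name, source, checksum, size and date."""
    return {
        "file": name,
        "source": source,  # free-text label, e.g. "GDC" or "cBioPortal"
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        # Date the record was made (UTC), for use when no version exists.
        "recorded_utc": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
    }


def provenance_for_path(path, source):
    """Convenience wrapper that reads the file from disk first."""
    p = Path(path)
    return provenance_record(p.name, p.read_bytes(), source)
```

For example, `provenance_for_path("nationwidechildrens.org_clinical_patient_coad.txt", "GDC")` would return a small dictionary that can be saved alongside the analysis. The checksum also makes it easy to tell later whether a re-downloaded file has changed.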

Obviously patients cannot come back to life, so there are logical reasons behind the discrepancies that you observe.

0

If GDC is the most up-to-date source compared to cBioPortal, why do I see 65 Dead in GDC and 130 Dead in cBioPortal? That is not a small difference.

1

They could simply be referencing different patients from the same cancer - I am not sure. I have also heard that the GDC clinical data contains errors. It would be interesting to also see how the patient numbers appear on the GDC Legacy Archive. I would contact both cBioPortal (MSKCC) and GDC.

As the analyst, in certain situations, the best we can do is just date-stamp and version control the data that's given to us, i.e., in order to protect our own butts.

0

Yes, there may be different patients in cBioPortal and GDC, but in my question there is one patient, TCGA-A6-2671, who is alive in GDC and dead in cBioPortal.

0

The information / paper trail for the patient may be difficult to find. Another option: just set all discrepancies between the GDC and cBioPortal to NA, although then you reduce your sample size (n).
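A minimal sketch of that "set discrepancies to NA" option, in plain Python. It assumes both sources have been exported as tab-separated text with the column names used in the question (`bcr_patient_barcode`, `patient.vital_status`); the real files may use different headers. The function names are hypothetical, and the two toy tables only contain the statuses mentioned in this thread.

```python
import csv
import io


def read_vital_status(text, id_col="bcr_patient_barcode",
                      status_col="patient.vital_status"):
    """Map patient barcode -> vital status from tab-separated text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[id_col]: row[status_col] for row in reader}


def reconcile(gdc, cbio):
    """Keep a status only where both sources agree; otherwise use None (NA).

    Only patients present in both sources are returned.
    """
    merged = {}
    for barcode in sorted(set(gdc) & set(cbio)):
        a, b = gdc[barcode], cbio[barcode]
        agree = a.strip().lower() == b.strip().lower()
        merged[barcode] = a if agree else None
    return merged


# Toy input for illustration only.
gdc_text = (
    "bcr_patient_barcode\tpatient.vital_status\n"
    "TCGA-3L-AA1B\tAlive\n"
    "TCGA-A6-2671\tAlive\n"
)
cbio_text = (
    "bcr_patient_barcode\tpatient.vital_status\n"
    "TCGA-3L-AA1B\tAlive\n"
    "TCGA-A6-2671\tDead\n"
)

status = reconcile(read_vital_status(gdc_text), read_vital_status(cbio_text))
discordant = [barcode for barcode, value in status.items() if value is None]
print(discordant)
```

Printing the discordant barcodes first is probably worthwhile before dropping anything, since the list itself is what you would send to the GDC or cBioPortal when asking about the discrepancy.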

0

I see that the patients are the same in both portals.

From the same place where I downloaded the patient clinical data for both colon and rectal in GDC, I have also downloaded the following files:

nationwidechildrens.org_clinical_follow_up_v1.0_coad.txt



I see that the vital status in these files differs from the patient clinical data. What are these follow_up files?

GDC

0

Getting the most out of the clinical data from the TCGA is indeed difficult, I admit. It has a high level of missingness.
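To get a feel for how much is missing before choosing a source, something like the following sketch could be used to tabulate the fraction of missing values per column. The set of placeholder strings in `MISSING` is an assumption and should be adjusted to whatever the downloaded files actually contain; the function name and the toy table (patients P1 to P4) are made up for illustration.

```python
import csv
import io

# Assumed placeholder strings for missing values; adjust to the real files.
MISSING = {"", "NA", "[Not Available]", "[Not Applicable]", "[Unknown]"}


def missing_fraction(text, delimiter="\t"):
    """Return {column: fraction of rows whose value is a missing placeholder}."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return {}
    return {
        col: sum(1 for r in rows if (r[col] or "").strip() in MISSING) / len(rows)
        for col in rows[0]
    }


# Made-up toy table, not real patient data.
toy = (
    "patient\tvital_status\tdays_to_last_followup\n"
    "P1\tAlive\t154\n"
    "P2\tAlive\t1200\n"
    "P3\tDead\t[Not Available]\n"
    "P4\t[Not Available]\tNA\n"
)

fractions = missing_fraction(toy)
print(fractions)
```

Running the same function on the GDC and cBioPortal tables gives a quick side-by-side view of which columns are usable for survival analysis in each.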