GenomicDataCommons request timeouts cases() %>% ... %>% results_all()
14 months ago
mk • 90

I've been experimenting with the GenomicDataCommons package to handle query work against the GDC API. Requests that return more than a certain number of records seem to time out, and I can't find a direct way around this using the piped syntax. Has anyone else had luck with this?

There is a results(size = n) method, but its syntax seems to allow only the first n records to be accessed.

Here is an example query (should return 400-500 records):

proj <- 'TCGA-COAD'
case_data <- cases() %>%
  GenomicDataCommons::filter(~ project.project_id == proj) %>%
  GenomicDataCommons::expand('diagnoses') %>%
  results_all() %>%
  as_tibble()

Gives, after a few moments:

Error in is.response(x) : Internal Server Error (HTTP 500).

It runs here but there's no output (?):

require(GenomicDataCommons)
require(tibble)
case_data <- cases() %>%
   GenomicDataCommons::filter(~ project.project_id == proj) %>%
   GenomicDataCommons::expand('diagnoses') %>%
   results_all() %>%
   as_tibble()
case_data
# A tibble: 0 x 0


sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=pt_BR.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=pt_BR.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tibble_2.0.1             GenomicDataCommons_1.6.0 magrittr_1.5            
[4] arm_1.10-1               lme4_1.1-19              Matrix_1.2-15           
[7] MASS_7.3-51.1

Thanks @Kevin Blighe. That was a sloppy cut/paste job: I forgot to add the project to the filter() call. Edited above.

14 months ago
National Institutes of Health, Bethesda…

I should clean up the documentation, but results_all() is a convenience wrapper that is not too smart: it simply tries to return all results in one trip to the server. This can fail for multiple reasons related to the size of the result set. The better approach (and the only one in the case of large result sets) is to page through the results:

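library(GenomicDataCommons)
library(tibble)   # as_tibble()

proj <- 'TCGA-COAD'
query = cases() %>%
    GenomicDataCommons::filter(~ project.project_id == proj) %>%
    GenomicDataCommons::expand('diagnoses')
# Total number of matching records; qualified so dplyr::count() cannot mask it
count = query %>% GenomicDataCommons::count()
size = 50   # records per request
# Fetch one page of `size` records per request
reslist = lapply(seq(1, count, size), function(start) {
    query %>%
        results(size = size, from = start) %>%
        as_tibble()
})
case_data = dplyr::bind_rows(reslist)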

Unfortunately, the size parameter really requires trial-and-error to find the largest "working" setting since the results can vary quite significantly in volume. Instead, I usually just choose a smallish number like 50 or so and wait a few extra seconds. These calls can, in theory, be parallelized using something like BiocParallel to get really fancy (and introduce complexity).
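The trial-and-error on size can be partly automated. What follows is only a minimal sketch, not part of the GenomicDataCommons package: fetch_paged() and its fetch_page(from, size) callback are hypothetical names, and the callback is assumed to return a data frame for one page or throw an error when the request fails. The helper halves the page size whenever a request fails and retries from the same offset. The demo uses an offline stand-in for the server so it runs without network access.

```r
# Hypothetical helper (not a GenomicDataCommons function): page through a
# result set, halving the page size whenever a request fails.
# `fetch_page(from, size)` is an assumed callback returning a data frame.
fetch_paged <- function(fetch_page, total, size = 50, min_size = 1, first = 1) {
  pages <- list()
  from <- first
  last <- first + total - 1
  while (from <= last) {
    n <- min(size, last - from + 1)          # do not run past the last record
    page <- tryCatch(fetch_page(from, n), error = function(e) NULL)
    if (is.null(page)) {
      if (n <= min_size) stop("request failed even at the minimum page size")
      size <- max(min_size, n %/% 2)         # back off, retry the same offset
      next
    }
    pages[[length(pages) + 1]] <- page
    from <- from + n
  }
  do.call(rbind, pages)
}

# Offline stand-in for a server that rejects pages larger than 20 records.
fake_fetch <- function(from, size) {
  if (size > 20) stop("Internal Server Error (HTTP 500)")
  data.frame(id = seq(from, length.out = size))
}

res <- fetch_paged(fake_fetch, total = 120, size = 50)
```

Against the real API, the callback would presumably wrap the call from the loop above, something like function(from, size) query %>% results(size = size, from = from) %>% as_tibble(), with total taken from the count. The first argument defaults to 1 only to match the seq(1, count, size) offsets used above.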


@Sean Davis, thanks a lot for this info. Frankly this package is a life saver for me, since it addresses the need for general purpose GDC usage.

14 months ago
mk • 90

I have a fix for this, but it's not very satisfactory. Based on Sean Davis' blog post, I retrieved sequencing data and extracted the case IDs from that, then looped over the case IDs, fetching diagnosis data 50 records at a time. The fact that this worked is a mystery to me, since the query against the files() endpoint returns even more records than the query posted above (each case may contain multiple samples).

First get the files, and their associated cases:

proj <- 'TCGA-COAD'
tm_ge_files = files() %>%
    GenomicDataCommons::filter(~ cases.samples.sample_type == 'Primary Tumor' &
                                 cases.project.project_id == proj &
                                 analysis.workflow_type == "HTSeq - Counts") %>%
    expand(c('cases', 'cases.samples')) %>%
    results_all() %>%
    as_tibble()
tm_cases = bind_rows(tm_ge_files$cases, .id = 'file_id')

Now get the diagnoses:

# A case can appear once per file, so de-duplicate the IDs first
case_ids <- unique(tm_cases$case_id)
# Start index of each chunk of up to 50 IDs
starts <- seq(1, length(case_ids), by = 50)
# Fetch diagnoses chunk by chunk; each ID is requested exactly once
# (the earlier while loop re-fetched the last ID when it ended on a chunk boundary)
clin <- do.call(rbind, lapply(starts, function(left) {
  right <- min(left + 49, length(case_ids))
  gdc_clinical(case_ids[left:right])$diagnoses
}))

Yes, I was just about to say that I reproduced the error and that you should report on the Bioconductor forum (and link back to this thread), where Sean Davis may pick it up more quickly: https://support.bioconductor.org/t/Latest/


OK, I posted a link on the Bioconductor forum. If an answer gets posted there, I'll update this thread.


Thanks - I'm a user there too but much less reputation score: https://support.bioconductor.org/u/16406/
