This question might be more R than bioinformatics, but I'm trying to find which part of the logical pipeline causes an error, so please bear with me.
I recently ran a cuffdiff operation and used cummeRbund to read the output. Then, with the diffData() function, I extracted differential expression data frames. I am now using the gene_id to fetch a human-readable description of each significantly differentially expressed gene; the fetching is done with reutils, the R package for eutils.
In the search query, I need to restrict results to mouse genes, so I query the gene database with the term gene_id AND Mus musculus[organism]. I then manipulate the output (with content, strsplit, and subscripting) to pick just the first line for annotation. (I know it's a jury-rigged solution, and if you have better alternatives, please suggest them, but that is not the primary problem.)
When I run the command:
diff.genes.sig$gene_annotation <- strsplit(content(efetch(esearch(term = paste(diff.genes.sig$gene_id, " AND Mus musculus[organism]", sep = ""), db = "gene"), rettype = "gene_table", retmode = "text", retmax = 1), as = "text")[1], split = "\n")[[1]][1]
Every row in the data frame is annotated with the output from the first row. Is the fetching not being repeated for each row? Is there some kind of cache in action?
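One possible explanation, offered as a sketch rather than a confirmed diagnosis: no cache is needed to get this behaviour. The whole right-hand side is evaluated once, the trailing [[1]][1] reduces it to a single string, and R then recycles that one string down the column. The snippet below reproduces the effect offline; the gene ids and the `fetched` text are made-up stand-ins, and no reutils call is made.

```r
# paste() is vectorized: it builds one query string per gene id.
gene_ids <- c("Actb", "Gapdh", "Trp53")  # hypothetical ids for illustration
terms <- paste(gene_ids, " AND Mus musculus[organism]", sep = "")

# Stand-in for the text that comes back from the fetch (assumption, not real output).
fetched <- "first line\nsecond line"

# [[1]][1] keeps exactly one string, however many ids went in.
first_line <- strsplit(fetched[1], split = "\n")[[1]][1]

# Assigning a length-1 value to a column recycles it to every row.
df <- data.frame(gene_id = gene_ids, stringsAsFactors = FALSE)
df$gene_annotation <- first_line
print(df)
```

However esearch treats a vector of terms, the subscripting at the end leaves only one value to assign, so every row ends up with the same annotation.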
I use a for loop to bypass this now. But there has to be a better way, right? What are your thoughts on this?
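An alternative to the explicit loop, as a minimal sketch: wrap the per-gene lookup in a function and map it over the ids with vapply, so each gene triggers its own search and fetch. The names fetch_first_line, annotate_genes, fake_fetch and demo are hypothetical helpers introduced here for illustration. The reutils calls mirror the ones in the command above and assume the package is installed and the network is reachable; the demonstration at the bottom uses a stand-in fetcher so it runs offline.

```r
# Fetch the first line of the gene table for ONE gene id.
# Needs reutils and network access; nothing is fetched until it is called.
fetch_first_line <- function(gene_id) {
  hits <- reutils::esearch(
    term = paste0(gene_id, " AND Mus musculus[organism]"),
    db = "gene"
  )
  txt <- reutils::content(
    reutils::efetch(hits, rettype = "gene_table", retmode = "text", retmax = 1),
    as = "text"
  )
  strsplit(txt[1], split = "\n")[[1]][1]
}

# Apply a per-gene fetcher to every id; vapply enforces one string per id.
annotate_genes <- function(gene_ids, fetch_one = fetch_first_line) {
  vapply(as.character(gene_ids), fetch_one, character(1), USE.NAMES = FALSE)
}

# Offline demonstration with a stand-in fetcher (no network call).
fake_fetch <- function(gene_id) paste("annotation for", gene_id)
demo <- data.frame(gene_id = c("Actb", "Gapdh", "Trp53"), stringsAsFactors = FALSE)
demo$gene_annotation <- annotate_genes(demo$gene_id, fake_fetch)
print(demo)
```

With the real data this would be called as diff.genes.sig$gene_annotation <- annotate_genes(diff.genes.sig$gene_id). It still issues one request per gene, so it is no faster than the for loop, and ids that return no hits would probably need their own handling.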