Batch Retrieval of Conserved Domains from a List of IDs?
1
0
Entering edit mode
5.3 years ago
kayrouz.1 • 0

Background:

I have a large number (>100,000) of functionally unrelated protein sequences and I want to generate a roughly functional annotation for each. I've tried PROKKA and RASTtk, but they tend to return a large number of "hypothetical" results so I've changed my approach. So far I have used RPS-BLAST to query each against the NCBI conserved domain database, which successfully gives me the NCBI-CDD identifier of the best hit for each. Now I want to retrieve the "name" associated with each of these identifiers.

What I've tried so far:

I currently have a NCBI-CDD identifier that encodes the top domain hit for each of my sequences. The identifiers take the form of a 6-digit number (i.e. 240628). I want to retrieve the "name" associated with each of these identifiers (for 240628, the name is "Phosphoglycerate dehydrogenase (PGDH) NAD-binding and catalytic domains").

In the past I have been able to retrieve information from NCBI using the command line to cycle through EFetch commands and write the output to a file, such as shown below:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=WP_030019524.1&rettype=xml&retmode=ipg

However, when I try this approach with CDD identifiers, I just get a web-based XML readout that I can't really work with. See below for what I mean:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=cdd&id=240628

Is there any way to structure this command such that I receive a downloadable file that I can extract information from?

ncbi conserved domain entrez • 1.4k views
ADD COMMENT
4
Entering edit mode
5.3 years ago
vkkodali_ncbi ★ 3.7k

You can use Entrez Direct for this as follows:

esummary -db cdd -id 240628 | xtract -pattern DocumentSummary -element Id,Subtitle
240628  Phosphoglycerate dehydrogenase (PGDH) NAD-binding and catalytic domains

If you have a file with a long list of identifiers, you can use epost to first upload that list as follows:

epost -db cdd -input <file.txt> | esummary -db cdd | xtract -pattern DocumentSummary -element Id,Subtitle
ADD COMMENT
0
Entering edit mode

You have solved my problem, thank you!

ADD REPLY

Login before adding your answer.

Traffic: 2453 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6