Question

Find an existing nucleotide sequence for a specific protein sequence through NCBI eutils

8.4 years ago

Jenez ▴ 540

Hello!

I am trying to find the nucleotide sequences that correspond to a list of protein sequences I obtained from a blastp search. So far I've tried to do this with NCBI's E-utilities (eutils).

I've used ESearch on the protein database with my protein IDs, which returns a result object that can be passed on to ELink. With ELink, I've targeted the nuccore and gene databases to find links between each protein sequence and its nucleotide sequence. This works for the most part, but there seems to be no single robust way to find the nucleotide sequence for a given protein sequence. For example, when the protein ID corresponds to a multi-species entry, there is no gene link. If you instead link to nuccore, the XML output sometimes lacks information that is essential for picking out the right sequence.
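To make the ELink step concrete, here is a minimal sketch of how the protein-to-nuccore lookup could be set up so that proteins without any link are easy to spot. It assumes the standard `eLinkResult` XML layout and passes one `id=` parameter per protein, which makes ELink return a separate `LinkSet` for each input ID. The function names and the numeric IDs in the usage example are made up for illustration, and nothing here contacts NCBI.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_elink_url(protein_ids, target_db="nuccore"):
    """Build an ELink URL with one id= parameter per protein.

    Repeating id= (instead of one comma-separated list) asks ELink to
    report the links for each input ID in its own LinkSet.
    """
    params = [("dbfrom", "protein"), ("db", target_db)]
    params += [("id", pid) for pid in protein_ids]
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


def parse_elink_xml(xml_text):
    """Map each source ID to its linked IDs (empty list if no link)."""
    links = {}
    root = ET.fromstring(xml_text)
    for linkset in root.iter("LinkSet"):
        source = linkset.findtext("IdList/Id")
        targets = [el.text for el in linkset.findall("LinkSetDb/Link/Id")]
        links[source] = targets
    return links


if __name__ == "__main__":
    # Hypothetical response: protein 111 has a nuccore link, 222 has none.
    sample = (
        "<eLinkResult>"
        "<LinkSet><DbFrom>protein</DbFrom><IdList><Id>111</Id></IdList>"
        "<LinkSetDb><DbTo>nuccore</DbTo><LinkName>protein_nuccore</LinkName>"
        "<Link><Id>999</Id></Link></LinkSetDb></LinkSet>"
        "<LinkSet><DbFrom>protein</DbFrom><IdList><Id>222</Id></IdList></LinkSet>"
        "</eLinkResult>"
    )
    print(build_elink_url(["111", "222"]))
    print(parse_elink_xml(sample))
```

Proteins that come back with an empty list are the ones that need a fallback route, such as the multi-species entries mentioned above.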

Is there a robust method for doing this that always works? Every avenue turns out to be more complicated than it should be.

efetch esearch eutils elink sequence • 3.3k views

updated 20 months ago by Ram 43k • written 8.4 years ago by Jenez ▴ 540

Did you ever find a way to do this? I am trying to find the longest transcript for each gene. So far, I can pull out all of the proteins for a gene and find the longest one, and then I want the coding sequence for that protein. I've asked my question here and here, but so far no luck! I cannot do it manually, as there are too many sequences to get.
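For illustration, the "pick the longest protein, then ask for its coding sequence" step might look roughly like the sketch below. It assumes the proteins for a gene are already available as FASTA text, and it assumes that EFetch accepts `rettype=fasta_cds_na` for the protein database, which should be checked against the current E-utilities documentation. The function names and the `P1`/`P2` records are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode


def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def longest_protein(records):
    """Return the identifier of the longest sequence."""
    return max(records, key=lambda name: len(records[name]))


def cds_fetch_url(protein_acc):
    """Build an EFetch URL that requests the CDS for one protein.

    ASSUMPTION: rettype=fasta_cds_na is valid for db=protein.
    """
    params = {
        "db": "protein",
        "id": protein_acc,
        "rettype": "fasta_cds_na",
        "retmode": "text",
    }
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical protein records for one gene.
    fasta = ">P1 isoform a\nMKT\nAA\n>P2 isoform b\nMKTAAAA\n"
    best = longest_protein(read_fasta(fasta))
    print(best, cds_fetch_url(best))
```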

7.7 years ago by Tom ▴ 40

What I ended up doing was to query eutils for a so-called Identical Protein Groups (IPG) report: http://www.ncbi.nlm.nih.gov/protein/281485550?report=ipg

From here, you often (but not always) find a link from the protein to a nucleotide sequence. I've set it up so that I can query NCBI for the IPG report in XML format and parse out the information that I want (namely the nucleotide accession, the coordinates, and the strand orientation). Using the nucleotide accession, I make another query to retrieve its XML report, and using the coordinates and strand information I retrieved earlier, I can parse out the nucleotide sequence I want.
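A rough sketch of this two-step lookup is below. It is not the script described above: it swaps the XML report for the tab-delimited IPG text, assumes the column names `Nucleotide Accession`, `Start`, `Stop`, `Strand` and `Protein`, and lets EFetch cut out the region through its `seq_start`, `seq_stop` and `strand` parameters (1 for plus, 2 for minus) instead of slicing a full record locally. The function names and the accessions in the usage example are invented, and the region only equals the coding sequence when the CDS is a single contiguous interval.

```python
import csv
import io
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def ipg_url(protein_id):
    """URL for the IPG report of one protein (assumed text output)."""
    params = {"db": "protein", "id": protein_id, "rettype": "ipg", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_ipg_table(text):
    """Return the rows that carry usable nucleotide coordinates.

    ASSUMPTION: tab-delimited report with the column names used below.
    Rows without an accession or with non-numeric coordinates are skipped.
    """
    hits = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        accession = (row.get("Nucleotide Accession") or "").strip()
        strand = (row.get("Strand") or "").strip()
        try:
            start = int(row.get("Start") or "")
            stop = int(row.get("Stop") or "")
        except ValueError:
            continue
        if not accession or strand not in ("+", "-"):
            continue
        hits.append({
            "nucleotide": accession,
            "start": start,
            "stop": stop,
            "strand": strand,
            "protein": (row.get("Protein") or "").strip(),
        })
    return hits


def cds_region_url(hit):
    """EFetch URL for the region of the nucleotide record given by the hit."""
    params = {
        "db": "nuccore",
        "id": hit["nucleotide"],
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": hit["start"],
        "seq_stop": hit["stop"],
        "strand": 1 if hit["strand"] == "+" else 2,
    }
    return EFETCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical report: the second row has no nucleotide link.
    report = (
        "Id\tSource\tNucleotide Accession\tStart\tStop\tStrand\tProtein\n"
        "1\tRefSeq\tNC_000000.1\t100\t400\t-\tWP_000000000.1\n"
        "1\tRefSeq\t\t\t\t\tWP_000000000.1\n"
    )
    for hit in parse_ipg_table(report):
        print(cds_region_url(hit))
```

Skipping incomplete rows rather than raising is one way to handle the reports that "look weird", at the cost of silently dropping proteins that then need a manual check.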

It's not beautiful by any means: it often crashes and requires manual fixing when the IPG reports look weird, and there are probably a billion ways of doing it better.

Also, if you have many sequences to retrieve, this might not be the best idea, as it might take a while and would put quite a heavy load on the NCBI servers.
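One way to keep the load down is to send IDs in batches and to space out the requests. The sketch below assumes the published E-utilities guidance of at most 3 requests per second without an API key (10 with one) and of switching to HTTP POST above roughly 200 IDs per request. `batches` and `Throttle` are hypothetical helper names, not part of any NCBI library.

```python
import time


def batches(ids, size=200):
    """Yield consecutive slices of at most `size` IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


class Throttle:
    """Block so that calls to wait() are at least 1/max_per_second apart."""

    def __init__(self, max_per_second=3):
        self.min_interval = 1.0 / max_per_second
        self.last = None

    def wait(self):
        now = time.monotonic()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()


if __name__ == "__main__":
    throttle = Throttle(max_per_second=3)
    protein_ids = [f"ID{i}" for i in range(450)]  # placeholder IDs
    for batch in batches(protein_ids, size=200):
        throttle.wait()
        # A real script would send one request for the whole batch here.
        print(len(batch))
```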

7.7 years ago by Jenez ▴ 540