This site is a beta test.
Question: Python Programmatically obtaining the corresponding wildtype DNA sequence of a RefSeq Protein Identifier (e.g WP_016693432.1)
1
Entering edit mode
11 months ago
ahir29 • 10

Dear All, The question is quite self explanatory - what is the easiest way (using python) to get the corresponding DNA sequence from which the translated protein sequence is available in the RefSeq NCBI database.

For example, I have a fasta file after a blast search -

>WP_010594839.1 MULTISPECIES: NAD(P)-dependent alcohol dehydrogenase [Rhodococcus]
MKALQYTEIGSEPVVVDVPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYIYGLPLTLGHEGVGTVAELGAGVTGFET 
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH 
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA 
TAVFDFVGAQSTIDTAQQVVAIDGHISVVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMDVVDLARAGRLDIHTETF 
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_024100401.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus pyridinivorans]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTVAELGDGVTGFET 
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH 
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA 
TVVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF 
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_016693432.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus rhodochrous]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTIAELGAGVTGFEK 
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAAPLTDAGLTPYH 
AISRVLPLLGPGSMAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA 
TAVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF 
TLDEGPTAYRRLREGSIRGRGVVVPG

I am using Biopython currently to try and map the refseq ID (e.g WP_010594839.1) and the Entrez.fetch to somehow obtain the corresponding DNA sequence. But, I am finding there is no clear way even non-programmatically to go and do something like this. The closest I can get is getting the whole genome of the file (by clicking the "nucleotide" or "genome" links) on the right side of the webpage of WP_010594839.1 (i.e. https://www.ncbi.nlm.nih.gov/protein/WP_010594839.1) - (something I wouldnt know know how to do programmatically even if the whole genome was what I am after).

Can anyone please give me some advice on whether there is some super obvious way that I have totally missed, to get the corresponding coding DNA sequence of a protein given its reference number? Programmatically is preferred, but even manually is a good start for now!

Thank you

ADD COMMENTlink 15 months ago ahir29 • 10 • updated 11 months ago Biostar 20
2
Entering edit mode
15 months ago
genomax 68k
United States

By representing identical proteins using a single non-redundant protein accession number (with the prefix WP_), redundancy in the database is significantly reduced.

So WP numbers actually represent multiple proteins though only one is indicated in the header. More information here.

ADD COMMENTlink 15 months ago genomax 68k
Entering edit mode
0

Thanks genomax - I cant believe my vigilant googling didnt get me to that page explaining where all the transcripts of a reference seq is kept! Need to improve my searching skills:) I am assuming now using the Entrez etuils wrapper in Biopython can get me the necessary transcripts. I will take a look at doing it myself!

ADD REPLYlink 15 months ago
ahir29
• 10
Entering edit mode
0

That may not be directly possible since WP accessions are not linked on backend with a single DNA sequence. I did try to see if efetch worked briefly (it did not).

ADD REPLYlink 15 months ago
genomax
68k
Entering edit mode
0

Is there no way to use the Entrez utils wrapper in biopython to access the "identical proteins" link ? I.e for https://www.ncbi.nlm.nih.gov/protein/WP_010594839.1 ; this would lead to "https://www.ncbi.nlm.nih.gov/ipg/?term=WP_010594839.1" ; where any of the entries there gets me to a page which has a DNA transcript which I am happy to use. How do I get to that particular entry without web scraping or using any other tool that python/biopython?

ADD REPLYlink 15 months ago
ahir29
• 10
Entering edit mode
0

A lot of times I find eUtils to be in "oh so close but no cigar" category (Disc: I may be doing something wrong as well).

Something odd seems to be happening with querying ipg database and the retrieval format. You can test this and let me know if you see the same thing.

$ esearch -db ipg -query "WP_010594839.1" | efetch -format fasta
>CAA39212.1 extensin (class I) [Solanum lycopersicum]
PPSPSPPPPYYYKSPPPPSPSPPPPYYYKSPPPPSPSPPPPYYYKSPPPPSPSPPPPYYYKSPPPPDPSP
PPPYYYKSPPPPSPSPPPPSPSPPPPTYSSPPPPPPFYENIPLPPVIGVSYASPPPPVIPYY

$ esearch -db ipg -query "WP_010594839.1" | efetch -format docsum

https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20171128/esummary_ipg.dtd">

<DocumentSummarySet status="OK">
<DbBuild>Build180808-2026m.1</DbBuild>

<DocumentSummary><Id>1345531</Id>
        <Caption>WP_010594839</Caption>
        <Title>NAD(P)-dependent alcohol dehydrogenase</Title>
        <IPG>1345531</IPG>
        <TaxId>1827</TaxId>
        <ProteinCount>27</ProteinCount>
        <Accession>WP_010594839.1</Accession>
        <ModificationDate>2018/05/08 00:00</ModificationDate>
        <Slen>346</Slen>
        <Organism>Rhodococcus</Organism>
        <BlastName>high GC Gram+</BlastName>
        <Div>BCT</Div>
        <TaxIdCount>5</TaxIdCount>
        <SpeciesCount>2</SpeciesCount>
        <GeneraCount>1</GeneraCount>
        <AssemblyCount>14</AssemblyCount>
        <KingdomCount>1</KingdomCount>
        <GenomeCount>0</GenomeCount>
        <RowCount>37</RowCount>
</DocumentSummary>

</DocumentSummarySet>
ADD REPLYlink 15 months ago
genomax
68k
Entering edit mode
0

Yup, i see the same thing. You are right, I am find the etuils hit and miss. I am resorting to webscraping the ipg page now. No other choice.

ADD REPLYlink 15 months ago
ahir29
• 10

Login before adding your answer.

Powered by the version 1.5.2