Hi Community,
I am new here, and new to Python, so hopefully my question is eligible and clear. I have blastP results with protein accession IDs. I am using biopython to fetch the protein sequence, but I can't find a way to fetch the corresponding CDS from given GenPept file.
When I use the python script below , it works well for proteins like AAY99645.1, which have the CDS portion look like: CDS 1..1196 /gene="AHI1" /coded_by="DQ090887.1:339..3929"
The problem is that for a very large part of proteins that I need there is no CDS link from the protein entry (i.e WP_020126478.1). Therefore nothing is retrieved. The only way to get to CDS in these cases is to go to the "related info" link, parse it and find a cds portion there, as part of a whole shotgun genome sequence.
Is there something that I miss? Or is this the only way?
This is the code I use :
import re
import urllib
import fileinput
prots = [];
for line in fileinput.input():
line = line.rstrip('\r\n')
prots.append(line)
for i in range(len(prots)):
f = urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=gb&id=" + prots[i])
for line in f.read().split('\n'):
m = re.search(r'coded_by=\"(.+)\"$', line);
if m is not None:
target = m.group(1)
accn = ""
start = 0
end = 0
rc = False
seq = ""
m = re.search(r'^(.+):(\d+)\.\.(\d+)$', target)
if m is not None:
accn = m.group(1)
start = m.group(2)
end = m.group(3)
else:
m = re.search(r'^complement\((.+):(\d+)\.\.(\d+)\)$', target)
if m is not None:
accn = m.group(1)
start = m.group(2)
end = m.group(3)
rc = True
f = urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&id=" + accn + "&from=" + str(start) + "&to=" + str(end))
for row in f.read().split('\n'):
if len(row) == 0:
continue;
if row[0] == ">":
continue;
seq += row;
if rc == True:
revseq = seq[::-1]
rcseq = ""
d = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
for char in revseq:
rcseq += d[char]
seq = rcseq
print ">" + prots[i] + "|" + target + "\n" + seq
WP* records
So it is not surprising that you don't get a singe CDS sequence.
Not sure I understand... My specific example is annotated to a single genome shotgun sequence, still the record on protein database is not linked directly to its cds, which makes the job of retrieving it much more complicated than it is for other records..