Biostar Beta. Not for public use.
How to get fasta sequences for CDS if I have proteins IDs?
14 months ago

I have a large table where each column contains protein IDs of a particular group of orthologs. How do I map these protein IDs to gene IDs and then get a file with fasta sequences of all genes for each column?

sequence gene • 412 views

which organism? can you use Ensembl/BioMart? how can I bake a pie?

Let's put a bit more flesh on that bone: http://www.ensembl.org/biomart/martview/

Always provide a few examples when asking this type of question. Protein IDs could be anything, and the answer will depend on what kind they are.

NCBI's Unix utilities would almost certainly work if the IDs are from GenBank.

Oh, my bad. All IDs are from GenBank Escherichia genome assemblies (.faa files). For example, AAN78512.1, BAB33431.1, BAB33432.1.

P.S. I know that I can simply go to NCBI and get CDS for each protein manually but the question is how to do this for a large number of ID groups. I've heard something about EDirect but maybe there is a common way to do this with one line.

If you need all the CDSs for E. coli O157:H7, those are available here. If the IDs are from different genomes, then it is a different problem. Let me look into it.

IDs are from different genomes. In fact, I have a table with protein IDs:

           group1   group2   group3   group4   ... 
bac1          ID1      ID2      ID3      ID4
bac2          ID5      ID6      ID7      ID8
bac3          ID9     ID10     ID11     ID12
...

and I need to get a file with fasta sequences of CDS for each group. Suppose I have all .fna assembly files. Could I use BioPython to get the files?
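Before fetching anything, the table has to be turned into one ID list per group. Here is a minimal sketch using only the Python standard library. The function name `parse_group_table` is made up for illustration, and it assumes the table is whitespace-delimited, that the first row holds the group names, that the first field of each later row is the genome label, and that no cell is empty (a missing ortholog would shift the columns, so a tab-separated file with explicit placeholders would be safer in practice).

```python
def parse_group_table(text):
    """Return {group_name: [protein_id, ...]} from a whitespace-delimited table.

    Assumes a header row of group names, then one row per genome whose
    first field is the genome label (e.g. bac1). Hypothetical helper.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # Header row holds the group names; drop a literal "..." placeholder.
    header = [name for name in rows[0] if name != "..."]
    groups = {name: [] for name in header}
    for fields in rows[1:]:
        if fields[0] == "...":
            continue  # placeholder row, not data
        # fields[0] is the genome label; the remaining fields align with the header.
        for name, protein_id in zip(header, fields[1:]):
            groups[name].append(protein_id)
    return groups
```

Each value in the returned dictionary is the list of protein IDs for one ortholog group, which is what you would then pass to whatever retrieves the CDS sequences and write to one FASTA file per group.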

13 months ago
h.mon 25k
Brazil

Try:

efetch -db protein -format fasta_cds_na -id AAN78512

edit: works the same with:

efetch -db protein -format fasta_cds_na -id AAN78512.1

Thank you! But is it possible to use the command for more than 500 IDs? The documentation says 'a comma-delimited list of UIDs may be provided... but if more than about 200 UIDs are to be provided, the request should be made using the HTTP POST method'.

You could run the efetch command in a loop. Be sure to sign up for an NCBI API key and use it. Pace those queries sensibly so that your IP address does not get banned.
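One way to write such a loop, as a rough sketch rather than a tested pipeline: split the IDs into batches well under the roughly 200-UID limit quoted above and call the same efetch command once per batch, pausing in between. The helper names, the batch size of 100 and the one-second pause are my own choices, not NCBI recommendations. It assumes EDirect is installed and on your PATH, that efetch accepts a comma-separated list after -id, and that your API key is exported as the NCBI_API_KEY environment variable, which is where EDirect is documented to look for it.

```python
import subprocess
import time


def chunked(ids, size):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_efetch_command(ids):
    """Build the efetch call from the answer above for one batch of IDs."""
    return ["efetch", "-db", "protein", "-format", "fasta_cds_na",
            "-id", ",".join(ids)]


def fetch_cds(ids, batch_size=100, pause=1.0, run=subprocess.run):
    """Return CDS FASTA text for all IDs, using one efetch call per batch.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without touching the network.
    """
    chunks = []
    for batch in chunked(list(ids), batch_size):
        result = run(build_efetch_command(batch),
                     capture_output=True, text=True, check=True)
        chunks.append(result.stdout)
        time.sleep(pause)  # wait between requests to go easy on NCBI
    return "".join(chunks)
```

For one group of orthologs, something like `open("group1_cds.fasta", "w").write(fetch_cds(ids_for_group1))` would then write the file, where `ids_for_group1` is a placeholder for the list of protein IDs in that column.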

Hi! When I try to run the same command, efetch does not take any action but just prints out the help. Any clue why this happens?

This can have many causes; the most frequent is a typo. If you want more detailed help, please post your exact command here, using the 101010 code formatting button (fifth in the ribbon above).
