Sequence id conversion & extraction
8.0 years ago

How can I convert the following IDs to GenBank nucleotide/gene IDs and retrieve the corresponding nucleotide sequences in fasta format?

WP_003873519.1 WP_003892827.1 WP_003898538.1 WP_011739072.1 WP_023365853.1 WP_023366251.1 WP_023367098.1 WP_023369150.1 WP_023369156.1 WP_023369158.1 WP_023370844.1 WP_023370846.1 WP_023370878.1 WP_023370926.1 WP_023370928.1 WP_023370930.1 WP_023370920.1 WP_023370924.1 WP_023370932.1 WP_023370934.1 WP_023370951.1 WP_023370953.1 WP_023370955.1 WP_023370958.1 WP_023370962.1 WP_023370964.1 WP_023370966.1 WP_023370968.1 WP_023371064.1 WP_023371068.1 WP_023371072.1 WP_023371103.1 WP_023372313.1 WP_023372327.1 WP_023372450.1 WP_023373686.1 WP_023373729.1 WP_023373731.1 WP_023364557.1 WP_023369615.1 WP_023370856.1 WP_023371070.1 WP_023373127.1 WP_036395287.1

conversion retrieval gene id • 1.7k views
8.0 years ago
Sej Modha 5.3k

To retrieve the gene ID from the protein accession, you can use the following command:

esearch -db protein -query "WP_003873519.1"|elink -target gene|efetch

The following command will give you all CDS records from the linked RefSeq nucleotide entries; you can then extract WP_003873519.1 from the resulting CDS fasta file.

esearch -db protein -query "WP_003873519.1" | elink -target gene | elink -target nuccore | efilter -query refseq | efetch -format fasta_cds_na > output.fa
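
The extraction step mentioned above could be sketched roughly as follows. This is only a minimal sketch: `extract_cds` is a hypothetical helper name, and it assumes each header in the CDS fasta file carries a `[protein_id=...]` tag, which is how `fasta_cds_na` output is usually annotated. Check the headers in your own output.fa before relying on it.

```shell
# Hypothetical helper (not part of EDirect): print only the CDS record
# whose header contains the given protein accession.
# Assumption: headers include a tag such as [protein_id=WP_003873519.1].
extract_cds() {
    # $1 = protein accession, $2 = multi-fasta file of CDS records
    awk -v id="$1" '
        # On every header line, decide whether this record should be kept
        /^>/ { keep = (index($0, "[protein_id=" id "]") > 0) }
        # Print header and sequence lines while inside a kept record
        keep
    ' "$2"
}

# Example usage (output.fa comes from the efetch command above):
# extract_cds WP_003873519.1 output.fa > WP_003873519.1_cds.fa
```

Matching on the full `[protein_id=...]` tag rather than the bare accession avoids picking up records that merely mention a similar ID elsewhere in the header.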
8.0 years ago
GenoMax 141k

If you have access to the pre-formatted refseq_protein blast database, the easiest option is to use the blastdbcmd utility (part of blast+) to iterate over the IDs and get the protein sequences in fasta format, like so:

$ blastdbcmd -db /path_to/refseq_protein -entry WP_023370932.1 -outfmt "%f"

You could also use the -entry_batch option and provide a file with the IDs (one ID per line).

$ blastdbcmd -db /path_to/refseq_protein -entry_batch file_w_ID -outfmt "%f"
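
Since the IDs in the question are given on a single line separated by spaces, the one-ID-per-line file for -entry_batch has to be built first. A minimal sketch of that step is below; `ids_to_lines` and `ids.txt` are hypothetical names used only for illustration.

```shell
# Hypothetical helper: read IDs separated by any whitespace on stdin and
# print them one per line, dropping empty lines and duplicate IDs.
ids_to_lines() {
    tr -s '[:space:]' '\n' | awk 'NF && !seen[$0]++'
}

# Example usage (ids.txt is assumed to hold the pasted ID list):
# ids_to_lines < ids.txt > file_w_ID
# blastdbcmd -db /path_to/refseq_protein -entry_batch file_w_ID -outfmt "%f"
```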
8.0 years ago
Sej Modha 5.3k

Are these protein IDs? Would you like to retrieve gene sequence or the protein sequence in fasta format?

These are RefSeq protein IDs. I want to retrieve the gene sequences in fasta format.
