Extracting sequences from a BLAST database

5.0 years ago

max_19 ▴ 170

Hi all!

I'm creating a blast database using:

makeblastdb -in proteins.fasta -dbtype prot -parse_seqids -out my_protein_db

I tried to extract some sequences from it with blastdbcmd, but I keep getting "Entry not found" errors.

My entries look like this (each contains one pipe): ABC|DEF60375.1, EHL|XP_003887.1

However, if I check the identifiers in my database using:

blastdbcmd -entry all -db my_protein_db -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"

I get lines like this:

> OID: 0 GI: N/A ACC: ABC|DEF60375.1 IDENTIFIER: gnl|ABC|DEF60375.1

> OID: 0 GI: N/A ACC: EHL|XP_003887.1 IDENTIFIER: lcl|EHL|XP_003887.1

So it seems NCBI has added some text plus a pipe in front of my identifiers. I could just prepend these extra letters to my entries when I use blastdbcmd, but I noticed that the prefix is not always the same: in some cases it is "gnl|" and in others it is "lcl|". Does anyone know how NCBI decides on this naming convention, and what is the best way to work around it?

Thanks very much for any input

sequencing genome protein blast • 2.5k views

ADD COMMENT • link updated 5.0 years ago by GenoMax 141k • written 5.0 years ago by max_19 ▴ 170

What do the FASTA headers in your proteins.fasta look like? What does grep "^>" proteins.fasta | head -3 show?

ADD REPLY • link 5.0 years ago by GenoMax 141k

like this:

>
MKFSTLLKSNKLQGWEDFYIQYDNLIKYLKTDPLKFKNLLIKENTKITTFFNEIEEQANQQKNELLMLVKNNLIYDSSTK
YKNFKDKLYQNELID

ADD REPLY • link 4.9 years ago by max_19 ▴ 170

Which version of blast are you using?

See this page for additional detail.

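Those are NCBI standard FASTA identifiers: "lcl|" marks a local identifier and "gnl|" a general database reference of the form gnl|database|identifier.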
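
Rather than guessing which prefix was added, one possible workaround is to look up the stored identifier for each accession before calling blastdbcmd. The sketch below is only an illustration under assumptions: the function name resolve_ids and the file names id_map.txt and wanted.txt are made up for this example, and it assumes the mapping file holds one "accession identifier" pair per line, separated by whitespace, as the %a and %i fields in the output above suggest.

```shell
# Hypothetical helper: translate accessions as written in the original FASTA
# into the full identifiers stored in the BLAST database.
#   $1  mapping file, one "<accession> <identifier>" pair per line
#   $2  file with the wanted accessions, one per line
resolve_ids() {
    map_file=$1
    wanted_file=$2
    # First file: remember accession -> identifier.
    # Second file: print the stored identifier, or warn on stderr if unknown.
    awk 'NR == FNR { full[$1] = $2; next }
         $1 in full { print full[$1]; next }
         { print "not found: " $1 > "/dev/stderr" }' "$map_file" "$wanted_file"
}
```

Assuming the database from the question, the mapping file could be produced with blastdbcmd -entry all -db my_protein_db -outfmt "%a %i" > id_map.txt, the translated list written with resolve_ids id_map.txt wanted.txt > full_ids.txt, and the sequences then extracted with blastdbcmd -db my_protein_db -entry_batch full_ids.txt. Whether the full identifiers are accepted as-is may depend on the BLAST+ version, so treat this as a starting point.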

ADD REPLY • link 5.0 years ago by GenoMax 141k

blast+/2.6.0

I will check them out, thanks!

ADD REPLY • link 5.0 years ago by max_19 ▴ 170
