Question

Fetching multiple FASTA from NCBI

3

7.9 years ago

mikael.lenz.strube ▴ 60

Hi,

I want to get a nucleotide FASTA file of all genes matching a search query. I have tried a couple of things using esearch, such as:

esearch -db nuccore -query "(mecA) GENE AND "bacteria"[porgn:__txid2]" | efetch -format fasta

esearch -db gene -query "(mecA) GENE AND "bacteria"[porgn:__txid2]" | elink -target nuccore | efetch -format fasta

but they both output very long nucleotide sequences, which makes me think I'm getting whole genomes in which the gene exists.

Funnily enough, it works fine when using -db protein, apart from obviously giving me protein FASTAs.

So, what am I doing wrong?

NCBI Efetch • 2.3k views

ADD COMMENT • link updated 7.9 years ago by piet ★ 1.8k • written 7.9 years ago by mikael.lenz.strube ▴ 60

score 0 · Answer 1 · 2016-05-25

0

7.9 years ago

piet ★ 1.8k

but they both output very long nucleotide sequences, which makes me think I'm getting whole genomes in which the gene exists

The mecA gene is located on a mobile element called the 'staphylococcal cassette chromosome' (SCC). GenBank holds a few older sequences that comprise only the mecA gene, which is about 2007 nt long. But most of the nucleotide sequences containing mecA are either full chromosomes of staphylococci or cover major parts of the SCC. If you are interested only in the coding sequence of mecA, you have to download the full sequences and then cut them locally. However, mecA sequences are very well conserved, so comparing them is quite boring.

ADD COMMENT • link 7.9 years ago by piet ★ 1.8k

0

Hi, thanks for the input.

mecA was just an example; the eventual purpose is to extract the sequences of any gene for a given name. I see the same with gyrA, lig, and so on. But I think a major issue is that I'm getting whole genomes and/or cassettes, as you say.

But the thing is: I can search the gene database on the NCBI homepage for e.g. mecA (or whatever) and then get the FASTA from each entry, so why can't that be automated? The sequences clearly exist and are correctly linked to the names.

ADD REPLY • link 7.9 years ago by mikael.lenz.strube ▴ 60

0

With efetch, you always get the whole sequence. There is no way to download only part of a sequence. You have to download the whole sequence and cut out the region of interest locally. It may be better to download in GenBank format, so that you get the positions of all the annotated genes along with the sequence.
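
As a rough illustration of the "cut locally" step, here is a minimal Python sketch using only the standard library. It assumes you have already read the full sequence into a string and taken a simple location string (such as `1234..3240` or `complement(1234..3240)`) from the feature table of the GenBank record. The function names and the toy sequence are made up for this example, and joined or fuzzy locations are deliberately not handled. A GenBank parser such as Biopython's would be a more complete option.

```python
import re

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Matches either 'complement(START..END)' or plain 'START..END'.
_LOCATION = re.compile(r"complement\((\d+)\.\.(\d+)\)|(\d+)\.\.(\d+)")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_by_location(seq, location):
    """Cut a simple GenBank-style location out of a sequence.

    Coordinates are 1-based and inclusive, as in GenBank feature tables.
    Only 'START..END' and 'complement(START..END)' are supported here.
    """
    match = _LOCATION.fullmatch(location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    if match.group(1) is not None:
        # Minus-strand feature: slice, then reverse-complement.
        start, end = int(match.group(1)), int(match.group(2))
        return reverse_complement(seq[start - 1:end])
    start, end = int(match.group(3)), int(match.group(4))
    return seq[start - 1:end]


# Toy sequence, not a real record.
toy = "ATGAAACCCGGGTAA"
print(extract_by_location(toy, "4..9"))
print(extract_by_location(toy, "complement(4..9)"))
```

In practice the location would come from the CDS or gene feature whose `/gene` qualifier matches the name you searched for.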

ADD REPLY • link 7.9 years ago by piet ★ 1.8k

0

With the E-utilities it is also possible to specify start, end, and strand. It's a shame that this functionality is still not implemented in Entrez Direct.
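
For reference, a minimal sketch of what such a request could look like when the E-utilities efetch URL is built directly in Python. The `seq_start`, `seq_stop`, and `strand` parameters are the ones efetch accepts for sequence databases (strand 1 is plus, 2 is minus). The accession and coordinates below are placeholders, not a real mecA location, and `build_efetch_url` is just a name chosen for this example.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, start, stop, minus_strand=False):
    """Build an efetch URL that requests a sub-range of a nuccore record as FASTA.

    start and stop are 1-based, inclusive coordinates on the record.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
        "strand": 2 if minus_strand else 1,
    }
    return EFETCH_URL + "?" + urlencode(params)


# Placeholder accession and coordinates; substitute values taken from
# the annotation of the record you are interested in.
url = build_efetch_url("ACCESSION_HERE", 1001, 3007, minus_strand=True)
print(url)
```

The resulting URL can then be fetched with `urllib.request.urlopen(url)`. The gene coordinates still have to come from somewhere, for example from the feature table of the GenBank record.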

ADD REPLY • link 7.9 years ago by 5heikki 11k

0

Alright, so I guess I have to go through GenBank and fetch the positions along with the entire sequence. Is there any clever way of doing that, apart from writing a filter manually?

Any other suggestions?

ADD REPLY • link 7.9 years ago by mikael.lenz.strube ▴ 60