How do I set filter for NCBI esearch to get fasta for Genes only?
2
2
Entering edit mode
5.5 years ago
MAPK ★ 2.1k

I am using esearch query as $query = "SS1G_03709+AND+gene[filter]";, but it gives me all (gene+ mRNA+genome sequences). What filter do I need to use so I only get gene sequences in my search? I tried a few filters from here, but couldn't find anything to limit my search for genes.

esearch • 4.0k views
ADD COMMENT
7
Entering edit mode
5.5 years ago
GenoMax 141k

One way (sequence truncated for brevity):

$ esearch -db nuccore -query "SS1G_01676 [GENE]"  | efetch -format fasta | grep ">" | grep -v "genome" | awk '{print $1}' | epost -db nuccore | efetch -format fasta
>XM_001597432.1 Sclerotinia sclerotiorum 1980 UF-70 hypothetical protein partial mRNA
ATGGCGCCCAAATTTTCGGAAGACGAAATTGATGATTTGATATATTTTGCTCGTATTGGGGACAATGATG
AATTCGAGAAATTGAGGGAGGAGTTATGTAAAAGGGAAGGATGTTCGATTGCTGAACTGCTAGAGACTGC

$ esearch -db nuccore -query "SS1G_03709 [GENE]"  | efetch -format fasta | grep ">" | grep -v "genome" | awk '{print $1}' | epost -db nuccore | efetch -format fasta
>XM_001595570.1 Sclerotinia sclerotiorum 1980 UF-70 hypothetical protein partial mRNA
ATGCATTTCTCAACTGCAAAAACGCTTCTTCCTCTCGCAGTTCTAGTTTCCTATACCACCGCTCAAACAA
CAGCTGCAGCACCACCTGTTGCTAGTGCTCCTACAGGCGGCACTTCTAGTACTTGTCTCGGACAAAATGT

@vkkodali has much neater ways of finding this info in that answer. Taking some inspiration from one of the command there you could ust do

$ esearch -db nuccore -query "Sclerotinia sclerotiorum 1980 [TITLE]"  | efilter -molecule mrna | efetch -format fasta > s_sclerot.fa

to get all of them at one time.

ADD COMMENT
0
Entering edit mode

@genomax Thank you, but my interest is to get the fasta for these, the actual gene records: https://www.ncbi.nlm.nih.gov/gene/?term=SS1G_01676
OR https://www.ncbi.nlm.nih.gov/nuccore/NW_001820834.1?report=fasta&from=1555069&to=1556099&strand=true

Is it possible?

ADD REPLY
1
Entering edit mode

You can use the -format gene_fasta option of efetch to get the FASTA sequence of the genes annotated on a genomic RefSeq as shown below:

esearch -db nuccore -q 'SS1G_01676[gene]' | efilter -source refseq -molecule genomic | efetch -format gene_fasta | grep -A1 'SS1G_01676'

One issue that you may notice is that we are downloading the sequences of every gene annotated in FASTA format. While this may not be such a big deal for a handful of gene queries, this can become a performance issue with many many queries. In a situation like that, you can use the -format ft of efetch to first get the feature table; extract the coordinates for the gene of your interest and use bash scripting with efetch -seq_start ### -seq_stop ### -format fasta to finally get the sequence of just the gene of your interest.

ADD REPLY
1
Entering edit mode

Wonder if @MAPK wants is this kind of header

>NW_001820834.1:c1556099-1555069 Sclerotinia sclerotiorum 1980 UF-70 scaffold_2 genomic scaffold, whole genome shotgun sequence

instead of

>XM_001595570.1 Sclerotinia sclerotiorum 1980 UF-70 hypothetical protein partial mRNA

since the sequence should be identical.

Slightly off-topic: You clearly have deep knowledge about eUtils! Do you work at/for NCBI?

ADD REPLY
0
Entering edit mode

@vkkodali Thank you, this is what I wanted.

ADD REPLY
1
Entering edit mode
$ esearch -db nuccore -query "Sclerotinia sclerotiorum 1980 [TITLE]"  | efilter -feature gene | efetch -format gene_fasta

This should get you the gene record multi-fasta.

ADD REPLY
6
Entering edit mode
5.5 years ago
vkkodali_ncbi ★ 3.7k

If you are interested in just the FASTA sequences of mRNAs, you can use elink as follows:

esearch -db gene -query 'SS1G_01676' | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta

If you happen to know the unique gene ID (for example, it is 5493342 for SS1G_01676), you can skip the esearch command entirely and go directly to the elink step as follows:

elink -db gene -target nuccore -name gene_nuccore_refseqrna  -id 5493342 | efetch -format fasta

Another tool that you may find useful is efilter which you can use to filter results based on source (refseq, genbank, etc) and/or molecule type (genomic, mrna, etc) as follows:

esearch -db nuccore -query "SS1G_01676 [GENE]" | efilter -molecule mrna | efetch -format fasta
ADD COMMENT

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6