How to download large protein data from NCBI?
5.8 years ago
enriqp02 ▴ 30

Hello everybody!

I'm working with metaproteomics samples and want to run several search engines against them to identify as many proteins as possible. For this I need a good protein database. I tried to download one from the NCBI website, but the dataset I want (all bacterial proteins) is too large to download via the web. There is also no pre-built set available for download via FTP.

I found this Perl script on the internet, but at the rate it runs the download would take months.

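#!/usr/bin/perl
# Based on www.ncbi.nlm.nih.gov/entrez/query/static/eutils_example.pl
# Usage: perl ncbi_fetch.pl > output_file

use strict;
use warnings;
use LWP::Simple;    # https URLs also need LWP::Protocol::https installed
use URI::Escape;    # uri_escape() for the query string

my $utils  = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";
my $db     = ask_user("Database", "nuccore|nucest|protein|pubmed");
my $query  = ask_user("Query", "Entrez query");
my $report = ask_user("Report", "fasta|genbank|abstract|acc");

# Search once and keep the result set on the NCBI history server.
my $esearch = "$utils/esearch.fcgi?db=$db&usehistory=y&term=";
my $esearch_result = get($esearch . uri_escape($query));
die "Error: esearch request failed\n" unless defined $esearch_result;
$esearch_result =~ m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s
    or die "Error: could not parse esearch output\n";
my ($Count, $QueryKey, $WebEnv) = ($1, $2, $3);
print STDERR "Count = $Count; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

my $retstart  = 0;
my $retmax    = 10000;   # EFetch is documented to return at most 10,000 records per request
my $max_tries = 5;       # give up on a batch after this many failed attempts
my $tries     = 0;
while ($retstart < $Count) {
    my $efetch = "$utils/efetch.fcgi?"
        . "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&"
        . "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
    print STDERR "Downloading database $retstart / $Count\n";

    # The last batch may hold fewer than $retmax records.
    my $expected = $retmax;
    if ($retstart > $Count - $retmax) {
        $expected = $Count - $retstart;
    }

    # Count the records received so that truncated batches can be retried.
    my $efetch_result = get($efetch);
    my $countSeqs = 0;
    if (!defined $efetch_result) {
        $efetch_result = "";                                # request failed
    }
    elsif ($report eq 'fasta') {
        $countSeqs = () = $efetch_result =~ /^>/mg;         # one header per sequence
    }
    elsif ($report eq 'genbank') {
        $countSeqs = () = $efetch_result =~ m{^//\s*$}mg;   # one "//" line per record
    }
    elsif ($report eq 'acc') {
        $countSeqs = ($efetch_result =~ tr/\n//);           # one accession per line
    }
    else {
        $countSeqs = $expected;                             # no counter for this report type
    }

    if ($countSeqs >= $expected) {
        print $efetch_result;
        $retstart += $retmax;
        $tries = 0;
    }
    elsif (++$tries >= $max_tries) {
        die "Error: batch at $retstart still incomplete after $max_tries attempts\n";
    }
    else {
        print STDERR "ERROR...TRYING AGAIN ($countSeqs / $expected)\n";
        sleep 5;
    }
}

sub ask_user {
    my ($label, $choices) = @_;
    print STDERR "$label [$choices]: ";
    my $rc = <STDIN>;
    die "Error: Empty field: $label\n" unless defined $rc;
    chomp $rc;
    die "Error: Empty field: $label\n" if $rc eq "";
    return $rc;
}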

Do you know another way to do this?

Thanks in advance

protein NCBI • 2.7k views

This Python script that I wrote downloads the FASTA sequence of all proteins matching a keyword, across all species. It is configurable, though. See if you can avail of it: A: How to download all sequences of a list of proteins for a particular organism

Edit: You are looking for the actual amino acid sequence, I presume?

Thank you! I used your script and it works perfectly. One question, though: if I search NCBI for "human", I get many results that I'm not interested in.

In this case: Animals(1,419,740) Plants(4,494) Fungi(898,540) Protists(203,856) Bacteria(84,639,903) Archaea(6,043) Viruses(1,749,114)

I would like to search with "Homo sapiens"[Organism] instead, but then your script doesn't work. Is there a solution for this?

Thanks again

I also noticed that only 20 sequences are downloaded for human :S.

Any solution to this problem?

Yes, for human data, just replace this line:

LookupCommand = "refseq[FILTER] AND " + SearchTerm + "[TITL]"

...with this:

LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]"

txid9606 refers to NCBI Taxonomy ID 9606, which is Homo sapiens.

Thanks for your reply.

I'm still having the same problem. Only 20 sequences are downloaded, just as they appear on the website.
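
A likely explanation, assuming the script calls ESearch without setting retmax, is that ESearch returns only 20 IDs per request by default, which matches the page size on the website. Below is a minimal Python sketch of paging through a larger result set. The helper names batch_ranges and esearch_url, the example term with "kinase", and the total of 45 hits are all made up for illustration; none of this is taken from the script discussed above.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def batch_ranges(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The final batch may be smaller than batch_size.
        yield start, min(batch_size, total - start)


def esearch_url(term, retstart, retmax, db="protein"):
    """Build an ESearch URL that asks for one explicit page of IDs."""
    params = {"db": db, "term": term, "retstart": retstart, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical search: 45 hits fetched 20 at a time (no network access here).
    term = "refseq[FILTER] AND txid9606[Organism] AND kinase[TITL]"
    for start, size in batch_ranges(45, 20):
        print(esearch_url(term, start, size))
```

The point of the sketch is only that retstart and retmax have to be set explicitly and advanced in a loop; the real total would come from the Count field of the first ESearch response.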

5.8 years ago
GenoMax 141k

You can use NCBI's Entrez Direct (EDirect) command-line utilities, which you will need to download and install.

 esearch -db protein -query "2[taxid] AND refseq [filter]" | efetch -format acc

(Change acc to fasta if you want the actual sequences.) We are using 2 as the taxID, which is bacteria (refine as needed), along with refseq as a filter to get curated proteins only. If you replace 2 with 9606 you will get the same information for humans.

As for time, there is not much you can do: it depends mostly on your internet speed (NCBI has plenty of bandwidth on their end). You could instead retrieve the proteins from the BLAST nr indexes if you already have them. Those are also a large download, though, so if you don't have the indexes it may take just as long or longer.
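
If the bacterial set is too big to fetch in one go, one option is to refine the query and run the same pipeline once per taxon. Here is a rough Python sketch that only builds the command strings and does not run them. It assumes the txid...[Organism:exp] form of the taxonomy query, similar to the txid9606 example earlier in the thread, and refseq_protein_term and edirect_pipeline are hypothetical helper names.

```python
import shlex


def refseq_protein_term(taxid):
    """Entrez term for RefSeq proteins from a taxon and its descendants."""
    return f"txid{int(taxid)}[Organism:exp] AND refseq[filter]"


def edirect_pipeline(taxid, fmt="fasta"):
    """Return an esearch | efetch command line as a string (not executed here)."""
    term = shlex.quote(refseq_protein_term(taxid))
    return f"esearch -db protein -query {term} | efetch -format {fmt}"


if __name__ == "__main__":
    # 2 = bacteria, 9606 = human, as used in this thread.
    for taxid in (2, 9606):
        print(edirect_pipeline(taxid, fmt="acc"))
```

Each printed line can be pasted into a shell where EDirect is installed, with the output redirected to one file per taxon so that an interrupted run only loses one part.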

ProteomicaVI@DESKTOP-GTTSG80:~/Base_de_datos_Lola$  esearch -db Protein -query "9606[taxid] AND refseq [filter]" | efetch -format fasta
501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=9606%5Btaxid%5D%20AND%20refseq%20%5Bfilter%5D&retmax=0&usehistory=y&edirect_os=linux&edirect=9.20&tool=edirect&email=ProteomicaVI@DESKTOP-GTTSG80.localdomain'
Result of do_post http request is
$VAR1 = bless( {
                 '_rc' => 501,
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_headers' => bless( {
                                        'content-type' => 'text/plain',
                                        'client-warning' => 'Internal response',
                                        '::std_case' => {
                                                          'client-warning' => 'Client-Warning',
                                                          'client-date' => 'Client-Date'
                                                        },
                                        'client-date' => 'Thu, 28 Jun 2018 12:07:46 GMT'
                                      }, 'HTTP::Headers' ),
                 '_request' => bless( {
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi')}, 'URI::https' ),
                                        '_content' => 'db=protein&term=9606%5Btaxid%5D%20AND%20refseq%20%5Bfilter%5D&retmax=0&usehistory=y&edirect_os=linux&edirect=9.20&tool=edirect&email=ProteomicaVI@DESKTOP-GTTSG80.localdomain',
                                        '_method' => 'POST',
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.34'
                                                             }, 'HTTP::Headers' )
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );

WebEnv value not found in search output - WebEnv1
Db value not found in fetch input

I followed your steps but got this error.

You need to install LWP::Protocol::https. If you have permission to install software on this machine, follow the directions here, using the ones appropriate for your OS.

I am not sure if this tool would exclude the protein data but maybe you can try https://github.com/kblin/ncbi-genome-download

5.8 years ago
tdmurphy ▴ 190

NCBI RefSeq includes nearly all bacterial proteins, and the files are available for download at: https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

Thanks a lot.

Do you know which file it is? There are many protein.faa files.
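
As far as I know, the release directory splits the data into numbered chunks, so the full protein set would be all of the protein FASTA files together rather than any single one. A minimal Python sketch for picking those files out of a directory listing follows. The file names in the example are illustrative guesses at the naming pattern, not a verified listing, and protein_fasta_files is a hypothetical helper.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/"


def protein_fasta_files(names, suffix=".protein.faa.gz"):
    """Return the names ending in `suffix`, ordered by their numeric chunk index."""
    pattern = re.compile(r"\.(\d+)" + re.escape(suffix) + r"$")

    def chunk_index(name):
        match = pattern.search(name)
        return int(match.group(1)) if match else -1

    selected = [name for name in names if name.endswith(suffix)]
    return sorted(selected, key=lambda name: (chunk_index(name), name))


if __name__ == "__main__":
    # Assumed example listing; the real directory contents may be named differently.
    listing = [
        "bacteria.nonredundant_protein.10.protein.faa.gz",
        "bacteria.nonredundant_protein.2.protein.faa.gz",
        "bacteria.nonredundant_protein.1.protein.gpff.gz",
        "bacteria.1.1.genomic.fna.gz",
        "bacteria.nonredundant_protein.1.protein.faa.gz",
    ]
    for name in protein_fasta_files(listing):
        print(urljoin(BASE_URL, name))
```

The printed URLs can then be handed to wget or curl.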
