Question

How to download large protein data from NCBI?

5.8 years ago

enriqp02 ▴ 30

Hello everybody!

I'm working with metaproteomics samples and I want to use different search engines to identify all the proteins. To do this, I need a good protein database. I tried to download one from the NCBI website, but the dataset I want (all bacterial proteins) is too large to download via the web. There is also no pre-built set available for download via FTP.

I found this Perl script on the internet, but at the speed it runs the download would take months.

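#!/usr/bin/perl
# Based on www.ncbi.nlm.nih.gov/entrez/query/static/eutils_example.pl
# Usage: perl ncbi_fetch.pl > output_file

use strict;
use warnings;
use LWP::Simple;                 # https URLs also need LWP::Protocol::https
use URI::Escape qw(uri_escape);

my $utils  = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";
my $db     = ask_user("Database", "nuccore|nucest|protein|pubmed");
my $query  = ask_user("Query", "Entrez query");
my $report = ask_user("Report", "fasta|genbank|abstract|acc");

# Search once and keep the result set on the NCBI history server.
my $esearch = "$utils/esearch.fcgi?db=$db&usehistory=y&term=" . uri_escape($query);
my $esearch_result = get($esearch);
die "Error: esearch request failed\n" unless defined $esearch_result;
$esearch_result =~ m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s
    or die "Error: could not parse esearch output\n";
my ($Count, $QueryKey, $WebEnv) = ($1, $2, $3);
print STDERR "Count = $Count; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

my $retstart  = 0;
my $retmax    = 10000;   # EFetch documents a maximum of 10,000 records per request
my $max_tries = 5;       # give up on a batch instead of retrying forever
my $tries     = 0;

while ($retstart < $Count) {
    my $efetch = "$utils/efetch.fcgi?"
        . "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&"
        . "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
    print STDERR "Downloading database $retstart / $Count\n";

    my $efetch_result = get($efetch);   # undef if the request failed

    # Count the records in this batch so that truncated downloads are detected.
    # Reports without a countable record marker (e.g. abstract) are not checked.
    my $countSeqs;
    if (defined $efetch_result) {
        if    ($report eq 'fasta')   { $countSeqs = () = $efetch_result =~ /^>/mg; }
        elsif ($report eq 'genbank') { $countSeqs = () = $efetch_result =~ m{^//\s*$}mg; }
        elsif ($report eq 'acc')     { $countSeqs = ($efetch_result =~ tr/\n//); }
    }

    # The last batch is usually smaller than $retmax.
    my $expected = $retmax;
    if ($retstart > $Count - $retmax) {
        $expected = $Count - $retstart;
    }

    if (defined $efetch_result && (!defined $countSeqs || $countSeqs >= $expected)) {
        print $efetch_result;
        $retstart += $retmax;
        $tries = 0;
    }
    else {
        $tries++;
        my $got = defined $countSeqs ? $countSeqs : 0;
        die "Error: giving up after $max_tries failed attempts at offset $retstart\n"
            if $tries >= $max_tries;
        print STDERR "ERROR...TRYING AGAIN ($got / $expected)\n";
        sleep 5;   # short pause before retrying the same batch
    }
}

# Prompt on STDERR (so the prompt does not end up in the output file) and read one line.
sub ask_user {
    my ($label, $hint) = @_;
    print STDERR "$label [$hint]: ";
    my $rc = <STDIN>;
    die "Error: Empty field: $label\n" unless defined $rc;
    chomp $rc;
    die "Error: Empty field: $label\n" if $rc eq "";
    return $rc;
}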

Do you know another way to do this?

Thanks in advance

protein NCBI • 2.7k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 5.8 years ago by enriqp02 ▴ 30

This Python script that I wrote downloads the FASTA sequence of all proteins matching a keyword, across all species. It is configurable, though. See if you can avail of it: A: How to download all sequences of a list of proteins for a particular organism

Edit: You are looking for the actual amino acid sequence, I presume?

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

Thank you! I used your script and it works perfectly. One question, though: if I search NCBI for "human", I get many results from organisms that I'm not interested in.

In this case: Animals(1,419,740) Plants(4,494) Fungi(898,540) Protists(203,856) Bacteria(84,639,903) Archaea(6,043) Viruses(1,749,114)

I would like to use "Homo sapiens"[Organism] instead, but your script doesn't work with that query. Is there any solution for this?

Thanks again

ADD REPLY • link 5.8 years ago by enriqp02 ▴ 30

I also noticed that only 20 sequences are downloaded for human :S.

ADD REPLY • link 5.8 years ago by enriqp02 ▴ 30

Any solution to this problem?

ADD REPLY • link 5.8 years ago by enriqp02 ▴ 30

Yes, for human data, just replace this line:

LookupCommand = "refseq[FILTER] AND " + SearchTerm + "[TITL]"

...with this:

LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]"

txid9606 is the NCBI Taxonomy ID for Homo sapiens.
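If only 20 records come back, one thing worth checking is the ESearch `retmax` parameter, which defaults to 20. Here is a minimal sketch of building the query term and the ESearch URL, assuming the lookup goes through the E-utilities URL API directly. `build_term` and `esearch_url` are hypothetical helper names for illustration, not functions from my script, and the `retmax` value is only an example:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_term(search_term, taxid=None, refseq_only=True):
    """Assemble an Entrez query string in the style of LookupCommand above."""
    parts = []
    if refseq_only:
        parts.append("refseq[FILTER]")
    if taxid is not None:
        # e.g. 9606 for Homo sapiens
        parts.append(f"txid{taxid}[Organism]")
    parts.append(f"{search_term}[TITL]")
    return " AND ".join(parts)


def esearch_url(term, db="protein", retmax=10000):
    """Build an ESearch URL; retmax is set explicitly because the default is 20."""
    params = {"db": db, "term": term, "retmax": retmax, "usehistory": "y"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    term = build_term("kinase", taxid=9606)
    print(term)
    print(esearch_url(term))
```

This only builds the strings and makes no request, so you can compare the printed term with what you would type into the NCBI search box.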

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

Thanks for your reply.

I still have the same problem. It only downloads 20 sequences, the same ones that appear on the first page of the website.

ADD REPLY • link 5.8 years ago by enriqp02 ▴ 30

score 0 · Answer 1 · 2018-06-28

5.8 years ago

GenoMax 141k

You can use NCBI's Entrez Direct (EDirect) command-line utilities, which you will need to download and install.

 esearch -db protein -query "2[taxid] AND refseq [filter]" | efetch -format acc

Change acc to fasta if you want the actual sequences. We are using taxID 2, which is Bacteria (refine as needed), along with the refseq filter to get curated proteins only. If you replace 2 with 9606 you will get the same information for human.

As for time, there is not much you can do. This is going to depend on your internet speed (NCBI has plenty of bandwidth on their end). You could retrieve the proteins from BLAST indexes (nr) if you have those available. But that is still a large download, so if you don't already have those indexes it may take the same time or longer.
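For very large result sets it can help to fetch in fixed-size batches from the history server, so that a failed batch can be retried without starting over. Below is a rough Python sketch of just the batching logic, assuming you already have the `Count`, `QueryKey` and `WebEnv` values from an ESearch call made with `usehistory=y`. The helper names are made up for illustration, and no network request is made here:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def batch_starts(count, batch_size):
    """Return the retstart offsets needed to cover `count` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return range(0, count, batch_size)


def efetch_urls(count, query_key, webenv, db="protein",
                rettype="fasta", batch_size=10000):
    """Build one EFetch URL per batch of a history-server result set."""
    urls = []
    for start in batch_starts(count, batch_size):
        params = {
            "db": db,
            "query_key": query_key,
            "WebEnv": webenv,
            "rettype": rettype,
            "retmode": "text",
            "retstart": start,
            "retmax": batch_size,
        }
        urls.append(f"{EUTILS}/efetch.fcgi?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    # Placeholder values; real ones come from the ESearch response.
    for url in efetch_urls(25, "1", "WEBENV_PLACEHOLDER", batch_size=10):
        print(url)
```

Each URL can then be downloaded and appended to the output file in order, and if one batch fails only that batch needs to be requested again.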

ADD COMMENT • link 5.8 years ago by GenoMax 141k

ProteomicaVI@DESKTOP-GTTSG80:~/Base_de_datos_Lola$  esearch -db Protein -query "9606[taxid] AND refseq [filter]" | efetch -format fasta
501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=9606%5Btaxid%5D%20AND%20refseq%20%5Bfilter%5D&retmax=0&usehistory=y&edirect_os=linux&edirect=9.20&tool=edirect&email=ProteomicaVI@DESKTOP-GTTSG80.localdomain'
Result of do_post http request is
$VAR1 = bless( {
                 '_rc' => 501,
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_headers' => bless( {
                                        'content-type' => 'text/plain',
                                        'client-warning' => 'Internal response',
                                        '::std_case' => {
                                                          'client-warning' => 'Client-Warning',
                                                          'client-date' => 'Client-Date'
                                                        },
                                        'client-date' => 'Thu, 28 Jun 2018 12:07:46 GMT'
                                      }, 'HTTP::Headers' ),
                 '_request' => bless( {
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi')}, 'URI::https' ),
                                        '_content' => 'db=protein&term=9606%5Btaxid%5D%20AND%20refseq%20%5Bfilter%5D&retmax=0&usehistory=y&edirect_os=linux&edirect=9.20&tool=edirect&email=ProteomicaVI@DESKTOP-GTTSG80.localdomain',
                                        '_method' => 'POST',
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.34'
                                                             }, 'HTTP::Headers' )
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );

WebEnv value not found in search output - WebEnv1
Db value not found in fetch input

I followed your steps but I got this error.

ADD REPLY • link 5.8 years ago by enriqp02 ▴ 30

You need to install the Perl module LWP::Protocol::https. If you have permission to install software on this machine, then follow the directions here. Use the ones appropriate for the OS you are using.

ADD REPLY • link 5.8 years ago by GenoMax 141k

I am not sure if this tool would exclude the protein data but maybe you can try https://github.com/kblin/ncbi-genome-download

ADD REPLY • link 5.8 years ago by Sej Modha 5.3k

score 0 · Answer 2 · 2018-06-30

5.8 years ago

tdmurphy ▴ 190

NCBI RefSeq includes nearly all bacterial proteins, and has files available for download at: https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/
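The directory holds many numbered files for several data types. Here is a small Python sketch for picking out only the protein FASTA files from a list of file names. The naming pattern `bacteria.<N>.protein.faa.gz` (and the `bacteria.nonredundant_protein.<N>.protein.faa.gz` variant) is an assumption about how the release directory is laid out, so please check it against the actual listing; `select_protein_fasta` is a hypothetical helper name:

```python
import re

# Assumed naming pattern for protein FASTA chunks in the release directory.
PROTEIN_FAA = re.compile(
    r"^bacteria\.(?:nonredundant_protein\.)?(\d+)\.protein\.faa\.gz$"
)


def select_protein_fasta(filenames):
    """Return the protein FASTA file names, sorted by numeric chunk index."""
    hits = []
    for name in filenames:
        match = PROTEIN_FAA.match(name)
        if match:
            hits.append((int(match.group(1)), name))
    return [name for _, name in sorted(hits)]


if __name__ == "__main__":
    # Example names only; a real list would come from the directory listing.
    listing = [
        "bacteria.10.protein.faa.gz",
        "bacteria.2.protein.faa.gz",
        "bacteria.1.genomic.fna.gz",
        "bacteria.1.protein.gpff.gz",
    ]
    for name in select_protein_fasta(listing):
        print(name)
```

Once downloaded, the selected .faa.gz files can be concatenated into a single file, since concatenated gzip files still form a valid gzip stream.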

ADD COMMENT • link 5.8 years ago by tdmurphy ▴ 190

Thanks a lot.

Do you know which file I need? There are many protein.faa files.

ADD REPLY • link 5.8 years ago by enriqp02 ▴ 30