Get all bacterial protein sequences from Genbank but non redundant
1
0
Entering edit mode
6.9 years ago
mschmid ▴ 180

I would like to download all bacterial proteins from Genbank, but non redundant. Meaning that the entries which are exactly (!) identical get removed.

What is the most efficient way to do this?

I could use taxonomy browser and then get GIs of all bacterial products. But then it is redundant. The other problem is that I can not download the file with GIs since it seems to be too big. If I retrieve NCBI protein clusters I get entries removed which are not exactly the same, right?

Is it possible to get the faa sequences from all bacterial sequences in the nr blast db as a fasta via blastdb_aliastool?

Or do I have to use eutils?

ncbi bacteria protein • 1.7k views
ADD COMMENT
1
Entering edit mode
6.9 years ago
mschmid ▴ 180

I used this script. There will be some redundancy. But you can remove this later. Also, the reply from NCBI's server can have some error messages containted in the output. Filter for that!

#!/usr/bin/perl -w

use strict;
use warnings;
use LWP::Simple;
use warnings FATAL => 'all';

my $utils = "https://www.ncbi.nlm.nih.gov/entrez/eutils";
my $db = ask_user("Database", "nuccore|nucest|protein|pubmed");
my $query = ask_user("Query", "Entrez query");
my $report = ask_user("Report", "fasta|genbank|abstract|acc");
my $esearch = "$utils/esearch.fcgi?" . "db=$db&usehistory=y&term=";
my $esearch_result = get($esearch . $query);
$esearch_result =~ m|.*<Count>(\d+).*<RetMax>(\d+).*<RetStart>(\d+).*<QueryKey>(\d+).*<WebEnv>(\S+)</WebEnv>.*|s;
print STDERR $esearch_result . "\n";
my $Count = $1;
my $retmax=1000;
my $retstart=$3;
my $QueryKey = $4;
my $WebEnv = $5;
my $efetch_result;
my $copy;
my $countSeqs;
my $expected;
print STDERR "Count = $Count; Retmax = $retmax; Retstart = $retstart; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

while ($retstart<$Count)
{
  my $efetch = "$utils/efetch.fcgi?" . "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&" . "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
  print STDERR "RETRIEVING $retmax results starting at result $retstart\n";
  my $myresult = eval {
    $efetch_result = get($efetch);
    $copy=$efetch_result;
    $countSeqs=0;
    if ($report eq 'fasta') { $countSeqs= $copy =~ tr/\>//; }
    elsif ($report eq 'genbank') { $countSeqs = $copy =~ tr/\/\///; }
    elsif ($report eq 'acc') { $countSeqs = $copy =~ tr/\n//; }
    $expected=$retmax;
    if ($retstart>$Count-$retmax) { $expected=$Count-$retstart; }
  };
  if($@)
  {
    print STDERR "PERL EVAL ERROR...TRYING AGAIN:" . "\n" . $@ . "\n";
  }
  else
  {
    if ($countSeqs>=($expected-1000)) { print "$efetch_result"; $retstart+=$retmax; }
    else { print STDERR "ERROR...TRYING AGAIN ($countSeqs / $expected)\n"; }
  }
}

sub ask_user { print STDERR "$_[0] [$_[1]]: ";

my $rc = <>;
chomp $rc;
if($rc eq "")
{ die "Error: Empty field: $_[0]\n"; }
  return $rc;
}
ADD COMMENT

Login before adding your answer.

Traffic: 2437 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6