How To Retrieve A Protein Sequence Given An Ensembl Gene Id Using Perl
7.4 years ago
@Saad Murtaza Khan1607

Hi, I have a list of Ensembl gene IDs and I need to get their corresponding protein sequences using Perl. Kindly suggest how to achieve this using the Ensembl API.

perl ensembl homework • 7.3k views

answered largely here: http://biostar.stackexchange.com/questions/2080/help-with-downloading-sequences-from-ensembl


have you tried anything so far? Is there a specific thing you are stuck with?


I suggest reading the documentation and trying the examples. Then ask again if you have difficulty.


I suggest reading the documentation and trying the examples ;-)


I can say from my own experience with Ensembl that this task isn't as easy as it is normally perceived to be. A gene ID can be linked to many transcripts, and each one can be linked to a protein ID. However, many protein IDs represent exactly the same protein, differing only at the transcript level. Many genes also don't have the is_canonical attribute set to a value. So I suggest specifying your question more precisely. Otherwise, go to BioMart, paste your gene ID list as a filter, and download the data as a CSV file.
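
The gene-to-transcript-to-protein walk described above can be sketched as follows. This is a minimal sketch, not a definitive implementation. It assumes the Ensembl Perl API (Bio::EnsEMBL::Registry) is installed and that the public server ensembldb.ensembl.org accepts anonymous connections. The helper names unique_peptides and peptide_records are hypothetical, and 'Human' is only an example species.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first record for each distinct peptide sequence,
# since several transcripts of one gene can encode the same protein.
sub unique_peptides {
    my %seen;
    return grep { !$seen{ $_->{seq} }++ } @_;
}

# Hypothetical helper: one record per coding transcript of a Gene object.
sub peptide_records {
    my ($gene) = @_;
    my @records;
    for my $transcript ( @{ $gene->get_all_Transcripts() } ) {
        my $translation = $transcript->translation() or next;    # skip non-coding
        push @records, {
            protein_id    => $translation->stable_id(),
            transcript_id => $transcript->stable_id(),
            seq           => $transcript->translate()->seq(),
        };
    }
    return @records;
}

# Gene stable IDs are taken from the command line.
if (@ARGV) {
    require Bio::EnsEMBL::Registry;
    my $registry = 'Bio::EnsEMBL::Registry';

    # Assumption: anonymous access to the public Ensembl database server.
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

    for my $gene_id (@ARGV) {
        my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
        unless ($gene) {
            warn "No gene found for $gene_id\n";
            next;
        }
        # Print each distinct protein in FASTA format.
        for my $record ( unique_peptides( peptide_records($gene) ) ) {
            print ">$record->{protein_id} gene=$gene_id transcript=$record->{transcript_id}\n";
            print "$record->{seq}\n";
        }
    }
}
```

It could be run as, for example, perl gene2protein.pl ENSG00000139618 (the script name is hypothetical). If a single protein per gene is wanted, $gene->canonical_transcript() could be used in place of the loop for genes where a canonical transcript is defined.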

9.8 years ago
Thaman ♦ 3.2k
@Thaman313

It's clearly stated in the Ensembl core API tutorial that you can get the protein sequence from a Transcript object.

Translation objects and protein sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudo-genes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object, not a Translation object as might be expected. The following example obtains the protein sequence of a Transcript and the Translation's stable identifier:

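use strict;
use Bio::EnsEMBL::Registry;

# Registry setup, which the snippet needs before $registry can be used.
# Assumption: anonymous access to the public Ensembl database server.
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' );

my $stable_id = 'ENST00000044768';

my $transcript_adaptor =
  $registry->get_adaptor( 'Human', 'Core', 'Transcript' );
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);

print $transcript->translation()->stable_id(), "\n";
print $transcript->translate()->seq(),         "\n";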

Is it that hard to go through the documentation?


In addition, all objects in the Ensembl core API and their methods can be found here: http://www.ensembl.org/info/docs/Pdoc/ensembl/index.html.


Thanks for the additional mentoring!

9.8 years ago
@Giulietta - Ensembl Helpdesk966

This is in the Ensembl documentation, as has been pointed out. You say you need to go through the Perl API, but this would actually be easier in BioMart. If that's an option for you, watch this tutorial video.

There is a BioMart web interface you can use. The Filters would be your gene IDs, and under Attributes you would choose the Sequences page and select protein sequences.
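
For anyone who still wants to stay in Perl, the same filter-and-attribute selection can be sent to BioMart as an XML query. The following is a rough sketch under stated assumptions, not the method from the tutorial video. The martservice URL, the dataset name hsapiens_gene_ensembl, the filter ensembl_gene_id, and the attributes peptide and ensembl_peptide_id are all assumptions about the current BioMart configuration and should be checked against the XML that the web interface generates. The function name biomart_query is hypothetical.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Assumption: the Ensembl BioMart service endpoint.
my $mart_url = 'https://www.ensembl.org/biomart/martservice';

# Hypothetical helper: build a BioMart XML query that filters on gene IDs
# and asks for peptide sequences in FASTA format.
sub biomart_query {
    my (@gene_ids) = @_;
    my $ids = join ',', @gene_ids;
    return <<"XML";
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="FASTA" header="0" uniqueRows="1" datasetConfigVersion="0.6">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Filter name="ensembl_gene_id" value="$ids"/>
    <Attribute name="peptide"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="ensembl_peptide_id"/>
  </Dataset>
</Query>
XML
}

# Gene stable IDs are taken from the command line.
if (@ARGV) {
    my $response =
      HTTP::Tiny->new->post_form( $mart_url, { query => biomart_query(@ARGV) } );
    die "BioMart request failed: $response->{status} $response->{reason}\n"
      unless $response->{success};
    print $response->{content};
}
```

HTTPS requests with HTTP::Tiny additionally need IO::Socket::SSL to be installed.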

BioMart works well! Additionally, you can use an R package called biomaRt to achieve this!

9.8 years ago
@Panagiotis Alexiou997

I believe you could use the Perl API that Ensembl provides, which can be found on their site. It allows Perl programs to access their databases.

If your question is more specific it would be nice to know.

7.4 years ago
@Kanhu charan Moharana8816

Here is a Perl script using the LWP::Simple module to retrieve any kind of sequence linked to an Ensembl transcript ID. It worked for me, and I hope others can use it with simple modifications.

Usage: perl SCRIPT_NAME.pl FILE_CONTAINING_ENSEMBL_ID


You can edit the script to fetch specific annotations, such as CDS, cDNA, peptide, exon, or intron sequences.

+++++++++++++++++++++PERL CODE+++++++++++++++++++++

### Script to retrieve Ensembl sequences using Ensembl transcript IDs

use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple;
use HTTP::Cookies;


my $input_file=shift|| die "Insufficient parameters!\n Usage: perl $0  <FILE_CONTAINING_ENSEMBL_IDS>\n File must have one ID per line.\n";

## Read all IDs (one per line) from the input file
open(my $in, '<', $input_file) or die "$! $input_file\n";
my @inputs=<$in>;
close $in;
print STDERR "You have entered ".scalar(@inputs)." IDs\n\n";
my $ensmbl_ids=join "",@inputs;

## Turn the newline-separated list into a single tab-separated query string
$ensmbl_ids=~s/\n/\t/g;
#print "$ensmbl_ids\n";

my $flank3_display=0;            ## 3' (downstream) flanking sequence
my $flank5_display=0;            ## 5' (upstream) flanking sequence
my $strand='strand';                ## 1 (forward) or -1 (reverse)
my $output='fasta';                ## output format; alternatives: bed, csv, tab, gtf, gff, gff3, embl, genbank

my $fasta_genomic='off';        # alternatives: unmasked, soft_masked, hard_masked, 5_flanking, 3_flanking, 5_3_flanking

########################EDIT TYPE OF SEQUENCE TO FETCH######################################
#use 0 to turn off and 1 to turn on; default all 'ON'
my $cdna='1';
my $coding='1';
my $peptide='1';
my $utr5='1';
my $utr3='1';
my $exon='1';
my $intron='1';
#############################################################################################




#===================UNIPROT BOT=====================source: UniProt site
my $base = 'http://www.uniprot.org';
my $tool = 'mapping';

my $params = {
  to => 'ACC',
  from => 'ENSEMBL_TRS_ID',                    
  format => 'tab',
  query =>  $ensmbl_ids,
};

my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/", $params);

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}


my %ensemle_id_acc_id;
if ( $response->is_success ) {
    ## Each result line holds an Ensembl ID and a UniProt accession; skip the 'From' header row
    my @l = split( /\n/, $response->content );
    foreach my $l (@l) {
        my ( $k, $v ) = split( /\s+/, $l );
        $ensemle_id_acc_id{$k} = $v if $k ne 'From';
    }
}
else {
    die 'Failed, got ' . $response->status_line . ' for ' . $response->request->uri . "\n";
}


foreach(sort keys %ensemle_id_acc_id)
{
    print "#Ensembl_ID=$_\tUniprot_ACC_ID: $ensemle_id_acc_id{$_}\n";
    my $uniprot_url='http://www.uniprot.org/uniprot/'.$ensemle_id_acc_id{$_}.'.txt';
    my $content_uniprot = get($uniprot_url) // '';    ## get returns undef on failure

    ## Match the OS (organism species) line only, not 'os' inside other lines
    if($content_uniprot=~ m/^OS\s+(\S+)\s+(\S+)/m) {

        ## Build the Ensembl species path (genus_species) from the UniProt organism name
        my $org_code=lc($1)."_".lc($2);
        print "#Uniprot Organism: $1 $2\n";

        ## Construct the Ensembl export URL
        my $ensembl_url='http://www.ensembl.org/'.$org_code.'/Export/Output/Transcript?db=core;'.'flank3_display='.$flank3_display.';flank5_display='.$flank5_display.';output='.$output.';strand='.$strand.';t='.$_.';';

        $ensembl_url.="param=cdna;" if $cdna;
        $ensembl_url.="param=coding;" if $coding;
        $ensembl_url.="param=peptide;" if $peptide;
        $ensembl_url.="param=utr5;" if $utr5;
        $ensembl_url.="param=utr3;" if $utr3;
        $ensembl_url.="param=exon;" if $exon;
        $ensembl_url.="param=intron;" if $intron;

        $ensembl_url.='genomic=off;_format=Text';
        #print "$ensembl_url\n";
        my $content_ensembl_seq = get $ensembl_url;
        if(defined $content_ensembl_seq) {
            print "$content_ensembl_seq\n";
        }
        else {
            print STDERR "Could not retrieve sequences for $_\n";
        }
    }
    else {
        print "!!!ORG CODE ERROR!!! : $_\n";
    }
    print "//\n";
}
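
The script above depends on the UniProt mapping service and the Ensembl website export URLs as they were when it was posted, and the thread does not confirm that they still respond the same way. As a hedged alternative for the original question (protein sequences for gene IDs), here is a minimal sketch that assumes the public Ensembl REST server at rest.ensembl.org and its sequence/id endpoint with the type and multiple_sequences parameters. The function name protein_url is hypothetical.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Assumption: the public Ensembl REST server.
my $server = 'https://rest.ensembl.org';

# Hypothetical helper: URL for all protein sequences linked to one stable ID.
# multiple_sequences=1 is needed because a gene can have several proteins.
sub protein_url {
    my ($id) = @_;
    return "$server/sequence/id/$id?type=protein;multiple_sequences=1";
}

# IDs are read, one per line, from the file(s) named on the command line.
if (@ARGV) {
    my $http = HTTP::Tiny->new();
    while ( my $id = <> ) {
        $id =~ s/^\s+|\s+$//g;    # trim whitespace and line endings
        next unless length $id;

        # Ask for FASTA output via the Content-Type header.
        my $response = $http->get( protein_url($id),
            { headers => { 'Content-Type' => 'text/x-fasta' } } );

        if ( $response->{success} ) {
            print $response->{content};
        }
        else {
            warn "Failed for $id: $response->{status} $response->{reason}\n";
        }
    }
}
```

Usage would mirror the script above: perl SCRIPT_NAME.pl FILE_CONTAINING_ENSEMBL_ID. HTTPS requests with HTTP::Tiny additionally need IO::Socket::SSL to be installed.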