Question

Perl Script To Fetch Mammals 3'Utrs From Ensembl

2

Entering edit mode

12.0 years ago

dannyjmh ▴ 20

Hi all. Does anyone know how to fetch all the 3'UTR sequences of the Ensembl EPO mammals using the Ensembl Perl API? Thanx in advance.

ensembl perl api • 5.3k views

ADD COMMENT • link updated 8.3 years ago by Emily 23k • written 12.0 years ago by dannyjmh ▴ 20

score 11 · Answer 1 · 2012-05-07

11

Entering edit mode

12.0 years ago

Steve Moss 2.3k

Here's something I coded that should do the job :)

#!/usr/bin/env perl
# Perl script to retrieve the 3'-UTR sequence for 12 eutherian mammals
# Coded by Steve Moss (gawbul [at] gmail [dot] com)
# http://stevemoss.ath.cx/

# make things easier
use strict;
use warnings;

#import EnsEMBL and BioPerl modules
use Bio::EnsEMBL::Registry;
use Bio::SeqIO;
use Data::Dumper;

# setup array of all epo mammals
my @epo_mammals = ("homo_sapiens", "gorilla_gorilla", "pan_troglodytes", "pongo_abelii", "macaca_mulatta", "callithrix_jacchus", "mus_musculus", "rattus_norvegicus", "equus_caballus", "canis_familiaris", "sus_scrofa", "bos_taurus");

# connect to EnsEMBL
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(    -host => 'ensembldb.ensembl.org',
                                    -user => 'anonymous');

# use Bio::SeqIO for write sequence to STDOUT
my $outseq = Bio::SeqIO->new(    -fh => \*STDOUT,
                            -format => 'FASTA');

# loop over the mammals
foreach my $mammal (@epo_mammals) {
    # get gene adaptor
    my $gene_adaptor = $registry->get_adaptor($mammal, 'Core', 'Gene');

    # fetch all genes
    my @gene_ids = @{$gene_adaptor->list_stable_ids()};

    # traverse through genes
    foreach my $gene_id (@gene_ids) {
        # get gene
        my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);

        # get canonical transcript
        my $transcript = $gene->canonical_transcript();

        # check gene has a transcript associated
        unless (defined $transcript) {
            next;
        }

        # get 3'-UTR - returns a Bio::Seq object
        my $tputr = $transcript->three_prime_utr();

        # check transcript has a 3'-UTR annotated
        unless (defined $tputr) {
            next;
        }

        # print to STDOUT
        print $outseq->write_seq($tputr) . "\n";
    }
}

This currently prints the sequence in FASTA format to STDOUT, but you could replace -fh => \STDOUT,* with -file => ">filename", to write to a file instead.

Also, the default Bio::Seq object returned by the transcript object's three_prime_utr method, contains only the sequence and the associated transcript stable_id as the display_id. I would probably create a custom Seq object with gene stable_id, transcript stable_id, and the start and end coordinates of the 3'-UTR. As the three_prime_utr method only returns the Bio::Seq object, it is necessary to determine the start and end of the 3'-UTR based on the start of the transcript and the start of the coding region respectively (or the end of both, if it is on the alternate strand).

ADD COMMENT • link 12.0 years ago by Steve Moss 2.3k

0

Entering edit mode

Hello Steve! Thanks so much for the help, the code works just fine. I also need a list of the transcripts for which there are no annotated UTRs. What would be the modification to the code? I'm not familiar with OO Perl at all. Again than you so much.

ADD REPLY • link 12.0 years ago by dannyjmh ▴ 20

0

Entering edit mode

In the part that says "unless (defined $tputr) {", you could add "print $transcript->stableid();" above the "next;" statement? You might want to add "my $fputr = $transcript->fiveprime_utr();" below the "my $tputr..." bit and check to see if both $tputr and $fputr are defined? It depends if you want to print the transcript ID if both, or one or the other of the UTRs aren't annotated. Bare in mind that it is likely to be annotation, rather than biologically real lack of UTRs.

ADD REPLY • link 12.0 years ago by Steve Moss 2.3k

0

Entering edit mode

It would be useful to get a basic grasp of Perl though, if you want to get to grips with the API better. You can find information on the individual parts of the (Core) API here http://www.ensembl.org/info/docs/Doxygen/core-api/index.html.

ADD REPLY • link 12.0 years ago by Steve Moss 2.3k

0

Entering edit mode

Hey again, Steve! Thanks for all the help, my friend. Yeah, I'll start on object-oriented Perl. I can do some Perl but only for text manipulation, not CGI or DBI. Thanks again.

ADD REPLY • link 12.0 years ago by dannyjmh ▴ 20

0

Entering edit mode

No problem :) Any of the O'Reilly Perl books are great for getting to grips with things - http://shop.oreilly.com/category/browse-subjects/programming/perl.do and http://search.oreilly.com/?q=bioinformatics&x=0&y=0

ADD REPLY • link 12.0 years ago by Steve Moss 2.3k

0

Entering edit mode

I'm gonna get some of them no matter what. Two more questions(the last ones i promise): will the script get all of the transcripts reported for each gene? And how to print the resulting fasta file like:

EnsemblTranscriptID|EnsemblGeneID Sequence ? Million thanks.

ADD REPLY • link 11.9 years ago by dannyjmh ▴ 20

0

Entering edit mode

At the moment this is only getting the canonical transcript that is annotated for each gene. In order to get all transcripts you would need to replace my $transcript = $gene->canonicaltranscript(); with my @transcripts = @{$gene->getall_Transcripts()}; and then iterate over this using foreach my $transcript (@transcripts) { ... } where ... Is the code for retrieving the UTRs.

ADD REPLY • link 11.9 years ago by Steve Moss 2.3k

0

Entering edit mode

I think you can just add $tputr->desc = $transcript->stableid . "|" . $gene->stableid; and it should add those to the fasta description line (not currently at the 'puter to check that) :-)

ADD REPLY • link 11.9 years ago by Steve Moss 2.3k

0

Entering edit mode

Done, Steve. The latter gives the error "Can't modify non-lvalue subroutine call". But I modified the script to print in another file EnsemblTranscriptID|EnsemblGeneID lines (print AMPL ${geneid} . "|" . $transcript->stableid() . "\n";). And added it to each transcript with another simple script. Zillion thanks one more time. Next step for me: Obj-oriented Perl.

ADD REPLY • link 11.9 years ago by dannyjmh ▴ 20

score 4 · Answer 2 · 2012-05-07

4

Entering edit mode

12.0 years ago

Giovanni M Dall'Olio 28k

Did you look at the Ensembl API Tutorial? There is an example just about retrieving 3' UTRs: http://www.ensembl.org/info/docs/api/core/core_tutorial.html#genes

You can just copy & paste the example in the tutorial and change the name of the species. Note that the example fetches all the UTR in the genes in a chromosomal region: if you want to specify a single gene, you can just use a gene adaptor instead of a slice adaptor.

If you have any specific problem in using the APIs, you can get better answers by editing your question and specifying what is causing you trouble.

ADD COMMENT • link 12.0 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Hey, Giovanni! Thanks for the tip.

ADD REPLY • link 12.0 years ago by dannyjmh ▴ 20

score 1 · Answer 3 · 2012-05-10

1

Entering edit mode

11.9 years ago

Guangchuang Yu ★ 2.6k

I have a script in R to fetch 3UTR via biomaRt: http://ygc.name/2011/03/02/easiest-utr-sequence/

ADD COMMENT • link 11.9 years ago by Guangchuang Yu ★ 2.6k