Question: Perl To Retrieve Sequences From UCSC Genome Browser

Dear all, I want to retrieve a DNA sequence by its chromosome position from the UCSC website (link text); that is, when I click the 'get DNA' button, a new HTML page is produced that contains my target DNA sequence. For a single query I can simply copy the result, but I want to use Perl to automate the procedure for many sequences (no more than 20). I first tried to mine the HTML and use Perl's HTML::Parser or LWP::UserAgent to do the job, but I am not good at this, so I have to ask for your help. Would you please show me how to do that? Thanks very much!

I finally got a solution to this problem in Perl. The following code uses the DAS interface of UCSC to fetch a DNA sequence by its chromosome position:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

# Use DAS at UCSC to fetch a specific sequence by its chromosome position
my $URL_gene = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr2:100000,200000";

my $genefile = get($URL_gene);

# Keep only the non-empty lines that consist solely of sequence characters
my @DNA = grep { /^[acgt]+$/i } split "\n", $genefile;

print @DNA, "\n";
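For reference, the response that get() returns from the DAS server is a small XML document of roughly this shape (structure as in the BioDAS DTD; the sequence content below is a placeholder, not real output), which is what the grep is filtering for:

```xml
<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr2" start="100000" stop="200000" version="1.00">
<DNA length="100001">
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtac
(further sequence lines elided)
</DNA>
</SEQUENCE>
</DASDNA>
```

Only the lines between the DNA tags consist purely of sequence characters, so a line filter happens to work here, but it is fragile, as the comments point out.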
ADD COMMENTlink 9.0 years ago Cheng Zhongshan • 400 • updated 5.4 years ago Biostar 20

Tip: indent your code with 4 spaces to format it properly.

ADD REPLYlink 9.0 years ago
Neilfws
48k

You should never parse an XML document like this.

ADD REPLYlink 9.0 years ago
Pierre Lindenbaum
120k

Exactly, you got the query right; now you have to apply the same care to handling the response. By the way, try your code on a short sequence of length 10, or on a sequence starting with whitespace or containing N: it breaks. Instead of relying on incidental whitespace, use a Perl XML parser module, e.g. a subclass of XML::Parser.

ADD REPLYlink 9.0 years ago
Michael Dondrup
46k

Yeah, you are right in that case. I will revise it according to all of your suggestions. Thanks very much, everyone!

ADD REPLYlink 9.0 years ago
Cheng Zhongshan
• 400

Can BioMart do this job? I tried, but found that BioMart only provides human gene sets. If BioMart had the whole human genome, this work could be done with the biomaRt package in R.

ADD REPLYlink 8.8 years ago
Dejian
♦ 1.3k

The simple answer is: don't! Parsing HTML pages to get a result is highly error-prone, tedious, and easy to break. It is considered a last resort, only for cases where there is no other way to access the content of the resource programmatically.

For UCSC, the programmatic interfaces would be either DAS: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download23 or their public MySQL server: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29


Edit: added a code example that parses the XML with XML::XPath, using the DAS link you constructed. This fetches the sequence more reliably:

#!/usr/bin/env perl

use strict;
use warnings;
use LWP::Simple;
use XML::XPath;
use XML::XPath::XMLParser;

# Use DAS at UCSC to fetch a specific sequence by its chromosome position
my $URL_gene = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr2:100000,100100";

my $xml = get($URL_gene);

my $xp = XML::XPath->new(xml => $xml);

my $nodeset = $xp->find('/DASDNA/SEQUENCE/DNA/text()'); # find all sequence text nodes
# there should be only one node, anyway:
foreach my $node ($nodeset->get_nodelist) {
    my $seq = $node->getValue;
    $seq =~ s/\s//g; # remove whitespace
    print $seq, "\n";
}
ADD COMMENTlink 9.0 years ago Michael Dondrup 46k

Thanks. Actually, my first thought was the same as what you suggested, but I want to keep my Perl code simple and integrate it into another part of my program without too many extra dependencies. Anyhow, I will think my plan over again.

ADD REPLYlink 9.0 years ago
Cheng Zhongshan
• 400

I think you took a wrong turn at some point. If what you need is to extract DNA sequences from the human genome based on coordinates, querying the UCSC genome browser and scraping the resulting web pages is _not_ the way to go. You would be much better off downloading their latest assembly of the complete genome and extracting the wanted sequences from that.
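For example, once a chromosome FASTA file has been downloaded from UCSC, extracting a region is only a few lines of Perl. A minimal sketch, assuming a single-sequence file; the file name chr2.fa and the subroutine fetch_region are illustrative, not from UCSC, and coordinates are taken as 1-based and inclusive:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the 1-based, inclusive region [$start, $end] from a FASTA file
# that contains a single sequence (e.g. one downloaded chromosome).
sub fetch_region {
    my ($fasta, $start, $end) = @_;
    open my $fh, '<', $fasta or die "Cannot open $fasta: $!";
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;   # skip the FASTA header
        $line =~ s/\s+//g;       # strip the newline and any stray whitespace
        $seq .= $line;
    }
    close $fh;
    die "Region $start-$end is out of range\n" if $end > length $seq;
    return substr $seq, $start - 1, $end - $start + 1;
}

# Hypothetical usage, once chr2.fa has been downloaded:
# print fetch_region('chr2.fa', 100000, 200000), "\n";
```

Slurping the whole chromosome is crude but fine for a handful of queries; for heavier use, an indexed format such as UCSC's 2bit files with the twoBitToFa utility avoids reading the entire file each time.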

ADD COMMENTlink 9.0 years ago Lars Juhl Jensen 11k

You are right; downloading the whole genome would also be a good starting point for maintaining the code over the long term. Thanks!

ADD REPLYlink 9.0 years ago
Cheng Zhongshan
• 400

Here is the answer from GenomeBrowser FAQ: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download32

ADD REPLYlink 6.4 years ago
ivan.antonov
• 70

I was given a Perl script by the UCSC helpdesk, which I adapted to run on multiple coordinates.

It requires the CPAN module Bio::Das, which I installed via 'perl -MCPAN -e shell'. The input is a file with one or more lines of: "chr start end label" (the label can be anything you like).

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::Das;

#### USAGE
unless (@ARGV == 2) {
    die("USAGE: $0 <input BED file> <output FASTA file>\n\n");
}

#### Access files
open(my $input,  '<', $ARGV[0]) or die("Could not open input: $!\n");
open(my $output, '>', $ARGV[1]) or die("Could not open output: $!\n");

#### DAS server
my $das = Bio::Das->new(-source => 'http://genome.cse.ucsc.edu/cgi-bin/das/', -dsn => 'hg19');

#### Line counter
my $count = 0;

#### Work through each line of the BED file
while (defined(my $line = <$input>)) {
    #### Skip lines that do not begin with chr
    next unless $line =~ /^chr/;

    #### Count lines
    $count++;
    print "#$count $line\n";

    #### Split line into fields
    my ($chr, $start, $end, $label) = split(/\t/, $line);
    chomp($label);

    #### Define the segment of the genome
    my $segment = $das->segment("$chr:$start,$end");

    #### Get the DNA
    my $dna = $segment->dna;

    #### Print the sequence in FASTA format
    print $output ">$chr:$start-${end}_$label\n$dna\n";
}

#### Close files
close($input);
close($output);

exit;
ADD COMMENTlink 9.0 years ago Ian 5.4k • updated 9.0 years ago Casey Bergman 18k

As the other answers have said, it's usually not good to do things this way, and there are often alternatives (downloads, APIs). However, if you must...

...I'd suggest looking at WWW::Mechanize. It is a library that automates interaction with websites and can parse whatever is returned.

There is some documentation at the website (cookbook; examples) and, because it is widely used, many tutorials, such as this one. Just do a Google search for "perl mechanize tutorial".

There are also Mechanize libraries for other languages. I find the Ruby version very intuitive, concise, and easy to use.

ADD COMMENTlink 9.0 years ago Neilfws 48k

Yes, there is certainly no need to scrape a browser in order to retrieve sequences.

ADD REPLYlink 9.0 years ago
Neilfws
48k

If one actually wanted to scrape a web site, I would agree with you. But I think scraping the UCSC genome browser is entirely the wrong angle of attack in this case ;-)

ADD REPLYlink 9.0 years ago
Lars Juhl Jensen
11k
