Biostar Beta. Not for public use.
Question: Perl script running slow with big files
2
Entering edit mode

Hi all,

I've written one perl script, which compares two vcf files and write output in third file. it is running good with small files, but when i am inputting big files in Gbs, my system becomes slow and it would hang.

I'm using dbSNP vcf file as input which is around 9GB.

Please Suggest me something that will make my perl script run faster. This is my first perl script and i'm new to perl.

Any help would be appreciated !!

Thank you !!!!


#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use List::MoreUtils;


open (FILE, "<", "dbSNP_in.vcf") or die "failed to open: $!\n";
my @array=(<FILE>);
my @CHR;
my @location;
my @rs;
my @ref_n;
my @alt_n;
foreach (@array)
{
    chomp;
    my ($chrom, $pos, $id, $ref, $alt, $qual, $filter, $info) = split(/\t/, $_);

push @CHR, $chrom;
push @location, $pos;
push @rs, $id;
push @ref_n, $ref;
push @alt_n, $alt;
}


open (FILE1, "<", "trial_rs.vcf") or die "failed to open: $!\n";
my @array1=(<FILE1>);


open (OUT,">trial_output.vcf");
my @columns;

foreach (@array1)
{
    chomp;
    @columns=split(/\t/, $_);
    my $i;
    for ($i=0; $i<@array; $i++)
    {
        if (($columns[0] eq $CHR[$i]) and ($columns[1] eq $location[$i]) and ($columns[3] eq $ref_n[$i]) and ($columns[4] eq $alt_n[$i]))
        {
            $columns[2]=$rs[$i];


        }

    }

print OUT join("\t", @columns), "\n";
}
ADD COMMENTlink 5.6 years ago cvu • 130 • updated 5.6 years ago ole.tange ♦ 3.4k
Entering edit mode
0

"Please Suggest me something that will make my perl script run faster. This is my first perl script and i'm new to perl."

It seems obvious that one needs to post your script for this purpose, doesn't it?

ADD REPLYlink 5.6 years ago
Michael Dondrup
46k
Entering edit mode
0

This is the script

... moved ...

ADD REPLYlink 5.6 years ago
cvu
• 130
• updated 5.6 years ago
Michael Dondrup
46k
7
Entering edit mode

The problem is this:

my @array=(<FILE>); ## slurp a whole Terabyte into RAM

Do not read a file like this, unless it is very small or your memory is huge, instead use:

while (<FILE>) { # read each line, then forget about it

chomp;

split /t/;

....

}

In addition your code is extremely inefficient:

You are iterating over all entries of a huge array for each line of a huge file. Hashes in Perl provide constant time access to elements given the hash key. Use a hash of hashes to store your filter table.

You seem to filter a large file by a small file, therefore:

  • process the small file first (using while construct), parse the small file into a hash using the location as key of a nested hash
  • process the large file as above and look up each of the entries in the hash created before

As a result you will only need as much memory as is required to store the small file.

ADD COMMENTlink 5.6 years ago Michael Dondrup 46k
Entering edit mode
0

I guess I should have refreshed the page before answering, since my answer just echoes what you already wrote!

ADD REPLYlink 5.6 years ago
Devon Ryan
90k
Entering edit mode
0

Ok i'll try this way.

Thanks a lot Michael Dondrup !!!!!!

ADD REPLYlink 5.6 years ago
cvu
• 130
3
Entering edit mode

Consider the following refactoring:

use strict; use warnings; use autodie;

my ( @array, %hash );

open my $FILE, "<", "dbSNP_in.vcf";

while (<$FILE>) { $hash{"@array[0, 1, 3, 4]"} = $array[2] if @array = split /\t/, $_, 6; }

close $FILE;

open my $OUT, ">", "trial_output.vcf"; open my $FILE1, "<", "trial_rs.vcf";

while (<$FILE1>) { my @columns = split /\t/;

$columns[2] = $hash{"@columns[0, 1, 3, 4]"} if exists $hash{"@columns[0, 1, 3, 4]"};

print $OUT join "\t", @columns; }

close $OUT; close $FILE1;

  • autodie is used to trap all i/o errors.
  • chomp is not needed within either loop.
  • splitting only the amount needed is faster than splitting all.
  • An interpolated array slice ("@array[0, 1, 3, 4]", which produces a string comprised of elements 0, 1, 3, 4) is used as a hash key with a corresponding value of array element 2--for use later.
  • An interpolated array slice, as a hash key, is tested for existence, and $columns[2] is set to that key's corresponding value if that key exists. Doing this, instead of using four eq statements, should speed up the file processing.

Hope this helps!

ADD COMMENTlink 5.6 years ago Kenosis ♦ 1.2k
Entering edit mode
0

Thanks it helped me a lot !!

ADD REPLYlink 5.6 years ago
cvu
• 130
Entering edit mode
0

Are you sure chomp is not needed? Way back in time I learned that it is always good and safe to chomp. What sometimes happens to me when I forget to use chomp is that I get the newline character in the last column, this can yield hard to debug errors if you use a string from the last column.

ADD REPLYlink 5.6 years ago
Michael Dondrup
46k
Entering edit mode
0

My apologies for not explaining my claim...

In the first while loop, only the first five fields are used; the remaining are discarded, including the last field which contains the record separator that chomp would remove. In the second while loop, the record separator is retained and used later when the fields are joined and printed.

ADD REPLYlink 5.6 years ago
Kenosis
♦ 1.2k
2
Entering edit mode

I'd suggest using Tabix to index the larger file and use the tabix perl module to query it for each locus in the smaller VCF file. That's likely the most optimal solution in terms of time and memory.

http://samtools.sourceforge.net/tabix.shtml

ADD COMMENTlink 5.6 years ago Vivek ♦ 2.3k
1
Entering edit mode

There are better answers that address your specific issue. But they do not answer the general problem: What do you do if your perl script is slow?

You profile your code. NYTProf + kcachegrind are good tool to figure out which parts of your code needs attention:

perl -d:NYTProf testme.pl; 
nytprofcg; 
kcachegrind nytprof.callgrind

When you cannot make that faster, see if you can somehow parallelize the code. Can you divide your input into chunks, compute each chunk individually and merge the output when you are done? This technique is called map-reduce.

For smaller jobs (requiring fewer than 1000 CPUs) GNU Parallel can often be of help in doing the chunking of input and running of jobs. You just need to merge the final output.

If you still need more speed, you will need to look at a compiled language such as C++. Again you will use the profiler to figure out which parts of your code need the speedup. And instead of rewriting everything into C++ you can address the small part of the code that is the bottleneck. The good part about this is that much of error handling and corner cases often can be left to Perl.

ADD COMMENTlink 5.6 years ago ole.tange ♦ 3.4k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0