Question

Is Zlib Library Failing To Process Properly Bgzip Files?

6

Entering edit mode

10.1 years ago

Pablo Marin-Garcia ★ 2.0k

bgzip files are backward compatible with gzip, but I have issues when using bgzip compressed vcf files with snpeff (java) or perl scripts that uses IO::Uncompress::Gunzip (that I believe it uses zlib under the hood). In both cases the data is decompressed but truncated after a few hundred lines aprox. I could be totally wrong but I was wondering if zlib (or whatever gzip compatible library they are using) is getting confused with the bgzip bloks and only processing one or a few of them leaving the output incomplete.

perl code that does not work:

#!/usr/bin/env perl
use strict;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError) ;

my $infile = shift;
my $infh = IO::Uncompress::Gunzip->new( $infile ) or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
my $line_count = 0;
while (my $line=<$infh>){

    $line_count++
}
print "total lines read = $line_count\n";

This gives 419 lines

$ perl /home/pmg/tmp/test_zlib-bgzip.pl 460112_TTAGGC_L005_L006_C3HVJACXX.sorted.rmdup.varsit.vcf.gz
total lines read = 419

but using open with gzip pipe works:

#!/usr/bin/env perl
use strict;

my $infile = shift;
open(my $infh , 'gzip -dc '.$infile.' |');  # I can use bgzip intead gzip
my $line_count = 0;
while (my $line=<$infh>){

    $line_count++
}
print "total lines read = $line_count\n";

Gives the expected number of lines

$ perl /home/pmg/tmp/test_gzip-bgzip.pl 460112_TTAGGC_L005_L006_C3HVJACXX.sorted.rmdup.varsit.vcf.gz
total lines read = 652829

I googled about and I was unable to find quickly any relevant entry, but this is something that I am sure other people would have already faced. Do someone have a clue about why is this happening? I am using ubuntu 12.04.4 with perl 5.16

vcf • 6.0k views

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 10.1 years ago by Pablo Marin-Garcia ★ 2.0k

1

Entering edit mode

10.1 years ago

SES 8.6k

I'm not sure this will fix the bug/feature causing the issue, but the best way to read this type of data in Perl would be to use the gzip IO layer.

use PerlIO::gzip;

open my $fh, "<:gzip", "file.gz" or die $!;
print while <$fh>; # And it will be uncompressed...

That isn't in Perl core but I think it's a favorite trick of Perl developers. If you don't have that, I usually just use zcat:

open my $fh, '-|', 'zcat', $file or die $!;
print while <$fh>; # And it will be uncompressed...

This list form allows you to pass arguments without metacharacters being interpreted by the shell, but I'm pretty sure this form is not allowed on Windows. If the solution you mentioned works, this second approach will certainly work.

ADD COMMENT • link 10.1 years ago by SES 8.6k

0

Entering edit mode

I have not used PerlIO::gzip but if it uses the zlib library probably would have the same problem. Calling gzip as a pipe works OK.

ADD REPLY • link updated 3.1 years ago by Ram 43k • written 10.1 years ago by Pablo Marin-Garcia ★ 2.0k

1

Entering edit mode

You are correct. I actually tested PerlIO::gzip with some 1000 genomes data and it doesn't work, though the second solution does work (and is probably safer). For general usage though, PerlIO::gzip is awesome because it much easier than typing out all that IO::Uncompress stuff or using pipes and it's also more Perlish. In terms of speed, there is no difference between using a pipe with gzip -dc or the second solution I posted above.

ADD REPLY • link updated 3.1 years ago by Ram 43k • written 10.1 years ago by SES 8.6k

0

Entering edit mode

When I was pondering which gzip implementation to use, I read that PerlIO is faster but I decided for IO:: because is core and I have put the code in a method so I only needed write once all this verbosity. Regarding the pipe part, I always use the three parts open like yours. Is easy to write the code for reading or writting to the pipe BUT I was not able to find out how to write to the pipe AND redirect to a file like "open my $out, '| gzip -c > ' . $filename;", so I use both compress and decompress as two arguments open for symetry

ADD REPLY • link 10.1 years ago by Pablo Marin-Garcia ★ 2.0k

Ram · Accepted Answer · 2014-02-28

5

Entering edit mode

10.1 years ago

lh3 33k

See the SAM spec about the bug/feature in the Java library - it only reads the first gzip block. There is also a related thread in BioStar. I guess Perl is the same. On Unix/Linux, I usually use:

open(FH, $ARGV[0] =~ /\.gz$/? "gzip -dc $ARGV[0] |" : $ARGV[0]) || die;
while (<FH>) {...}

ADD COMMENT • link 10.1 years ago by lh3 33k

0

Entering edit mode

Thanks Heng, I have read many times the SAM specs, even the bgzip part but I forgot the footnote were it explain the kind of bug that has caught me twice this week using snpeff and VEP: "[2]It is worth noting that there is a known bug in the Java GZIPInputStream class that concatenated gzip archives cannot be successfully decompressed by this class. BGZF files can be created and manipulated using the built-in Java util.zip package, but naive use of GZIPInputStream on a BGZF file will not work due to this bug."

ADD REPLY • link updated 3.1 years ago by Ram 43k • written 10.1 years ago by Pablo Marin-Garcia ★ 2.0k

0

Entering edit mode

Thanks for posting this, it will surely save others some time in the future. (The quote is from the bottom of page 12 in the SAM specs, in case anyone is wanting a reference).

ADD REPLY • link updated 3.1 years ago by Ram 43k • written 10.1 years ago by SES 8.6k

score 2 · Accepted Answer · 2015-04-13

2

Entering edit mode

9.0 years ago

Andy Yates ▴ 120

Hi. I have just encountered the same problem and it seems there is a pure-Perl solution surrounding this as shown below

my $status = gunzip("in.gz" => "out.gz", MultiStream => 1)
  or die "gunzip failed: $GunzipError\n";

It also seems Apache Commons Compress can provide similar support for Java. Hope this helps

ADD COMMENT • link 9.0 years ago by Andy Yates ▴ 120