Off topic: How To Decompress 1000Genomes Bgzip-Compressed Files Using Java
13.1 years ago

(cross-posted on Stack Overflow)

Ok. I'm puzzled.

I want to use Java to read ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz in order to annotate some VCFs on the fly. (I don't want to download this file to my desktop; I really want to stream its bytes.)

But the program stops after reading a few lines.

Here is the minimal program for this problem:

import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
public class Test
    {
    public static void main(String args[]) throws Exception
        {
        int count=0;
        URL url=new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz");
        String line;
        BufferedReader in= new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));
        while((line=in.readLine())!=null)
            {
            ++count;
            System.err.println("["+count+"] "+line);
            }
        in.close();
        System.out.println("Done. nLines="+count);
        }
    }

Compile and run:

javac Test.java
java -Dftp.proxyHost=${MYPROXYHOST} -Dftp.proxyPort=${MYPROXYPORT} Test

And the output stops prematurely, partway through the 1012th line (both from home and from my workplace):

(...)
[999] 1    750138    rs61770171    G    A    .    PASS    DP=2189;AF=0.083;CB=UM,BI;EUR_R2=0.129;AFR_R2=0.164
[1000] 1    750153    .    T    C    .    PASS    DP=2555;AF=0.016;CB=UM,BI,BC;EUR_R2=0.167;AFR_R2=0.281
[1001] 1    750190    .    C    T    .    PASS    DP=3515;AF=0.003;CB=UM,BI;EUR_R2=0.581;AFR_R2=0.575
[1002] 1    750235    .    G    A    .    PASS    DP=3914;AF=0.019;CB=UM,BI,BC;EUR_R2=0.719;AFR_R2=0.733
[1003] 1    750436    .    C    T    .    PASS    DP=598;AF=0.020;CB=BI,BC;EUR_R2=0.144;AFR_R2=0.355
[1004] 1    750511    .    G    A    .    PASS    DP=806;AF=0.010;CB=BI,BC;AFR_R2=0.352
[1005] 1    750718    .    G    A    .    PASS    DP=2751;AF=0.003;CB=UM,BI,BC;EUR_R2=0.54;AFR_R2=0.545
[1006] 1    750897    .    G    A    .    PASS    DP=744;AF=0.010;CB=BI,BC;AFR_R2=0.479
[1007] 1    750946    .    A    G    .    PASS    DP=873;AF=0.010;CB=BI,BC;AFR_R2=0.414
[1008] 1    751043    .    G    A    .    PASS    DP=1522;AF=0.000;CB=BI,BC;EUR_R2=0.273
[1009] 1    751281    .    T    C    .    PASS    DP=403;AF=0.010;CB=BI,BC;AFR_R2=0.178
[1010] 1    751343    .    T    A    .    PASS    DP=1912;AF=0.117;CB=UM,BI;EUR_R2=0.683;AFR_R2=0.582
[1011] 1    751456    .    T    C    .    PASS    DP=1775;AF=0.008;CB=UM,BI;EUR_R2=0.515;AFR_R2=0.332
[1012] 1    
Done. nLines=1012

I was not the only one to have this problem: http://twitter.com/#!/neilswainston/status/43301088757157888

Re: 1000 genome. Don't think it's your problem. Try downloading it and uncompressing it manually - same result... (66kb file).

Using Internet Explorer and WinRAR, the file was reported as corrupted.

Using firefox for downloading the file, the browser said:

"Content Encoding Error :The page you are trying to view cannot be shown because it uses an invalid or unsupported form of compression. Please contact the website owners to inform them of this problem."

Using curl, it worked!

>curl "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz" -o ALL.2of4intersection.20100804.sites.vcf.gz
  % Total    % Received % Xferd  Average Speed  Time    Time    Time  Current
                                Dload  Upload  Total  Spent    Left  Speed
100  388M  100  388M    0    0  414k      0  0:16:00  0:16:00 --:--:--  566k
> md5sum ALL.2of4intersection.20100804.sites.vcf.gz
da386f5e2e0fa7e92c64e79691d0a8b8  ALL.2of4intersection.20100804.sites.vcf.gz ##CORRECT
> gunzip -t ALL.2of4intersection.20100804.sites.vcf.gz
> ls -la ALL.2of4intersection.20100804.sites.vcf
-rw-r--r-- 1 lindenb lindenb 1947373891 2011-03-03 17:39 ALL.2of4intersection.20100804.sites.vcf

Why? What's happening? How can it be fixed? Is there a problem with the bgzip compression?

Thanks for your help.

Pierre

UPDATE:

I solved my problem by using net.sf.samtools.util.BlockCompressedInputStream instead of java.util.zip.GZIPInputStream. The following code works:

import java.net.*;
import java.io.*;
import net.sf.samtools.util.BlockCompressedInputStream;
public class Test
    {
    public static void main(String args[]) throws Exception
        {
        URL url=new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz");
        String line;
        // BlockCompressedInputStream decodes the block-compressed (BGZF) data written by bgzip
        BufferedReader in= new BufferedReader(new InputStreamReader(new BlockCompressedInputStream(url.openStream())));
        while((line=in.readLine())!=null)
            {
            System.out.println(line);
            }
        in.close();
        System.out.println("Done.");
        }
    }
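
Some background on why the reader matters: a bgzip (BGZF) file is a series of small gzip members written one after another, each of which is a valid gzip stream on its own. A decompressor that stops after the first member returns only the first block of text, which would look a lot like the truncated output above. I have not confirmed that this is what happened with the FTP stream, so treat the following as a sketch of the file layout, not a diagnosis. The class name ConcatenatedGzipDemo and its helper methods are made up for the illustration, and it uses in-memory byte arrays instead of the 1000 Genomes URL:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: builds a two-member gzip stream in memory
// (the same general layout bgzip produces) and counts the lines read back.
class ConcatenatedGzipDemo
    {
    /** Compresses the text as one complete, standalone gzip member. */
    static byte[] gzipMember(String text)
        {
        try
            {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bytes))
                {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
                }
            return bytes.toByteArray();
            }
        catch (IOException e)
            {
            throw new UncheckedIOException(e);
            }
        }

    /** Appends b to a, giving a multi-member gzip stream. */
    static byte[] concat(byte[] a, byte[] b)
        {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
        }

    /** Decompresses with GZIPInputStream and counts the lines it yields. */
    static int countLines(byte[] gzipped)
        {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)),
                StandardCharsets.UTF_8)))
            {
            int n = 0;
            while (in.readLine() != null) n++;
            return n;
            }
        catch (IOException e)
            {
            throw new UncheckedIOException(e);
            }
        }

    public static void main(String[] args)
        {
        byte[] twoMembers = concat(gzipMember("line 1\n"), gzipMember("line 2\n"));
        System.out.println("lines read: " + countLines(twoMembers));
        }
    }
```

On a recent JDK I would expect both lines to be counted when the whole stream is already in memory. Whether GZIPInputStream carries on to the next member can depend on the Java version and on how the underlying network stream reports available bytes, which is one more reason to use a reader written for the block-compressed format, such as BlockCompressedInputStream.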