(Cross-posted on Stack Overflow.)
Ok. I'm puzzled.
I want to use Java to download ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz in order to annotate some VCFs on the fly. (I don't want to save this file to my desktop; I really want to stream its bytes.)
But the program stops after reading a few lines.
Here is the minimal program for this problem:
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;

public class Test
{
    public static void main(String args[]) throws Exception
    {
        int count=0;
        URL url=new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz");
        String line;
        // Decompress the FTP stream on the fly and read it line by line.
        BufferedReader in= new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));
        while((line=in.readLine())!=null)
        {
            ++count;
            System.err.println("["+count+"] "+line);
        }
        in.close();
        System.out.println("Done. nLines="+count);
    }
}
Compile and run:
javac Test.java
java -Dftp.proxyHost=${MYPROXYHOST} -Dftp.proxyPort=${MYPROXYPORT} Test
And the output stops prematurely after the 1012th line (both from home and from my workplace):
(...)
[999] 1 750138 rs61770171 G A . PASS DP=2189;AF=0.083;CB=UM,BI;EUR_R2=0.129;AFR_R2=0.164
[1000] 1 750153 . T C . PASS DP=2555;AF=0.016;CB=UM,BI,BC;EUR_R2=0.167;AFR_R2=0.281
[1001] 1 750190 . C T . PASS DP=3515;AF=0.003;CB=UM,BI;EUR_R2=0.581;AFR_R2=0.575
[1002] 1 750235 . G A . PASS DP=3914;AF=0.019;CB=UM,BI,BC;EUR_R2=0.719;AFR_R2=0.733
[1003] 1 750436 . C T . PASS DP=598;AF=0.020;CB=BI,BC;EUR_R2=0.144;AFR_R2=0.355
[1004] 1 750511 . G A . PASS DP=806;AF=0.010;CB=BI,BC;AFR_R2=0.352
[1005] 1 750718 . G A . PASS DP=2751;AF=0.003;CB=UM,BI,BC;EUR_R2=0.54;AFR_R2=0.545
[1006] 1 750897 . G A . PASS DP=744;AF=0.010;CB=BI,BC;AFR_R2=0.479
[1007] 1 750946 . A G . PASS DP=873;AF=0.010;CB=BI,BC;AFR_R2=0.414
[1008] 1 751043 . G A . PASS DP=1522;AF=0.000;CB=BI,BC;EUR_R2=0.273
[1009] 1 751281 . T C . PASS DP=403;AF=0.010;CB=BI,BC;AFR_R2=0.178
[1010] 1 751343 . T A . PASS DP=1912;AF=0.117;CB=UM,BI;EUR_R2=0.683;AFR_R2=0.582
[1011] 1 751456 . T C . PASS DP=1775;AF=0.008;CB=UM,BI;EUR_R2=0.515;AFR_R2=0.332
[1012] 1
Done. nLines=1012
I was not the only one to have this problem: http://twitter.com/#!/neilswainston/status/43301088757157888
Re: 1000 genome. Don't think it's your problem. Try downloading it and uncompressing it manually - same result... (66kb file).
Using Internet Explorer and WinRAR, the file was reported as corrupted.
Using Firefox to download the file, the browser said:
"Content Encoding Error: The page you are trying to view cannot be shown because it uses an invalid or unsupported form of compression. Please contact the website owners to inform them of this problem."
Using curl, it worked:
>curl "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz" -o ALL.2of4intersection.20100804.sites.vcf.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  388M  100  388M    0     0   414k      0  0:16:00  0:16:00 --:--:--  566k
> md5sum ALL.2of4intersection.20100804.sites.vcf.gz
da386f5e2e0fa7e92c64e79691d0a8b8 ALL.2of4intersection.20100804.sites.vcf.gz ##CORRECT
> gunzip -t ALL.2of4intersection.20100804.sites.vcf.gz
> ls -la ALL.2of4intersection.20100804.sites.vcf
-rw-r--r-- 1 lindenb lindenb 1947373891 2011-03-03 17:39 ALL.2of4intersection.20100804.sites.vcf
Why? What is happening? How can it be fixed? Is there a problem with the bgzip compression?
Thanks for your help.
Pierre
UPDATE:
I solved my problem by using net.sf.samtools.util.BlockCompressedInputStream instead of GZIPInputStream (the file is bgzip-compressed, i.e. a series of concatenated gzip blocks). The following code works:
import java.net.*;
import java.io.*;
import net.sf.samtools.util.BlockCompressedInputStream;

public class Test
{
    public static void main(String args[]) throws Exception
    {
        URL url=new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz");
        String line;
        // BlockCompressedInputStream reads the bgzip (BGZF) blocks one after another.
        BufferedReader in= new BufferedReader(new InputStreamReader(new BlockCompressedInputStream(url.openStream())));
        while((line=in.readLine())!=null)
        {
            System.out.println(line);
        }
        in.close();
        System.out.println("Done.");
    }
}
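
For anyone hitting the same thing: before choosing between GZIPInputStream and BlockCompressedInputStream, it may help to check whether a .gz stream really is bgzip (BGZF). Below is a minimal sketch using only the JDK. The class name BgzfCheck and its method names are made up for illustration. It assumes the 'BC' extra subfield is the first one in the gzip header, which is how bgzip writes it as far as I know.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Hypothetical helper (not part of the JDK or of samtools): detects a BGZF block header.
final class BgzfCheck {
    /** Fixed part of a BGZF block header, up to and including the BSIZE field. */
    static final int BGZF_HEADER_LENGTH = 18;

    /** The 28-byte empty BGZF block that bgzip appends as an end-of-file marker. */
    static final byte[] BGZF_EOF_BLOCK = {
        0x1f, (byte) 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, (byte) 0xff,
        0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00,
        0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
    };

    /**
     * Returns true if the bytes start like a BGZF block: gzip magic (0x1f 0x8b),
     * CM=8 (deflate), FEXTRA flag set, and a first extra subfield 'B','C' of length 2.
     * Assumption: the 'BC' subfield comes first in the extra field.
     */
    static boolean isBgzf(byte[] h) {
        if (h.length < BGZF_HEADER_LENGTH) return false;
        return (h[0] & 0xff) == 0x1f
            && (h[1] & 0xff) == 0x8b
            && h[2] == 8
            && (h[3] & 0x04) != 0   // FLG.FEXTRA
            && h[12] == 'B'         // SI1
            && h[13] == 'C'         // SI2
            && h[14] == 2           // SLEN, low byte
            && h[15] == 0;          // SLEN, high byte
    }

    /** Peeks at the first 18 bytes without consuming them, using mark/reset. */
    static boolean startsWithBgzfHeader(BufferedInputStream in) throws IOException {
        in.mark(BGZF_HEADER_LENGTH);
        byte[] header = new byte[BGZF_HEADER_LENGTH];
        int n = 0;
        while (n < header.length) {
            int r = in.read(header, n, header.length - n);
            if (r < 0) break;       // stream shorter than a BGZF header
            n += r;
        }
        in.reset();                 // give the peeked bytes back to the caller
        return n == header.length && isBgzf(header);
    }

    public static void main(String[] args) throws IOException {
        BufferedInputStream in =
            new BufferedInputStream(new ByteArrayInputStream(BGZF_EOF_BLOCK));
        System.out.println("BGZF? " + startsWithBgzfHeader(in));
    }
}
```

In the program above, one could wrap url.openStream() in a BufferedInputStream, call startsWithBgzfHeader on it, and then hand the same stream to BlockCompressedInputStream or GZIPInputStream depending on the answer.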