Where to find 1000 Genomes phase 3 whole-genome data and how to select only the European populations
5.7 years ago
Opal • 0

Hello:

I am trying to download whole-genome data from 1000 Genomes phase 3 and extract only the EUR populations (GBR, TSI, FIN, IBS, CEU). I downloaded this file from the FTP site:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz

but it does not appear to be the file I need. The error message says:

Error: No samples in .vcf file.

My question is: where do I get the whole-genome 1000 Genomes phase 3 data? I also checked the Data Slicer from Ensembl GRCh37. It allows population selection, but the maximum region that can be extracted is 2.5 Mb, so I could not get whole-genome data that way even if I succeeded in downloading a whole-genome dataset from the FTP site above (assuming one exists).

Opal

2.6 years ago
Menno ▴ 30

This is not exactly what you are asking for, but I came across your question while searching for the following:

If you are looking for PLINK files (.fam, .bim, .bed) for the 1000G phase 3 European (EUR) samples, the following is a good source:

https://ctg.cncr.nl/software/magma

The following commands will download and unpack this data for you.

wget https://ctg.cncr.nl/software/MAGMA/ref_data/g1000_eur.zip 
unzip g1000_eur.zip
5.7 years ago

The file that you want to download is 1.8 gigabytes. It will take a while to download, depending on your connection. Ensure that it downloads completely before trying to use it.
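As a rough way to check that a download finished, gzip can test the compressed stream. The sketch below wraps that in a helper; the function name is_complete_gz is made up for illustration, and this is a basic sanity check rather than a guarantee (comparing the file size or a checksum against the server is stronger).

```shell
# Hypothetical helper: succeeds only if the gzip stream in "$1" decompresses cleanly.
# gzip -t reads the whole file, so expect it to take a while on multi-GB VCFs.
is_complete_gz() {
    gzip -t "$1" 2>/dev/null
}

# Example usage (file name taken from the question above):
# is_complete_gz ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz \
#     && echo "looks complete" || echo "truncated or corrupt, download again"
```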

To view the data in a vcf.gz file, use zcat or bcftools view, or decompress it with gunzip.
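A "no samples" error usually means the VCF carries site information only. One way to check this without extra tools is to count the columns after FORMAT on the #CHROM header line. The helper below is a minimal sketch; the name count_vcf_samples is hypothetical, and gzip -dc is used instead of zcat because zcat behaves differently on some systems.

```shell
# Hypothetical helper: print the number of sample columns in a gzipped VCF.
# A VCF with genotypes has 9 fixed columns (#CHROM .. FORMAT) followed by one
# column per sample; a sites-only VCF stops at INFO (8 columns), so this prints 0.
count_vcf_samples() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { n = NF - 9; if (n < 0) n = 0; print n; exit }'
}

# Example usage:
# count_vcf_samples ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
# With bcftools installed, an equivalent check is: bcftools query -l FILE | wc -l
```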


You can also download the data on a per-chromosome basis:

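prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"

suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Autosomes only. chrX is left out because, as far as I know, its file in this
# release uses a different version suffix than v5a; check the FTP directory listing if you need it.
for chr in {1..22}; do
    # Fetch each VCF together with its tabix index (.tbi)
    wget "${prefix}${chr}${suffix}" "${prefix}${chr}${suffix}.tbi"
done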

You can then merge those into a single file or keep them separate. Either way, then download the 1000 Genomes PED file, which you can use for obtaining IDs for the purposes of filtering:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;
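As a sketch of how the PED file could be used for filtering, the helper below pulls out the sample IDs for the five EUR populations named in the question. It assumes the PED file is tab-separated with a header line, the sample ID in column 2 and the population code in column 7; please verify that layout on your copy first. The function name and the output file names are made up for illustration.

```shell
# Assumption: tab-separated PED with a header line, sample ID in column 2
# ("Individual ID") and population code in column 7 ("Population").
# Check with: head -n 2 20130606_g1k.ped
eur_samples_from_ped() {
    awk -F'\t' 'NR > 1 && $7 ~ /^(GBR|TSI|FIN|IBS|CEU)$/ { print $2 }' "$1"
}

# Example usage:
# eur_samples_from_ped 20130606_g1k.ped > eur_samples.txt
# The PED may list individuals that are not in the phase 3 VCFs, so
# --force-samples asks bcftools to warn about unknown IDs instead of failing.
# bcftools view -S eur_samples.txt --force-samples -Ob -o chr22.EUR.bcf \
#     ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
```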

Note that on macOS, wget may not be installed by default. You can install it with Homebrew: brew install wget

Kevin


Hi Kevin,

I actually think the file

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz

is not the right file, because it says 'No samples in .vcf file'.

The naming of this file is also different from the chromosome-specific files, such as the one listed below (sites.vcf.gz instead of genotypes.vcf.gz):

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz


Hi Kevin,

I have downloaded the vcf.gz files for chromosomes 1-22 (I don't need X and Y), but they are pretty big. Is there any way to concatenate them without unzipping? Also, could you elaborate on how to use the .ped file to extract only the EUR populations (GBR, CEU, TSI, IBS, FIN)?

Opal


Hey Opal. Yes, you can concatenate them without unzipping. I would even recommend converting them to BCF (the binary form of VCF), which saves even more space. You can then use BCFtools to concatenate them, e.g., with bcftools concat.
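One possible way to do this, as a sketch rather than a tested pipeline: build a file list in chromosome order and pass it to bcftools concat, writing compressed BCF directly from the compressed inputs. The function names, the directory argument and concat_list.txt are hypothetical; the file name pattern is the one from the per-chromosome download above.

```shell
# Print the 22 autosomal file names in chromosome order, one per line.
# "$1" is the directory holding the downloads (hypothetical layout).
list_autosome_vcfs() {
    chr=1
    while [ "$chr" -le 22 ]; do
        printf '%s/ALL.chr%s.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz\n' "$1" "$chr"
        chr=$((chr + 1))
    done
}

# Concatenate straight from the compressed inputs into one compressed BCF ("$2"),
# then index it. Requires bcftools on PATH.
concat_autosomes_to_bcf() {
    list_autosome_vcfs "$1" > concat_list.txt
    bcftools concat --file-list concat_list.txt -Ob -o "$2" && bcftools index "$2"
}

# Example usage:
# concat_autosomes_to_bcf /path/to/downloads ALL.autosomes.phase3.bcf
```

Keeping the list in chromosome order matters because bcftools concat joins the inputs in the order given.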
