What is the most complete VCF file of population allele frequencies that can be built or downloaded from public datasets?
4.9 years ago

What is the most complete VCF file of population allele frequencies that can currently be built or downloaded from public datasets?

About five years ago, the answer was the latest release of the 12 CSHL HapMap populations, which were also part of the 1000 Genomes Project. These were:

CHB CHD MEX GIH TSI LWK ASW MKK HCB JPT CEU YRI

For example, the Ensembl project currently provides these 12 populations as VCF files here:

http://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/

CSHL-HAPMAP-HAPMAP-CHB.vcf.gz
CSHL-HAPMAP-HAPMAP-CHD.vcf.gz
CSHL-HAPMAP-HAPMAP-MEX.vcf.gz
CSHL-HAPMAP-HAPMAP-GIH.vcf.gz
CSHL-HAPMAP-HAPMAP-TSI.vcf.gz
CSHL-HAPMAP-HAPMAP-LWK.vcf.gz
CSHL-HAPMAP-HAPMAP-ASW.vcf.gz
CSHL-HAPMAP-HAPMAP-MKK.vcf.gz
CSHL-HAPMAP-HapMap-HCB.vcf.gz
CSHL-HAPMAP-HapMap-JPT.vcf.gz
CSHL-HAPMAP-HapMap-CEU.vcf.gz
CSHL-HAPMAP-HapMap-YRI.vcf.gz

Each SNP has an AF entry, so these files can be combined into a single multi-population VCF with rsIDs, alleles and frequencies, using one AF field per population (AF_CHB, AF_CHD, etc.).
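
A minimal sketch of that merge, assuming each per-population file carries the frequency as an `AF=` key in the INFO column (I have not checked the exact layout of the Ensembl files, so treat this as an assumption). The function names and the example records are hypothetical, made up only to illustrate the idea:

```python
from collections import defaultdict


def parse_af_records(vcf_lines):
    """Yield ((chrom, pos, rsid, ref, alt), af) for each data line with an AF entry."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, rsid, ref, alt = fields[:5]
        for entry in fields[7].split(";"):  # column 8 is INFO
            if entry.startswith("AF="):
                # AF is kept as a string so multi-allelic values pass through unchanged
                yield (chrom, int(pos), rsid, ref, alt), entry[3:]
                break


def merge_population_afs(vcfs_by_pop):
    """Collect AF_<POP> values per variant from a {population: lines} mapping."""
    merged = defaultdict(dict)
    for pop, lines in vcfs_by_pop.items():
        for key, af in parse_af_records(lines):
            merged[key][f"AF_{pop}"] = af
    return merged


def format_merged(merged):
    """Render the merged variants as VCF-style data lines (no header)."""
    out = []
    for key in sorted(merged):
        chrom, pos, rsid, ref, alt = key
        info = ";".join(f"{k}={v}" for k, v in sorted(merged[key].items()))
        out.append("\t".join([chrom, str(pos), rsid, ref, alt, ".", ".", info]))
    return out


# Hypothetical records, not taken from the real files.
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
EXAMPLE_CHB = [HEADER, "1\t1000\trs1\tA\tG\t.\t.\tAF=0.25\n", "1\t2000\trs2\tC\tT\t.\t.\tAF=0.5\n"]
EXAMPLE_CEU = [HEADER, "1\t1000\trs1\tA\tG\t.\t.\tAF=0.1\n"]

merged_lines = format_merged(merge_population_afs({"CHB": EXAMPLE_CHB, "CEU": EXAMPLE_CEU}))
```

For the real files, the line lists could be replaced by handles from `gzip.open(path, "rt")`. Note that variants are matched on chromosome, position, rsID, REF and ALT, so a site reported with different alleles in two populations would end up as two separate records.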

The 1000 Genomes Project population documentation describes more than these 12 populations, but for the others I have not seen equivalent per-population VCFs with AFs computed from the individuals in each population. I have only seen them for the original HapMap 12, marked with a * below:

### Populations and codes

*  CHB  Han Chinese           Han Chinese in Beijing, China
*  JPT  Japanese              Japanese in Tokyo, Japan
   CHS  Southern Han Chinese  Han Chinese South
   CDX  Dai Chinese           Chinese Dai in Xishuangbanna, China
   KHV  Kinh Vietnamese       Kinh in Ho Chi Minh City, Vietnam
*  CHD  Denver Chinese        Chinese in Denver, Colorado (pilot 3 only)

*  CEU  CEPH                  Utah residents (CEPH) with Northern and Western European ancestry
*  TSI  Tuscan                Toscani in Italia
   GBR  British               British in England and Scotland
   FIN  Finnish               Finnish in Finland
   IBS  Spanish               Iberian populations in Spain

*  YRI  Yoruba                Yoruba in Ibadan, Nigeria
*  LWK  Luhya                 Luhya in Webuye, Kenya
   GWD  Gambian               Gambian in Western Division, The Gambia
   MSL  Mende                 Mende in Sierra Leone
   ESN  Esan                  Esan in Nigeria

*  ASW  African-American SW   African Ancestry in Southwest US
   ACB  African-Caribbean     African Caribbean in Barbados
   MXL  Mexican-American      Mexican Ancestry in Los Angeles, California
   PUR  Puerto Rican          Puerto Rican in Puerto Rico
   CLM  Colombian             Colombian in Medellin, Colombia
   PEL  Peruvian              Peruvian in Lima, Peru

*  GIH  Gujarati              Gujarati Indian in Houston, TX
   PJL  Punjabi               Punjabi in Lahore, Pakistan
   BEB  Bengali               Bengali in Bangladesh
   STU  Sri Lankan            Sri Lankan Tamil in the UK
   ITU  Indian                Indian Telugu in the UK

I presume more public data is now available for building a more complete VCF of allele frequencies per population, covering as many populations as possible, to use as a reference dataset alongside new data.
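
To make clear what I mean by AFs built from the individuals within a population, here is a rough sketch for a single biallelic record, assuming a multi-sample VCF with a GT field and a separate sample-to-population mapping. The function name, sample IDs and genotypes are all hypothetical:

```python
from collections import defaultdict


def population_afs(header_line, record_line, sample_to_pop):
    """Return {population: ALT allele frequency} for one biallelic VCF record.

    Only allele "1" is counted as ALT; missing alleles (".") are excluded
    from the denominator. Samples absent from sample_to_pop are ignored.
    """
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # column 9 is FORMAT

    alt_counts = defaultdict(int)
    called = defaultdict(int)
    for sample, call in zip(samples, fields[9:]):
        pop = sample_to_pop.get(sample)
        if pop is None:
            continue
        genotype = call.split(":")[gt_index]
        # treat phased (|) and unphased (/) genotypes the same way
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":
                continue
            called[pop] += 1
            if allele == "1":
                alt_counts[pop] += 1
    return {pop: alt_counts[pop] / called[pop] for pop in called}


# Hypothetical samples and genotypes, for illustration only.
EXAMPLE_HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4\n"
EXAMPLE_RECORD = "1\t1000\trs1\tA\tG\t.\t.\t.\tGT\t0|1\t1|1\t0|0\t./1\n"
EXAMPLE_PANEL = {"S1": "CHS", "S2": "CHS", "S3": "GBR", "S4": "GBR"}

example_afs = population_afs(EXAMPLE_HEADER, EXAMPLE_RECORD, EXAMPLE_PANEL)
```

The values returned per population could then be written out as AF_CHS, AF_GBR, and so on. Multi-allelic sites would need per-allele counting, which this sketch does not attempt.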

Thanks in advance.


Hello 14134125465346445!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/8838

This is typically not recommended as it runs the risk of annoying people in both communities.
