Question

PCA using HapMap population data

8.1 years ago

eze.anokian ▴ 10

Hi there,

this is my first post at BioStars. I am new to bioinformatics. I have a problem that should be easy to solve, but I cannot sort it out, so I would be glad if you could help me with it.

The question is simple. I have a VCF file with genotype data for many samples. The rows are SNPs, and the columns are the standard VCF columns (#CHROM, ..., INFO) followed by the sample IDs. I would like to filter out the non-European samples based on these genotype data, using HapMap as a reference. I was told I had to do a PCA. I have tried several tools for this (Shellfish, Beagle, SNPRelate), but I could not solve the problem. With SNPRelate I could run the PCA, but it only clusters my unlabeled samples, and I need to associate them with HapMap populations (CEU, YRI, JPT, CHB). Shellfish, on the other hand, fails with an uninformative error while running:

Exception: command gtool -P --ped shellfish-temp-15479/146134504516.ped --map shellfish-temp-15479/146134504516.map --og shellfish-temp-15479/146134504516.gen --os shellfish-temp-15479/146134504516.sample --discrete_phenotype 0 >> shellfish.log exited with code 256 (1)

And in file shellfish.log:

...
Note: No phenotypes present.
--recode to plink.ped + plink.map ... done.
Unknown parameter: 0

What steps and tools would you recommend? I can use any tool you think is suitable for this.

Sorry for this, it may be an easy problem, but I have spent 2 days trying several tools.

Many thanks.

PCA HapMap VCF genotype • 4.8k views

updated 4.9 years ago by zx8754 11k • written 8.1 years ago by eze.anokian ▴ 10

score 9 · Accepted Answer · 2016-03-20

Hi,

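I don't have experience with those tools, but I usually use PLINK for this.

First convert your VCF file to PLINK format, for example with vcftools (http://vcftools.sourceforge.net/), and then to the binary format that --bfile expects (called fileA below):

./vcftools --vcf input_data.vcf --plink --out output_in_plink
plink --file output_in_plink --make-bed --out fileA --noweb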

Then download the hapmap data from plink: http://pngu.mgh.harvard.edu/~purcell/plink/dist/hapmap_r23a.zip

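Then extract the SNP lists from both datasets (fileA is your data, fileB is the HapMap data) and filter each dataset on the other's list, so that only the overlapping SNPs remain. Note that --write-snplist writes a file with the .snplist extension.

plink --bfile fileA --write-snplist --out list1 --noweb
plink --bfile fileB --extract list1.snplist --noweb --out fileB_filtered --make-bed
plink --bfile fileB_filtered --write-snplist --out list2 --noweb
plink --bfile fileA --extract list2.snplist --noweb --out fileA_filtered --make-bed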
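Before merging, it can help to check how many SNPs the two datasets actually share. The following is a minimal Python sketch, assuming only the standard PLINK .bim layout (variant ID in the second whitespace-separated column). The helper names and the tiny in-memory example are made up for illustration; in practice you would open the real .bim files instead.

```python
import io


def read_bim_snp_ids(handle):
    """Return the set of variant IDs (2nd column) from an open PLINK .bim file."""
    ids = set()
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def shared_snps(bim_a, bim_b):
    """Return the variant IDs present in both .bim files."""
    return read_bim_snp_ids(bim_a) & read_bim_snp_ids(bim_b)


# Made-up example data standing in for fileA.bim and fileB.bim.
study_bim = io.StringIO("1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n")
hapmap_bim = io.StringIO("1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tG\tA\n")

overlap = shared_snps(study_bim, hapmap_bim)
print(sorted(overlap))
```

If the overlap turns out to be very small, the two datasets probably do not use the same variant naming scheme (for example, missing rsIDs in the VCF), and filtering on SNP names will not work until that is fixed.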

Then merge the files and make an MDS plot (--mds-plot has to be used together with --cluster):

plink --bfile fileA_filtered --bmerge fileB_filtered.bed fileB_filtered.bim fileB_filtered.fam --noweb --out merged --make-bed

plink --bfile merged --cluster --mds-plot 2 --noweb --out mds

The resulting mds.mds file can easily be plotted using R or even Excel, and then you can see which of your samples cluster with which HapMap population, i.e. their likely ethnic background.