PCA using HapMap population data
8.1 years ago
eze.anokian ▴ 10

Hi there,

This is my first post at BioStars. I am just starting out as a bioinformatician. I have a problem that should be easy to solve but that I cannot sort out, so I would be glad if you could help me with it.

The question is simple. I have a VCF file with genotype data for many samples. The SNPs are in the rows, and the columns are those of a typical VCF (#CHROM, ..., INFO) followed by the sample IDs. I would like to filter out the non-European samples based on these genotype data, using HapMap as a reference. I was told I had to do a PCA. I have tried several tools for this (Shellfish, Beagle, SNPRelate) but could not solve the problem. With SNPRelate I could run the PCA, but it only clusters unlabeled samples, and I need to associate them with HapMap populations (CEU, YRI, JPT, CHB). Shellfish, on the other hand, fails with an uninformative error while running:

Exception: command gtool -P --ped shellfish-temp-15479/146134504516.ped --map shellfish-temp-15479/146134504516.map --og shellfish-temp-15479/146134504516.gen --os shellfish-temp-15479/146134504516.sample --discrete_phenotype 0 >> shellfish.log exited with code 256 (1)

And in file shellfish.log:

...
Note: No phenotypes present.
--recode to plink.ped + plink.map ... done.
Unknown parameter: 0

What steps and tools would you recommend? I can use any tool you think is suitable for this.

Sorry for this, it may be an easy problem, but I have spent 2 days trying several tools.

Many thanks.

PCA HapMap VCF genotype • 4.8k views
8.1 years ago
Floris Brenk ★ 1.0k

Hi,

I don't have experience with those tools, but I usually use plink.

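First convert your VCF file to plink format, for example with vcftools (http://vcftools.sourceforge.net/). The --plink option writes text .ped/.map files, so convert them to the binary .bed/.bim/.fam format that --bfile expects in the following steps:

./vcftools --vcf input_data.vcf --plink --out output_in_plink
plink --file output_in_plink --make-bed --noweb --out fileA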

Then download the HapMap data from the plink website: http://pngu.mgh.harvard.edu/~purcell/plink/dist/hapmap_r23a.zip

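Then write out the SNP list of each dataset and filter on these lists to keep only the overlapping SNPs (here fileA is your data and fileB is the HapMap data):

plink --bfile fileA --write-snplist --out list1 --noweb
plink --bfile fileB --extract list1.snplist --noweb --out fileB_filtered --make-bed
plink --bfile fileB_filtered --write-snplist --out list2 --noweb
plink --bfile fileA --extract list2.snplist --noweb --out fileA_filtered --make-bed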
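
Before merging, it can help to check how many SNPs the two datasets actually share. Below is a minimal Python sketch, assuming only the standard .bim layout (SNP ID in the second whitespace-separated column). The function names and the toy data are made up for illustration and are not part of plink:

```python
import io


def snp_ids(bim_handle):
    """Return the SNP IDs (second column) of a PLINK .bim file, in file order."""
    return [line.split()[1] for line in bim_handle if line.strip()]


def shared_snps(bim_a, bim_b):
    """Return the SNP IDs present in both .bim files, in the order of bim_a."""
    ids_b = set(snp_ids(bim_b))
    return [snp for snp in snp_ids(bim_a) if snp in ids_b]


# Toy data standing in for fileA.bim and fileB.bim (hypothetical SNPs).
bim_a = io.StringIO("1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n1\trs3\t0\t3000\tA\tC\n")
bim_b = io.StringIO("1\trs2\t0\t2000\tC\tT\n1\trs3\t0\t3000\tA\tC\n1\trs4\t0\t4000\tG\tT\n")
overlap = shared_snps(bim_a, bim_b)
print(len(overlap), overlap)
```

With real data you would pass open("fileA.bim") and open("fileB.bim") instead of the toy handles. A very small overlap may mean the SNPs are named differently in the two files.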

Then merge the files and make an MDS plot:

plink --bfile fileA_filtered --bmerge fileB_filtered.bed fileB_filtered.bim fileB_filtered.fam --noweb --out merged --make-bed

plink --bfile merged --cluster --mds-plot 2 --noweb --out mds

The resulting mds.mds file (columns C1 and C2) can easily be plotted using R or even Excel, and you can then see which of your samples cluster with which HapMap population, i.e. which ethnic background they are derived from.
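
If you prefer to pick out the European-like samples programmatically rather than by eye, something along these lines might work. It is only a sketch under several assumptions: the .mds file has the usual FID IID SOL C1 C2 header, individual IDs are unique, you have built a dictionary mapping each HapMap sample ID to its population (the IDs below are hypothetical), and "close to CEU" is defined by an arbitrary radius of 3 times the largest distance of a CEU sample from the CEU centroid. Check any such cut-off against the plot.

```python
import io
import math


def read_mds(handle):
    """Parse a PLINK .mds file into {IID: (C1, C2)}."""
    header = handle.readline().split()
    iid, c1, c2 = (header.index(name) for name in ("IID", "C1", "C2"))
    coords = {}
    for line in handle:
        fields = line.split()
        if fields:
            coords[fields[iid]] = (float(fields[c1]), float(fields[c2]))
    return coords


def near_reference(coords, pops, ref_pop="CEU", scale=3.0):
    """Return the unlabeled samples lying close to the reference cluster.

    pops maps reference sample ID -> population; samples absent from pops
    are treated as study samples. The radius is a heuristic, not a standard.
    """
    ref = [xy for sample, xy in coords.items() if pops.get(sample) == ref_pop]
    if not ref:
        raise ValueError(f"no {ref_pop} samples found in the MDS output")
    cx = sum(x for x, _ in ref) / len(ref)
    cy = sum(y for _, y in ref) / len(ref)
    radius = scale * max(math.hypot(x - cx, y - cy) for x, y in ref)
    return sorted(
        sample
        for sample, (x, y) in coords.items()
        if sample not in pops and math.hypot(x - cx, y - cy) <= radius
    )


# Toy MDS output: three reference samples and two study samples (made-up values).
demo_mds = io.StringIO(
    " FID IID SOL C1 C2\n"
    " 1 REF1 0 0.10 0.10\n"
    " 1 REF2 0 0.12 0.08\n"
    " 2 REF3 0 -0.50 0.40\n"
    " 3 S1 0 0.11 0.09\n"
    " 3 S2 0 -0.48 0.41\n"
)
demo_pops = {"REF1": "CEU", "REF2": "CEU", "REF3": "YRI"}
demo_coords = read_mds(demo_mds)
print(near_reference(demo_coords, demo_pops))  # ['S1']
```

The IDs returned this way can then be written to a file and used to subset the original data.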
