PCA using HapMap population data
2.7 years ago
eze.anokian • 10

Hi there,

This is my first post at BioStars. I am just starting out as a bioinformatician. I have a problem that should be easy to solve, but I cannot work it out, so I would be glad of any help.

The question is simple. I have a VCF file with genotype data for many samples. SNPs are in the rows, and the columns are the standard VCF ones (#CHROM, ..., INFO) followed by the sample IDs. I would like to filter out the non-European samples based on these genotypes, using HapMap as a reference. I was told I had to do a PCA. I have tried several tools for this (Shellfish, Beagle, SNPRelate), but could not solve the problem. With SNPRelate I could run the PCA, but it only clusters my unlabeled samples, and I need to relate them to the HapMap populations (CEU, YRI, JPT, CHB). Shellfish, on the other hand, fails with an uninformative error while running:

Exception: command gtool -P --ped shellfish-temp-15479/146134504516.ped --map shellfish-temp-15479/146134504516.map --og shellfish-temp-15479/146134504516.gen --os shellfish-temp-15479/146134504516.sample --discrete_phenotype 0 >> shellfish.log exited with code 256 (1)

And in file shellfish.log:

...
Note: No phenotypes present.
--recode to plink.ped + plink.map ... done.
Unknown parameter: 0

What steps and tools would you recommend? I can use any tool you think is suitable for this.

Sorry for this, it may be an easy problem, but I have spent 2 days trying several tools.

Many thanks.

15 months ago
Floris Brenk • 890
USA

Hi,

I don't have experience with those tools, but I usually use plink.

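First convert your VCF file to plink format, for example with vcftools (http://vcftools.sourceforge.net/). This writes text-format .ped/.map files, while the commands further down use --bfile, which expects a binary fileset, so also convert to .bed/.bim/.fam (here named fileA to match the commands below):

./vcftools --vcf input_data.vcf --plink --out output_in_plink
plink --file output_in_plink --make-bed --out fileA --noweb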

Then download the HapMap data from the plink website: http://pngu.mgh.harvard.edu/~purcell/plink/dist/hapmap_r23a.zip

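Then extract the SNP lists from both datasets and filter each dataset on the other's list, so that both keep only the overlapping SNPs (fileA and fileB stand for your two filesets; --write-snplist writes its output to <out>.snplist):

plink --bfile fileA --write-snplist --out list1 --noweb
plink --bfile fileB --extract list1.snplist --noweb --out fileB_filtered --make-bed
plink --bfile fileB_filtered --write-snplist --out list2 --noweb
plink --bfile fileA --extract list2.snplist --noweb --out fileA_filtered --make-bed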
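
If you prefer to compute the overlap yourself, the shared SNP list can also be taken straight from the two .bim files, since column 2 of a .bim file holds the SNP ID. This is only a rough sketch: the function name common_snps and the file names in the comments are placeholders of mine, and it assumes both datasets use the same ID scheme (e.g. rs numbers).

```shell
# Sketch: print the SNP IDs shared by two PLINK binary filesets.
# Assumes both .bim files carry comparable IDs (e.g. rs numbers) in column 2.
common_snps() {
    # $1 and $2 are paths to the two .bim files
    awk 'NR == FNR { seen[$2] = 1; next }   # first file: remember every SNP ID
         ($2 in seen) { print $2 }          # second file: print IDs already seen
        ' "$1" "$2"
}

# Hypothetical usage (file names are placeholders):
#   common_snps fileA.bim fileB.bim > common.snplist
#   plink --bfile fileA --extract common.snplist --make-bed --out fileA_filtered --noweb
#   plink --bfile fileB --extract common.snplist --make-bed --out fileB_filtered --noweb
```

Matching on the ID only does not check that position or alleles agree between the two datasets, so treat the resulting list as a starting point.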

Then merge the files and make an MDS plot:

plink --bfile fileA_filtered --bmerge fileB_filtered.bed fileB_filtered.bim fileB_filtered.fam --noweb --out merged --make-bed

plink --bfile merged --mds-plot 2 --noweb --out mds

The resulting coordinates can easily be plotted using R or even Excel, and then you can see which HapMap population each of your samples clusters with, i.e. its likely ethnic background.
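
To make the plot readable you need a population label next to each point. A minimal sketch of that step is below. It assumes the usual mds.mds layout (a header line, then FID IID SOL C1 C2) and a two-column file of HapMap sample IDs and populations that you would have to prepare yourself; that label file and the function name annotate_mds are hypothetical, not something plink produces.

```shell
# Sketch: attach a population label to every row of a PLINK .mds file.
# $1: the .mds file (header line, then FID IID SOL C1 C2)
# $2: hypothetical two-column label file: IID POPULATION (e.g. "NA06985 CEU")
annotate_mds() {
    awk 'NR == FNR { pop[$1] = $2; next }   # label file: map IID to population
         FNR == 1  { next }                 # skip the .mds header line
         {
             # samples without a HapMap label are taken to be study samples
             label = ($2 in pop) ? pop[$2] : "STUDY"
             print $2, $4, $5, label        # IID, C1, C2, population
         }' "$2" "$1"
}

# Hypothetical usage:
#   annotate_mds mds.mds hapmap_pops.txt > mds_labeled.txt
```

The output has one line per sample (IID, C1, C2, label), which can be loaded into R or Excel and coloured by the last column; study samples that fall inside the CEU cluster are the ones to keep.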
