I'm looking to understand how best to estimate ethnicity from a single-sample VCF. That is, I want to take a VCF file and estimate that the person it came from is, say, 80% Caucasian and 20% Asian. I'd like to resolve this at least to the level of the 5 super-populations of the 1000 Genomes Project, and ideally to its 26 sub-populations.

I've read about approaches using BEAGLE and other tools that do this well when analyzing a set of VCFs. I'm not sure that is helpful here, as I am interested in something that could perform this analysis on a new sample without rerunning it on the entire set.

Does anyone have any pointers?

If you want ancestry estimates for your sample, probably the easiest way is to use ADMIXTURE, a software tool for maximum likelihood estimation of individual ancestries from multilocus SNP genotype datasets. It uses the same statistical model as STRUCTURE but calculates estimates much more rapidly using a fast numerical optimization algorithm (as described on the website: https://www.genetics.ucla.edu/software/admixture/).
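To make the model concrete for the single-sample case, here is a minimal sketch (not ADMIXTURE's own code) that estimates ancestry proportions for one individual by EM, assuming the per-population allele frequencies are already known from a reference panel and held fixed. The function name `estimate_ancestry`, the simulated frequencies, and the 80/20 mix are all made up for illustration.

```python
import numpy as np

def estimate_ancestry(genotypes, ref_freqs, n_iter=200, eps=1e-9):
    """EM estimate of ancestry proportions for ONE sample.

    genotypes : (n_snps,) alt-allele counts, each 0, 1 or 2
    ref_freqs : (n_pops, n_snps) alt-allele frequency in each reference population
    """
    g = np.asarray(genotypes, dtype=float)
    p = np.clip(np.asarray(ref_freqs, dtype=float), eps, 1 - eps)
    n_pops = p.shape[0]
    q = np.full(n_pops, 1.0 / n_pops)      # start from equal proportions
    for _ in range(n_iter):
        # Posterior probability that an alt / ref allele copy came from each population
        alt = q[:, None] * p
        ref = q[:, None] * (1 - p)
        alt /= alt.sum(axis=0)
        ref /= ref.sum(axis=0)
        # Each SNP contributes g alt copies and (2 - g) ref copies
        q = (alt * g + ref * (2 - g)).sum(axis=1) / (2 * g.size)
    return q

# Simulated example: 3 hypothetical populations, one sample that is 80% / 20% / 0%
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.95, size=(3, 5000))
true_q = np.array([0.8, 0.2, 0.0])
sample = rng.binomial(2, true_q @ freqs)
q_hat = estimate_ancestry(sample, freqs)
print(np.round(q_hat, 3))
```

Because the reference frequencies stay fixed, a new sample can be scored without re-fitting the whole reference set. Real populations are far less differentiated than these simulated ones, so real estimates will be noisier.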

I recommend following the manual and running the program. Don't forget to remove related individuals from your dataset (you can use PLINK's PI_HAT value for this), and do QC of your individuals and SNPs (you can do this using PLINK as well).
Before running ADMIXTURE you will have to prune your SNPs, i.e. remove SNPs that are in LD with each other. Check the `--indep` or `--indep-pairwise` options in PLINK. PLINK can also read VCF files as long as the sites are bi-allelic.
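As a rough sketch of the import and pruning steps, the snippet below only builds and prints PLINK 1.9 command lines; it does not run them. The file names (`sample.vcf.gz`, `sample`) and the window/step/r² values (50, 5, 0.2) are placeholder assumptions, not settings taken from this thread, so adjust them for your data.

```python
import shlex

def plink_prune_commands(vcf, prefix, window=50, step=5, r2=0.2):
    """Return PLINK 1.9 argument lists for VCF import and LD pruning."""
    return [
        # 1. Import the VCF, keeping only bi-allelic sites
        ["plink", "--vcf", vcf, "--biallelic-only", "strict",
         "--make-bed", "--out", prefix],
        # 2. Find SNPs in approximate linkage equilibrium (writes <prefix>.prune.in)
        ["plink", "--bfile", prefix,
         "--indep-pairwise", str(window), str(step), str(r2),
         "--out", prefix],
        # 3. Keep only the pruned-in SNPs
        ["plink", "--bfile", prefix, "--extract", f"{prefix}.prune.in",
         "--make-bed", "--out", f"{prefix}.pruned"],
    ]

commands = plink_prune_commands("sample.vcf.gz", "sample")
for cmd in commands:
    print(shlex.join(cmd))
```

If PLINK is on your PATH, each list can be passed to `subprocess.run(cmd, check=True)` in order.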

Depending on what type of analysis you want to conduct, you may want to explore more sophisticated methods. For instance, check ChromoPainter, which uses haplotype information and not just allele frequencies in order to estimate "more accurate" ancestry proportions: http://www.paintmychromosomes.com/ However, you will need to phase your data first, and it can be a bit more complicated to run.

HTH

To find the ethnic sub-group your sample falls in, you could pick a set of common SNPs (MAF > 5% within each sub-population group) shared between your sample and the 1000 Genomes data. Do a PCA of the 1000 Genomes samples using EIGENSOFT's smartpca and project your sample into that pre-computed space to see which sub-population group it clusters with.
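The projection idea can be sketched with plain NumPy rather than smartpca itself. The example assumes genotypes are coded as 0/1/2 alt-allele counts over the same SNPs in the reference panel and the new sample. The two simulated populations "A" and "B" and the helper names are hypothetical stand-ins for the 1000 Genomes groups.

```python
import numpy as np

def fit_reference_pca(ref_genotypes, n_components=2):
    """Fit PCA on a (n_samples, n_snps) reference genotype matrix."""
    X = np.asarray(ref_genotypes, dtype=float)
    mean = X.mean(axis=0)
    p = mean / 2.0                       # estimated alt-allele frequency
    scale = np.sqrt(p * (1 - p))         # per-SNP normalisation
    keep = scale > 0                     # drop monomorphic SNPs
    Z = (X[:, keep] - mean[keep]) / scale[keep]
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return {"keep": keep, "mean": mean[keep], "scale": scale[keep],
            "loadings": vt[:n_components].T}

def project(model, genotypes):
    """Project one or more samples into the pre-computed PC space."""
    g = np.atleast_2d(np.asarray(genotypes, dtype=float))[:, model["keep"]]
    return ((g - model["mean"]) / model["scale"]) @ model["loadings"]

# Simulated reference panel: 50 samples from each of two hypothetical populations
rng = np.random.default_rng(1)
n_snps = 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = rng.uniform(0.1, 0.9, n_snps)
ref = np.vstack([rng.binomial(2, freq_a, size=(50, n_snps)),
                 rng.binomial(2, freq_b, size=(50, n_snps))])
labels = np.array(["A"] * 50 + ["B"] * 50)

model = fit_reference_pca(ref)
ref_pcs = project(model, ref)
centroids = {lab: ref_pcs[labels == lab].mean(axis=0) for lab in ("A", "B")}

# New sample drawn from population A, projected without refitting the PCA
new_pc = project(model, rng.binomial(2, freq_a))[0]
nearest = min(centroids, key=lambda lab: np.linalg.norm(new_pc - centroids[lab]))
print(nearest)
```

Nearest-centroid assignment is just one simple way to read off the cluster; plotting the projected point against the reference PCs is often more informative.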

I haven't calculated the ethnic percentages before, but I think there might be a way to do it using a measure of variance within clusters.
