A Shift In Pca Plot For Population Stratification
1
9
Entering edit mode
10.1 years ago
User 1933 ▴ 340

To do the population comparison between cohort of patient and 1KG. So, I have converted their VCF file to PED format;

vcftools --gzvcf 1000g/1000g_myvariants.vcf.gz --plink --out 1000g
vcftools --vcf myvariants.vcf --plink --out myvariants

and then, I took variant with snp ID

grep -o 'rs[0-9]*' 1000g_myvariants.map > rs.snplist.raw

and sorted and removed those were duplicated

sort rs.snplist.raw | uniq > rs.snplist.dedup

then I removed those were not matched allele codes

plink --file myvariants --extract rs.snplist.dedup --exclude all.missnp --recode --out myvariants.subset
plink --file 1000g_myvariants --extract rs.snplist.dedup --exclude all.missnp --recode --out 1000g_myvariants.subset

and finally I merged them

plink --file 1000g_myvariants.subset --merge myvariants.subset.ped myvariants.subset.map --recode --out all

and I created MDS plot

plink --file all --read-genome all.genome --cluster --mds-plot 2 --out all_mds_2

and plotted component 2 versus component 1

tab = read.table("plink.mds", h = T)
tab$pop = factor(c(rep("1KG", 1212), rep("mycohort", 285)))
plot(tab$C1, tab$C2, col=as.integer(tab$pop),xlab="eigenvector 2", ylab="eigenvector 1")

and here is how the result look like,

PCA

basically, there is a shift which I am curious what could be the reason ? do I have to filter more SNP to get the right match ? is there any other tools to run PCA rather than PLINK?

Is the 1000 genome variants some how normalized while the other cohort is not ?

pca exome-sequencing • 6.6k views
ADD COMMENT
0
Entering edit mode

Are the black dots individuals from 1000genomes? Which dataset are you using, exactly? Check if the separate groups are due to different sequencing technology.

ADD REPLY
0
Entering edit mode

yes, there are 1212 individuals in 1KG which are represented by black in the plot. mm, the 1KG sequencing has been done both with Illumina and ABI sequencing; I feel I should have normalize these two cohort separately somehow before hand.

ADD REPLY
0
Entering edit mode

Have you tried doing factor analysis to see which SNPs are underlying this? You could also just look at the rotated data (if you were to do the PCA with the prcomp() function in R, this would be output$x). That's the next thing I would try.

ADD REPLY
2
Entering edit mode
8.9 years ago
Zhenyu Zhang ★ 1.2k

Hi. I am wondering if you have figured out the reason. I recently did a similar smartpca analysis, and also see such phenomena of population stratification shift.

ADD COMMENT
2
Entering edit mode

OK, let me answer myself. I filter the data with HWE, and then it looks great now.

ADD REPLY
1
Entering edit mode

also you have to append 1KG data and yours and doing one pca. not two separate.

ADD REPLY

Login before adding your answer.

Traffic: 1674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6