Question

Problems with PCA for genotyped data

0

5.1 years ago

doodle ▴ 30

Hello,

I have two genotyped data sets, A and B, from two very different populations. PCA on A shows population clusters within the data without any pruning. Pruning removes most of the SNPs.

For the second part, I have to merge A and B and do a PCA on the merged data. This does not show any clusters without pruning, and pruning made little difference.

Thirdly, I tried doing a PCA only on data set B and this also doesn't show population clusters with or without pruning. But from my phenotype data, I know that there is variation.

I ran the PCA on bfiles in PLINK using the --pca flag.

Any suggestions please?

Thank you!

genotyped data pca • 2.8k views

ADD COMMENT • link 5.1 years ago by doodle ▴ 30

score 0 · Answer 1 · 2019-03-07

0

5.1 years ago

Kevin Blighe 87k

What cut-offs are you using for pruning? That is key. Why are you so convinced that there should be population structure / clusters in B?

ADD COMMENT • link 5.1 years ago by Kevin Blighe 87k

0

I did --indep-pairwise using a cutoff of 50 5 0.2. What can be changed here?

ADD REPLY • link 5.1 years ago by doodle ▴ 30

0

There should be clusters in B because the phenotype data shows they are coming from different geographical regions.

ADD REPLY • link 5.1 years ago by doodle ▴ 30

0

Has it been shown already that these geographical regions have distinct genetic profiles?

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

0

Do you understand to what each of these numbers relates? You may need to adjust them based on your SNP density.

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

0

What I understand is that 50 is the window size within which highly correlated variants are removed, 5 is the step size, and 0.2 is the r² threshold. I don't understand how I can adjust them based on SNP density. Can you please help with that?

ADD REPLY • link 5.1 years ago by doodle ▴ 30
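
To make the three --indep-pairwise numbers concrete, here is a minimal Python sketch of window-based pruning. It is a simplified illustration, not PLINK's actual implementation: the function names `r_squared` and `prune` are made up for this example, genotypes are assumed to be 0/1/2 allele counts with no missing calls, and the later variant of a correlated pair is always the one dropped. As in PLINK, a window given without a kb suffix is counted in variants, not in base pairs.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic variant: treat as uncorrelated
    return cov * cov / (var_x * var_y)


def prune(genotypes, window=50, step=5, r2_threshold=0.2):
    """Return indices of variants kept after simplified LD pruning.

    genotypes: list of per-variant genotype lists, ordered by position.
    window and step are counted in variants.
    """
    removed = set()
    start = 0
    while start < len(genotypes):
        end = min(start + window, len(genotypes))
        in_window = [i for i in range(start, end) if i not in removed]
        for pos, i in enumerate(in_window):
            if i in removed:
                continue
            for j in in_window[pos + 1:]:
                if j in removed:
                    continue
                if r_squared(genotypes[i], genotypes[j]) > r2_threshold:
                    removed.add(j)  # simplification: drop the later variant
        start += step
    return [i for i in range(len(genotypes)) if i not in removed]


# Toy data: variants 0 and 3 are identical, the two in between are unrelated.
toy = [
    [0, 1, 2, 0, 1, 2],
    [1, 0, 1, 1, 2, 1],
    [0, 1, 2, 2, 1, 0],
    [0, 1, 2, 0, 1, 2],
]
print(prune(toy, window=2, step=1))  # [0, 1, 2, 3]: the duplicates never share a window
print(prune(toy, window=4, step=1))  # [0, 1, 2]: a wider window catches the duplicate
```

The toy run shows the point of the window size: two correlated variants are only compared if they fall inside the same window, so with very dense markers a small window may miss most of the LD.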

0

Which array data is it?

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

0

Both data sets are imputed. A has 38 million markers and B has 28 million. Together they have about 64 million markers. Both were genotyped on Illumina arrays: A on the Illumina Infinium GSA and B on a slightly older version; I'm not sure which one.

ADD REPLY • link 5.1 years ago by doodle ▴ 30

0

Illumina has many arrays of differing genotype densities.

Look at it this way: if your SNPs are spaced 100 kilobases apart across the genome, then there is not much utility in using --indep-pairwise because the SNPs are already sparsely distributed. The idea of --indep-pairwise is to prune SNPs that are in linkage disequilibrium with one another.

Another thing that you can look at is the MAF of your variants. You may want to remove rare variants, as these, by definition, will not be present in many samples and thus add minimal information to the type of analysis that you want to do.

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k
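
As a small illustration of the MAF suggestion, the sketch below computes the minor allele frequency for each variant and keeps only those above a threshold. The function names, the variant IDs, and the 0.05 cut-off are assumptions chosen for the example, not values from this thread; genotypes are assumed to be 0/1/2 alternate-allele counts with `None` for a missing call. In PLINK itself the equivalent filter is the --maf flag.

```python
def minor_allele_frequency(genotypes):
    """MAF of one biallelic variant from 0/1/2 counts; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0  # no usable calls: treat as uninformative
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(variants, min_maf=0.05):
    """Keep variants whose MAF is at least min_maf.

    variants: dict mapping a variant ID to its list of genotypes.
    """
    return {
        vid: g
        for vid, g in variants.items()
        if minor_allele_frequency(g) >= min_maf
    }


# Hypothetical example: one common variant, one seen in a single sample.
example = {
    "var_common": [0, 1, 2, 1] * 5,
    "var_rare": [0] * 19 + [1],
}
print(sorted(filter_by_maf(example)))  # ['var_common']
```

A variant carried by one sample out of twenty contributes almost nothing to the principal components, which is why dropping it rarely changes the picture.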

1

Thank you so much Kevin! An --indep-pairwise cutoff of 1000 5 0.2 worked!

Sorry, I couldn't reply yesterday due to the messaging limit for new users.

ADD REPLY • link 5.1 years ago by doodle ▴ 30
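
A rough back-of-the-envelope calculation suggests why the larger window may have helped. The sketch below assumes human data, a genome length of about 3.1 billion base pairs, and evenly spaced markers; none of these is stated in the thread, and `window_span_kb` is a hypothetical helper written for this example. It converts a window counted in variants into the approximate physical distance it covers.

```python
GENOME_LENGTH_BP = 3.1e9  # assumption: approximate human genome length


def window_span_kb(n_markers, window_variants, genome_length_bp=GENOME_LENGTH_BP):
    """Approximate physical span (kb) of a variant-count window.

    Assumes markers are spread evenly along the genome, which real data are not.
    """
    mean_spacing_bp = genome_length_bp / n_markers
    return window_variants * mean_spacing_bp / 1000


# Marker counts quoted in the thread for data sets A and B.
for label, n_markers in (("A", 38e6), ("B", 28e6)):
    for window in (50, 1000):
        span = window_span_kb(n_markers, window)
        print(f"{label}: window of {window} variants ~ {span:.1f} kb")
```

Under these assumptions, a 50-variant window on 38 million markers spans only about 4 kb, whereas a 1000-variant window spans roughly 80 kb. With imputed data this dense, the default-sized window probably compared each variant with too few of its neighbours to remove much of the LD.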