Question: Post PCA analysis
0
6 months ago
charvinangia • 30

Hello,

I have data from two different populations, A and B. I merged the two data sets (unpruned A and LD-pruned B) and ran a PCA, which gave two distinct clusters. Next, I'm supposed to identify the principal components that separate A and B, then take the A-selective principal components and run a GWAS on them. How do I go about doing that?

Any suggestions please?

Thank you!

ADD COMMENTlink 6 months ago charvinangia • 30 • updated 6 months ago Kevin Blighe 43k
2
6 months ago
shawn.w.foley • 670
USA

How did you generate the PCA? What do you mean by "unpruned A and LD pruned B"? If you're using different filtering on your A and B populations, that could be driving your association.

From experience, I'd warn you to remember that these are associations/correlations and can mislead you. I had a beautiful PCA that separated populations 1 and 2, only to discover that it was driven by X- and Y-linked genes because I had an overrepresentation of females in one population. So just be careful how much you infer from these data.

ADD COMMENTlink 6 months ago shawn.w.foley • 670
0

Data set B was pruned for linkage disequilibrium using the --indep-pairwise option in PLINK. However, when I pruned A, most of the SNPs were removed. Hence, I merged the whole of A (without pruning) with the pruned B and ran a PCA. My aim was to merge A and B (which are very different populations) and do a PCA on them. I ran the PCA using the --pca option in PLINK.
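One way to avoid filtering A and B differently would be to merge first and then LD-prune the merged set before the PCA. The sketch below only builds the PLINK 1.9 command lines as Python lists and does not run them. The file prefixes (`popA`, `popB`, `study`) and the pruning parameters (50 SNP window, step 5, r² 0.2) are illustrative assumptions, not values from this thread, and real data may need extra steps (for example, resolving strand or allele mismatches) before a merge succeeds.

```python
def plink_merge_prune_pca(prefix_a, prefix_b, out_prefix,
                          n_pcs=10, window=50, step=5, r2=0.2):
    """Build (not run) PLINK 1.9 commands: merge, LD-prune, then PCA.

    All prefixes and parameter defaults are hypothetical placeholders.
    """
    merged = f"{out_prefix}_merged"
    pruned = f"{out_prefix}_pruned"
    return [
        # 1. Merge the two binary filesets into one.
        ["plink", "--bfile", prefix_a,
         "--bmerge", f"{prefix_b}.bed", f"{prefix_b}.bim", f"{prefix_b}.fam",
         "--make-bed", "--out", merged],
        # 2. LD-prune the merged set; PLINK writes <pruned>.prune.in
        #    (SNPs to keep) and <pruned>.prune.out (SNPs removed).
        ["plink", "--bfile", merged,
         "--indep-pairwise", str(window), str(step), str(r2),
         "--out", pruned],
        # 3. Run the PCA on the retained SNPs only.
        ["plink", "--bfile", merged,
         "--extract", f"{pruned}.prune.in",
         "--pca", str(n_pcs), "--out", f"{out_prefix}_pca"],
    ]


commands = plink_merge_prune_pca("popA", "popB", "study")
for command in commands:
    print(" ".join(command))
```

Each list could be passed to `subprocess.run` once the prefixes point at real filesets.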

What I also want to know is: what does it mean to run a GWAS on the principal components? Does it mean that I use the PCs as covariates in the GWAS?

ADD REPLYlink 6 months ago
charvinangia
• 30
2

You probably mean that you want to adjust your test statistics for population stratification via the inclusion of PCs (principal components) as covariates in your design formula. You should first check if the populations are segregated on PCA bi-plots and, if so, which PCs are segregating them.
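Besides looking at bi-plots, the same check can be done numerically. The following is a minimal sketch on simulated genotypes, assuming two hypothetical populations labelled "A" and "B" with drifted allele frequencies. The simulation settings and function names are made up for illustration. For each PC it reports the difference between the group means in units of pooled standard deviation, so a large value flags a PC that segregates the populations.

```python
import numpy as np


def simulate_two_populations(n_per_pop=50, n_snps=1000, seed=0):
    """Simulate 0/1/2 genotypes for two hypothetical populations."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_snps)
    # Illustrative drift of B's allele frequencies away from A's.
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
    geno_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    geno_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    genotypes = np.vstack([geno_a, geno_b]).astype(float)
    labels = np.array(["A"] * n_per_pop + ["B"] * n_per_pop)
    return genotypes, labels


def pc_scores(genotypes, n_pcs=5):
    """PC scores from an SVD of the column-centred genotype matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def separation_per_pc(scores, labels):
    """Absolute difference of group means per PC, in pooled-SD units."""
    a = scores[labels == "A"]
    b = scores[labels == "B"]
    pooled_sd = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_sd


genotypes, labels = simulate_two_populations()
scores = pc_scores(genotypes)
separation = separation_per_pc(scores, labels)
print(np.round(separation, 2))
```

With real data, the scores would come from the PCA output (for example the PLINK eigenvector file) rather than from a simulation.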

I actually used PCA to predict ethnicity previously, with very high sensitivity/specificity on 1000 Genomes populations: A: How to predict individual ethnicity information by using hapmap data

ADD REPLYlink 6 months ago
Kevin Blighe
43k
0

Thanks Kevin! I have one more question, which may sound a bit silly: does it make sense to use a principal component as a phenotype in a GWAS?

ADD REPLYlink 6 months ago
charvinangia
• 30
1

People do use principal components as 'phenotypes', sometimes. So, you can use it if you wish. PCs are uncorrelated, which can help in the context of a regression model. Here is the proof of this: C: PCA in a RNA seq analysis
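The claim that PCs are uncorrelated can also be checked numerically. Here is a small sketch on made-up correlated data (the data and dimensions are assumptions, chosen only for illustration): the input variables are correlated with each other, while the PC scores from the same data have pairwise correlations that are zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 samples x 6 variables that share 2 latent factors,
# so the input columns are correlated with each other.
latent = rng.normal(size=(200, 2))
data = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(200, 6))

# PCA via SVD of the column-centred matrix; scores = U * singular values.
centered = data - data.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
scores = u * s

input_corr = np.corrcoef(data, rowvar=False)
score_corr = np.corrcoef(scores, rowvar=False)

# Largest absolute off-diagonal correlation before and after PCA.
max_offdiag_input = np.abs(input_corr - np.eye(6)).max()
max_offdiag_scores = np.abs(score_corr - np.eye(6)).max()

print("max |corr| between input variables:", max_offdiag_input)
print("max |corr| between PC scores:      ", max_offdiag_scores)
```

This is why several PCs can sit side by side in a regression model without introducing collinearity among themselves.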

It may help you to first investigate which PCs are of interest by investigating bi-plots.

I had a package recently accepted to Bioconductor, too, but it is not yet officially released: https://github.com/kevinblighe/PCAtools

ADD REPLYlink 6 months ago
Kevin Blighe
43k
1

Thank you once again Kevin!

ADD REPLYlink 6 months ago
charvinangia
• 30
0

If I run a GWAS on the 2 merged populations A and B, using the first principal component as a phenotype, will that tell me which principal component belongs to which population?

ADD REPLYlink 6 months ago
charvinangia
• 30
1

No, an inspection of the bi-plot will tell you that. See how I do it at the end of this tutorial: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

Looking at the bi-plots for PC1 vs PC2 (left) and PC1 vs PC3 (right), I can say, for example, that PC1 segregates the African population from the other populations. PC2 segregates the East-Asians from all other populations. PC3 segregates South Asians from all other populations.

[Figure: bi-plots of PC1 vs PC2 (left) and PC1 vs PC3 (right)]

ADD REPLYlink 6 months ago
Kevin Blighe
43k
1

Beautiful! Thank you so much Kevin! It has been of immense help.

ADD REPLYlink 6 months ago
charvinangia
• 30
0

So I did a GWAS using the first principal component as a phenotype and drew a Manhattan plot. I got a few hundred hits (passing the 5e-8 genome-wide significance threshold). Is that normal? My next step would be to use the top hits as covariates for the GWAS on my main phenotype.

ADD REPLYlink 6 months ago
charvinangia
• 30
1

One usually includes the PCs as covariates for the purpose of adjusting for population stratification in your sample cohort. I was not aware that you are simply using the PCs as the main phenotype of interest.

For example, if I had samples from Ireland, France, Italy, and the UK and I were studying haemochromatosis, my model might be:

HaemochromatosisStatus ~ SNP + PC1 + PC2

I include the PCs for the purpose of adjusting for likely natural differences between my populations, but my main interest is haemochromatosis status.
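To illustrate why the PCs go into the model, here is a minimal simulation sketch. Everything in it is an assumption for illustration: two hypothetical populations, a quantitative trait that depends only on population membership, and a test SNP with no real effect but different allele frequencies in the two groups. It uses ordinary least squares for simplicity, whereas a binary status such as the one in the formula above would normally be fitted with logistic regression.

```python
import numpy as np


def ols_t_statistic(y, design, column):
    """t-statistic of one coefficient in an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = design.shape[0] - design.shape[1]
    sigma2 = residuals @ residuals / dof
    covariance = sigma2 * np.linalg.inv(design.T @ design)
    return beta[column] / np.sqrt(covariance[column, column])


rng = np.random.default_rng(2)
n_per_pop, n_snps = 200, 1000
population = np.repeat([0, 1], n_per_pop)  # 0 = population A, 1 = population B

# Genome-wide SNPs with population-specific allele frequencies (simulated).
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
freqs = np.where(population[:, None] == 0, freq_a, freq_b)
genotypes = rng.binomial(2, freqs).astype(float)

# First two PCs of the column-centred genotype matrix.
centered = genotypes - genotypes.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :2] * s[:2]

# A test SNP with NO causal effect, but differing in frequency by population.
test_snp = rng.binomial(2, np.where(population == 0, 0.2, 0.7)).astype(float)
# The trait depends on population only (e.g. environment), not on the SNP.
trait = 2.0 * population + rng.normal(size=population.size)

intercept = np.ones_like(trait)
t_unadjusted = ols_t_statistic(
    trait, np.column_stack([intercept, test_snp]), column=1)      # trait ~ SNP
t_adjusted = ols_t_statistic(
    trait, np.column_stack([intercept, test_snp, pcs]), column=1)  # trait ~ SNP + PC1 + PC2

print("t for SNP, unadjusted:  ", t_unadjusted)
print("t for SNP, PC-adjusted: ", t_adjusted)
```

The unadjusted test picks up the population difference through the SNP, while the PC-adjusted test for the same SNP is much weaker.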

What is the overall aim of your study?

ADD REPLYlink 6 months ago
Kevin Blighe
43k
0

The overall aim is to identify age at diagnosis of diabetes in 2 populations.

  1. I did a GWAS for age at diagnosis in the first population.
  2. Next, I merged the data of the 2 populations and did a PCA on them. I got 2 different clusters.
  3. Next, I was asked to run a GWAS using the first PC as a phenotype.
  4. The top hits of the above GWAS should then be run for age at diagnosis.

I do not quite understand the context from step 3 onwards.

ADD REPLYlink 6 months ago
charvinangia
• 30
1

If you do Step 1 on both populations, do the results differ?

Step 3 is likely related to the point that I was making, i.e., after you have merged the populations together, include PC1 (and/or any other PCs along which your populations are segregated on a bi-plot) as a covariate in order to adjust for the effects of population stratification.

So, you would have 3 sets of results:

  1. GWAS hits for first population (Diabetes ~ SNP + age)
  2. GWAS hits for second population (Diabetes ~ SNP + age)
  3. GWAS hits for populations combined (population effect adjusted by including PCs as covariates) (Diabetes ~ SNP + age + PC1)

I would compare these sets and report the comparison back to your supervisor / collaborator.

Note that the formulae that I list above test for diabetes status while adjusting for age. Your own formulae may differ.
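Comparing the three result sets can be as simple as set operations on the significant SNPs. A minimal sketch, assuming each analysis has been reduced to a mapping from SNP ID to p-value; the SNP IDs and p-values below are made up purely for illustration.

```python
GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold


def significant_snps(results, threshold=GENOME_WIDE_P):
    """Return the set of SNP IDs whose p-value is below the threshold."""
    return {snp for snp, p in results.items() if p < threshold}


def compare_hit_sets(named_results, threshold=GENOME_WIDE_P):
    """Hits per analysis, hits shared by all, and hits unique to each."""
    hits = {name: significant_snps(res, threshold)
            for name, res in named_results.items()}
    shared = set.intersection(*hits.values())
    unique = {}
    for name, own in hits.items():
        others = set().union(*(h for n, h in hits.items() if n != name))
        unique[name] = own - others
    return hits, shared, unique


# Hypothetical p-values for three analyses of the same SNPs.
example = {
    "population_A": {"snp1": 1e-9, "snp2": 3e-8, "snp3": 0.2},
    "population_B": {"snp1": 2e-10, "snp2": 0.4, "snp3": 1e-8},
    "combined_with_PCs": {"snp1": 1e-12, "snp2": 1e-3, "snp3": 0.5},
}
hits, shared, unique = compare_hit_sets(example)
print("shared by all analyses:", shared)
print("unique to each analysis:", unique)
```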

ADD REPLYlink 6 months ago
Kevin Blighe
43k
0

Yes, the results are different.

Okay, will try out your step 3.

Thanks!

ADD REPLYlink 6 months ago
charvinangia
• 30
0

Hi Kevin, I'm still a little stuck with this work.

Can I please know what you meant when you wrote "People do use principal components as 'phenotypes', sometimes. So, you can use it if you wish. PCs are uncorrelated, which can help in the context of a regression model" in the comments above?

Just to rewind a bit: I have 2 populations, A and B. I did a GWAS on age at diagnosis of diabetes in population A. Then I merged A and B, did a PCA on them, used the first principal component as a phenotype (without any covariates), and ran a GWAS on the merged populations. This gave me a list of SNPs that differ between the 2 populations (I got about 0.1 million hits here; is that normal?). I then took the SNPs from the top hits (the 0.1 million) and ran a GWAS on them for age at diagnosis in population A. I did not get any hits this time.

A few questions:

Does this experiment validate the results of my original gwas on population A for age at diagnosis before I did the PCA?

Does this show the genetic reasons for different age at diagnosis for diabetes in the 2 populations? (Literature shows that population A gets diabetes at a lower age than B).

If not the above, what does it all mean?

I'm quite confused now.

ADD REPLYlink 6 months ago
charvinangia
• 30
0

Hey, what is your actual model / design formula? I do not believe you ever stated it clearly in this thread. It does not make much sense to be testing your SNPs against just the PC, which I believe you may have done.
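A rough simulation sketch of why testing SNPs against just the PC returns so many hits: when PC1 tracks population membership, every SNP whose allele frequency differs between the populations is associated with it. The simulated populations, sample sizes, and function name below are assumptions for illustration, and the p-values use a normal approximation to the t-distribution.

```python
import math

import numpy as np


def gwas_hit_count(genotypes, phenotype, threshold=5e-8):
    """Count SNPs passing the threshold in per-SNP simple linear regression.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and a two-sided normal
    approximation for the p-value (reasonable for large n).
    """
    n = genotypes.shape[0]
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    r = (g.T @ y) / np.sqrt((g ** 2).sum(axis=0) * (y ** 2).sum())
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])
    return int((p < threshold).sum())


rng = np.random.default_rng(3)
n_per_pop, n_snps = 200, 1000
population = np.repeat([0, 1], n_per_pop)

# Two hypothetical populations with drifted allele frequencies.
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
freqs = np.where(population[:, None] == 0, freq_a, freq_b)
genotypes = rng.binomial(2, freqs).astype(float)

# PC1 of the merged genotypes, used as the 'phenotype'.
centered = genotypes - genotypes.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pc1 = u[:, 0] * s[0]

# A phenotype unrelated to genotype, for comparison.
random_trait = rng.normal(size=population.size)

hits_pc1 = gwas_hit_count(genotypes, pc1)
hits_random = gwas_hit_count(genotypes, random_trait)
print("hits with PC1 as phenotype:   ", hits_pc1, "of", n_snps)
print("hits with an unrelated trait: ", hits_random, "of", n_snps)
```

Under these assumptions the hits mostly reflect allele-frequency differences between the groups rather than anything about the trait of interest.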

We include PCs as 'phenotypes' (it is more correct to call them 'covariates') in GWAS in order to control for population stratification.

It would help to show the exact code that you are running. Sometimes, between code and written text, much information can become lost or confusing.

By the way, if the literature shows that population A gets diabetes at a lower age than B, then do you even need to combine these together? It may be more intuitive to process them separately and just compare results, e.g., perform a meta-analysis.
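For the meta-analysis route, a common starting point is an inverse-variance weighted fixed-effect model applied per SNP. The following is a minimal sketch, assuming each population's GWAS reports an effect size and standard error for the same SNP and the same effect allele. The numbers are invented for illustration, and Cochran's Q is included because the two populations may well be heterogeneous.

```python
import math


def fixed_effect_meta(estimates):
    """Inverse-variance weighted fixed-effect meta-analysis for one SNP.

    estimates: list of (beta, standard_error) pairs, one per population,
    all aligned to the same effect allele.
    Returns (pooled_beta, pooled_se, z, two_sided_p, cochran_q).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    cochran_q = sum(w * (b - pooled_beta) ** 2
                    for w, (b, _) in zip(weights, estimates))
    return pooled_beta, pooled_se, z, p_value, cochran_q


# Hypothetical per-population results for one SNP: (beta, standard error).
beta, se, z, p, q = fixed_effect_meta([(0.5, 0.1), (0.3, 0.1)])
print(f"pooled beta={beta:.3f}, se={se:.4f}, z={z:.2f}, p={p:.2e}, Q={q:.2f}")
```

A large Q relative to its degrees of freedom (number of studies minus one) would suggest the effect differs between populations, in which case a random-effects model may be more appropriate.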

ADD REPLYlink 6 months ago
Kevin Blighe
43k
