Question

Whole exome sequencing data, rare variants and QQ-plots

7

7.9 years ago

felejohs ▴ 70

Hi

I'm having a problem with a whole-exome-sequenced dataset of about 400 human subjects: 200 cases with a certain disease and 200 controls without. The dataset has been through rigorous quality control (standardised QC in plink with HWE, IBD, missingness, sex-check ++, along with HapMap population stratification and Eigenstrat/PCA analysis). I'm using plink to do a basic association analysis of all variants between cases and controls, and while the resulting QQ-plot for the common (MAF > 0.01) variants is OK, the plot for the rare (MAF < 0.01) variants is less so. Below are the three QQ-plots for all, common and rare variants, along with their lambda values:

QQ-plot all variants, lambda 1.83 http://postimg.org/image/oyr53wchr/
QQ-plot common variants, lambda 1.04 http://postimg.org/image/6lqjtc20v/
QQ-plot rare variants, lambda 2.43 http://postimg.org/image/ic4haputb/

The main problem seems to be the positive deviation (observed > expected) of the rare variants in the first part of the plot, which makes lambda very large both for the rare-variant QQ-plot and for the all-variant one. I am wondering what could cause this behaviour in the rare variants, and what implications it has for the prospects of analysing rare and common variants together.
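For reference, the lambda values quoted here can be recomputed from a list of p-values. This is a minimal sketch, assuming lambda is the usual genomic inflation factor (the median observed 1-df chi-square statistic divided by its expected median under the null, about 0.455) and that the p-values come from a 1-df test. The function name `genomic_inflation` is hypothetical and not part of plink.

```python
from statistics import NormalDist, median

_NORM = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = _NORM.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Return lambda = median(observed 1-df chi-square) / expected median."""
    # Convert each two-sided p-value back to a 1-df chi-square statistic.
    # Using inv_cdf(p / 2) keeps precision for very small p-values.
    stats = [_NORM.inv_cdf(p / 2.0) ** 2 for p in p_values if 0.0 < p <= 1.0]
    return median(stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    n = 1001
    uniform = [(i - 0.5) / n for i in range(1, n + 1)]  # idealised null p-values
    print(round(genomic_inflation(uniform), 3))
    print(round(genomic_inflation([p * p for p in uniform]), 3))  # shifted toward 0
```

Because lambda depends only on the median statistic, it is sensitive to whatever value the bulk of the variants take, not to the tail of the plot.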

I would be grateful if anybody has any experience in these matters and could provide some input.

Many thanks.

QQ-plot whole exome sequencing WES • 4.1k views

ADD COMMENT • link updated 5.6 years ago by zx8754 11k • written 7.9 years ago by felejohs ▴ 70

3

7.9 years ago

LauferVA 4.2k

Principal components are good at catching large-scale differences in population structure, but much less good at catching fine-scale differences between populations. For reasons rooted in population genetics and natural selection, these fine-scale differences tend to be captured in rare variation more than in common variation. As a result, your pipeline might have done a lot to control for stratification in common variants, but much less to control for confounding introduced through rare variants.

Please see these papers:

These two papers will provide a good starting point for understanding what controlling for differences in population structure with PCA might miss.

If you have further questions, please let me know.

ADD COMMENT • link updated 5.6 years ago by zx8754 11k • written 7.9 years ago by LauferVA 4.2k

0

Hi

Thank you for your answer. That is very interesting. My understanding is that most QC methods and protocols have been developed for common variants (GWAS), so I've always been curious how they cope with rare variants in exome data. If I'm reading these papers correctly, they used a varying number of principal components to control for these subpopulations in the data.

As I understand it, Eigenstrat was originally developed for GWAS. Would using PCs generated by Eigenstrat as covariates work for my data to help control for these spurious associations?

ADD REPLY • link 7.9 years ago by felejohs ▴ 70

1

Hello! If there are in fact fine-scale population structure differences, PCs will not catch them, so confounding could potentially slip in. I'd start with the suggestion below.

ADD REPLY • link 7.9 years ago by LauferVA 4.2k

zx8754 · Accepted Answer · 2016-06-01

3

7.9 years ago

Lemire ▴ 940

QC likely has nothing to do with your problem. Here are two things you need to think about:

You're using a 2-df test (genotypic?) for rare variants even though some of your cell counts are likely to be 0.
You're using a QQ-plot designed to assess the distribution of a continuous variable, when your test statistic can take only a finite number of possible values because of the small cell counts. That makes the QQ-plot uninterpretable.

Don't overthink the QQ-plot in that case.
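The discreteness point can be made concrete by enumerating every p-value the test is able to return for a rare variant. This is a minimal sketch under stated assumptions: a 1-df allelic Pearson chi-square on a 2x2 table of allele counts without continuity correction (the thread does not say exactly which plink test was run), and 200 cases plus 200 controls, i.e. 400 alleles per group. The function names are hypothetical.

```python
from math import erfc, sqrt


def chi2_pvalue_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) p-value for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence either way
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a 1-df chi-square: P(X > stat) = erfc(sqrt(stat / 2)).
    return erfc(sqrt(stat / 2.0))


def attainable_pvalues(minor_total, case_alleles=400, control_alleles=400):
    """All distinct p-values possible when minor_total minor alleles are observed."""
    values = set()
    lowest = max(0, minor_total - control_alleles)
    highest = min(minor_total, case_alleles)
    for in_cases in range(lowest, highest + 1):
        in_controls = minor_total - in_cases
        p = chi2_pvalue_2x2(in_cases, case_alleles - in_cases,
                            in_controls, control_alleles - in_controls)
        values.add(round(p, 12))
    return sorted(values)


if __name__ == "__main__":
    for k in (1, 2, 3, 5):
        print(k, attainable_pvalues(k))
```

Under these assumptions a singleton (one minor allele in the whole sample) can only ever produce a p-value of about 0.317, whichever group it falls in. If singletons dominated the rare set, the median chi-square statistic would sit near 1.0 instead of 0.455, giving a lambda of roughly 2.2 with no confounding at all.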

ADD COMMENT • link updated 5.6 years ago by zx8754 11k • written 7.9 years ago by Lemire ▴ 940

0

Thank you for your input. Is there another way to verify the quality of the rare variants? In other words, if the QC produces good QQ-plots for the common variants, would you be satisfied and move forward even though your planned analysis relies heavily on rare variants (collapsing rare variants on genes and pathways)?

ADD REPLY • link 7.9 years ago by felejohs ▴ 70

zx8754 · Accepted Answer · 2016-06-02

2

7.9 years ago

LauferVA 4.2k

Personally, I would generate p-values using an exact test and see whether there is still inflation.
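To illustrate what an exact test does with sparse counts, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of allele counts, written with the standard library only. It assumes the common "sum all tables no more probable than the observed one" definition of the two-sided p-value. The function name is hypothetical, and in practice plink's own `--fisher` option would be the natural way to run this on the full dataset.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


if __name__ == "__main__":
    # Singleton seen in cases: 1 minor / 399 major alleles vs 0 / 400 in controls.
    print(fisher_exact_two_sided(1, 399, 0, 400))
    # Five minor alleles, all in cases.
    print(fisher_exact_two_sided(5, 395, 0, 400))
```

For the singleton table the exact p-value is 1, whereas the 1-df chi-square approximation gives about 0.317 for the same counts. Exact tests on sparse tables tend to be conservative, so a lambda somewhat below 1 would not be surprising.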

ADD COMMENT • link updated 5.6 years ago by zx8754 11k • written 7.9 years ago by LauferVA 4.2k

0

Using Fisher's exact test, the inflation disappeared. Thank you everybody, you've been very helpful.

ADD REPLY • link 7.9 years ago by felejohs ▴ 70

0

That's great news! Thanks for letting us know. Glad you did not have to battle any of the fine-ancestry problems I had to deal with.

ADD REPLY • link 7.9 years ago by LauferVA 4.2k