PLINK heterozygosity - Negative F statistic?

Hi there,

I am running QC on GWAS data from 48 samples that we are planning to use for linkage analysis. I have run the --het command in PLINK to check for excess heterozygosity and/or consanguinity.

Every sample is getting a negative F statistic, indicating more heterozygosity than expected. But I'm wondering: how much heterozygosity is too much? Is there a threshold that would indicate sample contamination, or is it judged solely on the distribution among the 48 samples?

Here is an example of the --het results file for a subset of the samples:

IID O(HOM)  E(HOM)  N(NM)   F
1   292543  295900  388776  -0.03604
2   299349  302200  396946  -0.03016
3   298893  302000  396663  -0.03272
4   299188  302600  397491  -0.03591
5   298827  302200  396937  -0.0354
6   274894  283200  372565  -0.09318
7   298750  302500  397353  -0.03951
8   298737  302300  397082  -0.03761
9   299640  302600  397511  -0.03138
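
For context, as far as I understand, F in this output is PLINK's method-of-moments estimate, F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM)). A quick Python sketch recomputing it from the columns above (the results differ slightly from the reported F because E(HOM) is rounded in the display):

    # Recompute PLINK's --het F from the table columns above, assuming the
    # method-of-moments formula F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM)).
    rows = [
        ("1", 292543, 295900, 388776),  # reported F = -0.03604
        ("6", 274894, 283200, 372565),  # reported F = -0.09318
    ]
    for iid, o_hom, e_hom, n_nm in rows:
        f = (o_hom - e_hom) / (n_nm - e_hom)
        print(f"IID {iid}: F = {f:.5f}")
    # IID 1: F = -0.03614
    # IID 6: F = -0.09294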

Any help would be greatly appreciated!

Thanks,

Caragh

plink heterozygosity gwas
Ram:

AFAIK these are negligible negatives. Unless I'm mistaken, it is strongly negative values that would suggest contamination, rather than genuinely excessive heterozygosity.


Thanks for your help, Ram!


According to this link:

"The estimate of F can sometimes be negative. Often this will just reflect random sampling error, but a result that is strongly negative (i.e. an individual has fewer homozygotes than one would expect by chance at the genome-wide level) can reflect other factors, e.g. sample contamination events perhaps."

Ram, when you said these negative values are "negligible", I assume you meant that those individuals would not need to be removed.

However, I feel it would be helpful if you could provide a more objective definition of negligible (i.e. provide numbers). At what value do these negative scores stop being "negligible" (e.g. -0.5, -1.8, something else) and warrant removal?
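
For example, one heuristic I have seen in GWAS QC protocols is to flag samples whose F lies more than 3 SD from the cohort mean, rather than using an absolute cutoff. A rough sketch, assuming a standard --het output file (hypothetically named plink.het) with the usual FID IID O(HOM) E(HOM) N(NM) F columns:

    # Flag samples whose F is more than 3 SD from the cohort mean; a common
    # GWAS QC heuristic, not a universal standard. "plink.het" is a
    # hypothetical --het output file name.
    import statistics

    samples = []
    with open("plink.het") as fh:
        header = fh.readline().split()
        iid_i, f_i = header.index("IID"), header.index("F")
        for line in fh:
            fields = line.split()
            if fields:
                samples.append((fields[iid_i], float(fields[f_i])))

    mean_f = statistics.mean(f for _, f in samples)
    sd_f = statistics.stdev(f for _, f in samples)
    for iid, f in samples:
        if abs(f - mean_f) > 3 * sd_f:
            print(f"{iid}\tF = {f:.5f}\t(outside mean +/- 3 SD)")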


That's a wonderful question, and unfortunately I do not have a strong reason for picking a definite threshold. Part of it is owing to what we saw in the majority of the samples we sequenced, and part of it was simply practice that was carried over. We used a threshold of around -0.2. Heavily contaminated samples would drop out at other steps of the pipeline anyway, so by the time we got to measuring the F statistic one would rarely see values more negative than -0.25; the pipeline discarded samples at sample prep, sequencing, and other QC stages before the F statistic was computed. For example, we would filter down to a high-quality set of variants that contributed to the F statistic (and other such statistics), so any underlying condition affecting a sample would already show up strongly in what we saw.

All said and done, it was still subjective to an extent: it was what worked for our consortium.
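
If it helps, applying that kind of fixed cutoff is straightforward. A sketch that writes any sample below a chosen threshold to a list usable with plink --remove (file names are hypothetical, and -0.2 is just the value we happened to use):

    # Write samples with F below a fixed cutoff to a removal list, usable as:
    #   plink --bfile mydata --remove het_fail.txt --make-bed --out mydata_qc
    # THRESHOLD = -0.2 follows the (subjective) value mentioned above.
    THRESHOLD = -0.2

    with open("plink.het") as fh, open("het_fail.txt", "w") as out:
        header = fh.readline().split()
        fid_i, iid_i, f_i = (header.index(c) for c in ("FID", "IID", "F"))
        for line in fh:
            fields = line.split()
            if fields and float(fields[f_i]) < THRESHOLD:
                # plink --remove expects one "FID IID" pair per line
                out.write(f"{fields[fid_i]} {fields[iid_i]}\n")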


Thank you for that detailed response.

It makes sense that samples suffering from contamination would drop out at other (upstream) steps of the QC pipeline.

Furthermore, many of the things we do in bioinformatics analysis can be subjective in the sense that they are practices carried over (within a lab, consortium, or sub-field) rather than objective, benchmarked standards, but they just work.
