Question: How are plink's hardcalls calculated?
1
Entering edit mode

According to plink documentation, a hardcall is saved when the distance from the nearest hardcall, defined as 0.5 * sum_i |x_i - round(x_i)| (where the x_i's are 0..2 allele dosages), is not greater than 0.1.

Truthfully, this doesn't make sense to me (especially why x_i is being summed) and was hoping if someone could help break down math.

Many thanks!

Entering edit mode
0

So lets say you had single dosages representing only 2 alleles.

Example 1:

A/T = 1.900

0.5* (|1.900-2| + |0.100 - 0|)

= 0.5 * (0.100 + 0.100)

= 0.5 * 0.200

= 0.100

As this is not greater than 0.1, we would code the dosage as 2 (for TT).

Example 2:

A/T = 0.980

0.5 * (|0.980 -1| + |1.02 - 1|)

=0.5 * (0.020 + 0.020)

=0.5 * 0.040

= 0.020

In this example, we would code it as A and T (heterogeneous).

My concern is how do you "decide" which alleles are present. Looking at the examples and my first example, it seems obvious. But I want to be able to program logic in a python script to check my understanding and compare my results against Plink.

JourneyToAbyss
• 70
Entering edit mode
1

Just look at when round(x_i) isn't zero.

chrchang523
5.2k
Entering edit mode
0

Clever. =)

Thank you

JourneyToAbyss
• 70
6
Entering edit mode

Suppose there are three alleles, with dosages 1.01, 0.92, and 0.07. Then the formula yields:

``````  0.5 * (|1.01 - 1| + |0.92 - 1| + |0.07 - 0|)
= 0.5 * (0.01 + 0.08 + 0.07)
= 0.5 * 0.16
= 0.08
``````

which is not greater than 0.1, so the nearest hardcall (one copy of the first allele and one copy of the second allele) is saved.

The formula is a little bit complicated because it has to generalize to multiallelic variants. It does exactly what youâ€™d expect when there are only two alleles, though.