CNT in PLINK allele score function issue
5.7 years ago

The objective is to calculate a genetic risk score from genotyping data, given the effect allele and effect size of each SNP. About 1.5 million SNPs are included in the .bim file, and all of them have variant IDs of the form Chr:BP, e.g., 1:12345. However, after I run the command

/projects/bsi/gentools/bin/plink2 --bfile GenotypingData --score ScoreFile header sum --threads 6

I have the results:

     FID   IID  PHENO      CNT     CNT2  SCORESUM
      01    01     -9  3067556  2438692     9.411
      02    02     -9  3067556  2440321   9.16466
      03    03     -9  3067556  2440784   9.50342
      04    04     -9  3067556  2443276    10.615

My question is why CNT does not match 1.5 million. From the PLINK documentation I know that CNT is the number of nonmissing alleles used for scoring. The genotyping data have been through QC, so it is very unlikely that so many alleles are missing.

PLINK SCORE • 2.9k views

What exactly is the problem here? CNT is about 1.5 million * 2 (the doubling is expected for diploid genomes).
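As a quick sanity check of the doubling, the CNT value from the output above can be halved to recover the number of SNPs that were actually scored. This is only a small arithmetic sketch; the variable names are made up for illustration, and it assumes every scored SNP is diploid and nonmissing for this sample.

```python
# CNT reported in the .profile output for sample 01 in the question.
cnt = 3067556

# Each nonmissing diploid genotype contributes two alleles to CNT,
# so the number of SNPs used for scoring is CNT / 2.
n_snps_scored = cnt // 2

print(n_snps_scored)  # 1533778, i.e. roughly 1.5 million SNPs
```

So the reported CNT is consistent with about 1.5 million SNPs being scored.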


Thanks for your answer. If CNT is multiplied by 2, then it makes sense. However, CNT2 is much smaller than CNT, so does that mean a portion of the SNPs are not named?


As explained, CNT is about 2 × #SNP. But how do I explain the discrepancy between CNT2 and either #SNP or 2 × #SNP? I checked several subjects, and the missing rate (SNPs from the score file that are not observed in the genotyping data) is very low. For example, of the 1.5 million SNPs of interest, only ~20 are not observed in the genotyping data of subject 1.

5.7 years ago

For people who may run into the same silly problem as I did, here is the answer.

Suppose your ScoreFile has a header and a single data line:

rsID effect_allele beta
1:2245570 G -0.0276009

and your GenotypingData.bim has a corresponding line (ignoring the ambiguous-strand problem here):

1 1:2245570 2245570 G C

So the ref allele (major allele) is the effect allele.

Suppose you have a patient with FID 16214852. In the resulting plink.profile file, you find this line (columns FID, IID, PHENO, CNT, CNT2, SCORESUM):

16214852 16214852 -9 2 0 0

That is, CNT2 = 0 and the score = 0.

Now let's figure out how PLINK gets this result. Recode the bfiles to a .raw file, then extract the row for patient 16214852 and the column for 1:2245570. I used awk for this, but you could equally use R or C++. The result is

1:2245570_C 2
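The lookup above was done with awk; a rough Python equivalent might look like the sketch below. It assumes the .raw file is the whitespace-delimited additive recoding (six leading columns FID, IID, PAT, MAT, SEX, PHENOTYPE, then one column per SNP named `<ID>_<counted allele>`). The function name `dosage_from_raw` and the toy file contents (including the PAT, MAT and SEX values) are made up for illustration.

```python
import io


def dosage_from_raw(handle, iid, snp_id):
    """Return (column_name, dosage_string) for one sample and one SNP.

    handle: open text handle on a .raw file (assumed additive recoding).
    iid:    the IID of the sample to look up (second column).
    snp_id: the variant ID without the trailing _<allele> suffix.
    """
    header = handle.readline().split()
    # SNP columns are named "<variant ID>_<counted allele>"; strip the suffix.
    matches = [i for i, name in enumerate(header)
               if name.rsplit("_", 1)[0] == snp_id]
    if not matches:
        raise KeyError(f"SNP {snp_id} not found in header")
    col = matches[0]
    for line in handle:
        fields = line.split()
        if fields and fields[1] == iid:
            # The value is kept as a string because it can be "NA" when missing.
            return header[col], fields[col]
    raise KeyError(f"IID {iid} not found")


# Toy .raw content mirroring the example in this answer.
raw_text = (
    "FID IID PAT MAT SEX PHENOTYPE 1:2245570_C\n"
    "16214852 16214852 0 0 0 -9 2\n"
)
print(dosage_from_raw(io.StringIO(raw_text), "16214852", "1:2245570"))
```

With a real file you would pass `open("yourfile.raw")` instead of the `io.StringIO` object. Reading line by line avoids loading 1.5 million columns for every sample at once.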

Therefore the dosage of the effect allele G is 0 (the patient carries two copies of C), which is why CNT2 = 0. In plain language, CNT2 is the sum, over all scored SNPs, of the dosage of the effect allele.
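To make the relationship between CNT, CNT2 and the score concrete, here is a minimal sketch of the bookkeeping for a single person. It is a simplified model and not PLINK's actual code: the function name `score_person` and the data layout are hypothetical, missing genotypes are simply skipped (PLINK's own handling of missing calls is not modeled), and the score is an unnormalized sum as with the `sum` modifier.

```python
def score_person(genotypes, score_rows):
    """Toy model of per-person allele scoring.

    genotypes:  dict mapping variant ID -> tuple of two alleles,
                or None when the call is missing.
    score_rows: list of (variant ID, named allele, weight) tuples,
                i.e. the rows of the score file.
    Returns (CNT, CNT2, SCORESUM).
    """
    cnt = 0          # nonmissing alleles used for scoring
    cnt2 = 0         # copies of the allele named in the score file
    score_sum = 0.0  # sum of weight * copies of the named allele
    for variant_id, named_allele, weight in score_rows:
        call = genotypes.get(variant_id)
        if call is None:
            continue  # assumption: missing calls contribute nothing here
        copies = sum(1 for allele in call if allele == named_allele)
        cnt += len(call)
        cnt2 += copies
        score_sum += copies * weight
    return cnt, cnt2, score_sum


score_rows = [("1:2245570", "G", -0.0276009)]

# The patient from the example: homozygous C/C, so zero copies of G.
print(score_person({"1:2245570": ("C", "C")}, score_rows))  # (2, 0, 0.0)

# A hypothetical heterozygous G/C carrier would get one copy of G.
print(score_person({"1:2245570": ("G", "C")}, score_rows))  # (2, 1, -0.0276009)
```

Under this model CNT2 can be anywhere between 0 and CNT, so a CNT2 well below CNT just means the person carries many copies of the non-named allele. It says nothing about missing SNPs.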


"Allele named in the --score file" is perfectly clear, and remains accurate when --score is used for PCA projection instead. "Effect allele" sacrifices accuracy in the latter case for no actual benefit.
