Question

Closed:Improving efficiency of a script for genotype scoring that enumerates all allele combinations

0

Entering edit mode

5.5 years ago

Volka ▴ 180

I have written a script in order to compute a score, as well as a frequency for a list of genetic profiles.

A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. The input file is something like below:

  (SNP1 SNP2 SNP3) # example header not in file
    AA   CC   TT
    AT   CC   TT
    TT   CC   TT
    AA   CG   TT
    AT   CG   TT
    TT   CG   TT
    AA   GG   TT
    AT   GG   TT
    TT   GG   TT
    AA   CC   TA
    AT   CC   TA
    TT   CC   TA
    AA   CG   TA
    AT   CG   TA
    TT   CG   TA
    AA   GG   TA
    AT   GG   TA
    TT   GG   TA
    AA   CC   AA
    AT   CC   AA
    TT   CC   AA
    AA   CG   AA
    AT   CG   AA
    TT   CG   AA
    AA   GG   AA
    AT   GG   AA
    TT   GG   AA

I then have the following code, which can take in the input file above and a table containing weights and frequencies, such as below:

(SNP        RiskAll  RefAll         OR            log(OR)    RiskAllFreq) # example header not in file
SNP1             A       T       1.25    0.223143551314     0.97273 
SNP2             C       G       1.07    0.0676586484738    0.3     
SNP3             T       A       1.08    0.0769610411361    0.1136

It then computes a score, based on the sum of log odds ratios for each risk allele for each SNP in a genetic profile, as well as a frequency based on multiplying the allele frequencies, assuming Hardy Weinberg equilibrium.

snp=[]
riskall={}
weights={}
popfreqs={}    # frequency of effect allele

# read in table containing SNP IDs, respective risk alleles, log(OR) and allele frequencies
with open(table, 'r') as f:
    for line in f:
        snp.append(line.split()[0])
        riskall[line.split()[0]]=line.split()[1]
        weights[line.split()[0]]=line.split()[4]
        popfreqs[line.split()[0]]=line.split()[5]


# compute scores and frequencies for each genetic profile
freqs={}
scores={}
with open(geneticprofiles, 'r') as f:
    for line in f:
        score=0
        freq=1
        for j in range(len(line.split())):
            allele1=line.split()[j][0]
            allele2=line.split()[j][1]
            if allele2 != riskall[snp[j]]:      # homozygous for reference allele
                score+=0
                freq*=(float(1-float(popfreqs[snp[j]]))*float(1-float(popfreqs[snp[j]])))
            elif allele1 != riskall[snp[j]] and allele2 == riskall[snp[j]]:  # heterozygous
                score+=(float(weights[snp[j]]))
                freq*=(2*(float(1-float(popfreqs[snp[j]]))*float(popfreqs[snp[j]])))
            elif allele1 == riskall[snp[j]]:      # homozygous for risk allele
                score+=2*(float(weights[snp[j]]))
                freq*=(float(popfreqs[snp[j]])*float(popfreqs[snp[j]]))


            if freq < float(sys.argv[3]):   # frequency threshold to break loop 
                break

        print(','.join(line.split()) + "\t" + str(score) + "\t" + str(freq))

So far, I have set a variable where I can specify a threshold to break the loop when the frequency gets extremely low, for instance, at approximately 1e-10. I am looking to ramp this up to include at least 20 SNPs. What improvements can be done to speed up the script? Right now, I am using a threshold of 1e-4, but six days in the script is still running.

efficiency loop frequency calculation • 793 views

ADD COMMENT • link updated 5.4 years ago by Michael 54k • written 5.5 years ago by Volka ▴ 180