I have written a script in order to compute a score, as well as a frequency for a list of genetic profiles.
A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. The input file is something like below:
(SNP1 SNP2 SNP3) # example header not in file
AA CC TT
AT CC TT
TT CC TT
AA CG TT
AT CG TT
TT CG TT
AA GG TT
AT GG TT
TT GG TT
AA CC TA
AT CC TA
TT CC TA
AA CG TA
AT CG TA
TT CG TA
AA GG TA
AT GG TA
TT GG TA
AA CC AA
AT CC AA
TT CC AA
AA CG AA
AT CG AA
TT CG AA
AA GG AA
AT GG AA
TT GG AA
I then have the following code, which can take in the input file above and a table containing weights and frequencies, such as below:
(SNP RiskAll RefAll OR log(OR) RiskAllFreq) # example header not in file
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
It then computes a score, based on the sum of log odds ratios for each risk allele for each SNP in a genetic profile, as well as a frequency based on multiplying the allele frequencies, assuming Hardy Weinberg equilibrium.
snp=[]
riskall={}
weights={}
popfreqs={} # frequency of effect allele
# read in table containing SNP IDs, respective risk alleles, log(OR) and allele frequencies
with open(table, 'r') as f:
for line in f:
snp.append(line.split()[0])
riskall[line.split()[0]]=line.split()[1]
weights[line.split()[0]]=line.split()[4]
popfreqs[line.split()[0]]=line.split()[5]
# compute scores and frequencies for each genetic profile
freqs={}
scores={}
with open(geneticprofiles, 'r') as f:
for line in f:
score=0
freq=1
for j in range(len(line.split())):
allele1=line.split()[j][0]
allele2=line.split()[j][1]
if allele2 != riskall[snp[j]]: # homozygous for reference allele
score+=0
freq*=(float(1-float(popfreqs[snp[j]]))*float(1-float(popfreqs[snp[j]])))
elif allele1 != riskall[snp[j]] and allele2 == riskall[snp[j]]: # heterozygous
score+=(float(weights[snp[j]]))
freq*=(2*(float(1-float(popfreqs[snp[j]]))*float(popfreqs[snp[j]])))
elif allele1 == riskall[snp[j]]: # homozygous for risk allele
score+=2*(float(weights[snp[j]]))
freq*=(float(popfreqs[snp[j]])*float(popfreqs[snp[j]]))
if freq < float(sys.argv[3]): # frequency threshold to break loop
break
print(','.join(line.split()) + "\t" + str(score) + "\t" + str(freq))
So far, I have set a variable where I can specify a threshold to break the loop when the frequency gets extremely low, for instance, at approximately 1e-10. I am looking to ramp this up to include at least 20 SNPs. What improvements can be done to speed up the script? Right now, I am using a threshold of 1e-4, but six days in the script is still running.