Question

Off topic:How can I utilize vectorization on my Pandas script for efficiency?

0

Entering edit mode

5.5 years ago

Volka ▴ 180

this is a continuation from my previous post, where I wanted a faster and more efficient alternative to a standard Python loop, which performs some summing and multiplication on elements of each row.

Basically, what I have are two file inputs. One is a list of all combinations for a group of SNPs, for example below for 3 SNPs:

    AA   CC   TT
    AT   CC   TT
    TT   CC   TT
    AA   CG   TT
    AT   CG   TT
    TT   CG   TT
    AA   GG   TT
    AT   GG   TT
    TT   GG   TT
    AA   CC   TA
    AT   CC   TA
    TT   CC   TA
    AA   CG   TA
    AT   CG   TA
    TT   CG   TA
    AA   GG   TA
    AT   GG   TA
    TT   GG   TA
    AA   CC   AA
    AT   CC   AA
    TT   CC   AA
    AA   CG   AA
    AT   CG   AA
    TT   CG   AA
    AA   GG   AA
    AT   GG   AA
    TT   GG   AA

And the second is a table, containing some information for each SNP, notably their log(OR) for a disease and the frequency of the risk allele:

SNP1             A       T       1.25    0.223143551314     0.97273 
SNP2             C       G       1.07    0.0676586484738    0.3     
SNP3             T       A       1.08    0.0769610411361    0.1136

Below is my main code, in which I am looking to calculate a 'score' and a 'frequency' for each 'profile. The score is the sum of log(ORs) for each risk allele present in the profile, while the frequency is the frequencies multiplied together, assuming Hardy Weinberg equilibrium:

import pandas as pd

numbers = pd.read_csv(table2, sep="\t", header=None)

combinations = pd.read_csv(table1, sep=" ", header=None)

def score_freq(line):
    score=0
    freq=1
    for j in range(len(line)):
        if line[j][1] != numbers.values[j][1]:   # homozygous for ref
            score+=0
            freq*=(float(1-float(numbers.values[j][6]))*float(1-float(numbers.values[j][6])))
        elif line[j][0] != numbers.values[j][1] and line[j][1] == numbers.values[j][1]: # heterozygous
            score+=(float(numbers.values[j][5]))
            freq*=(2*(float(1-float(numbers.values[j][6]))*float(numbers.values[j][6])))
        elif line[j][0] == numbers.values[j][1]:   # homozygous for risk
            score+=2*(float(numbers.values[j][5]))
            freq*=(float(numbers.values[j][6])*float(numbers.values[j][6]))

        if freq < 1e-05:   # threshold to stop loop in interest of efficiency 
            break

    return pd.Series([score, freq])

combinations[['score', 'freq']] = combinations.apply(lambda row: score_freq(row), axis=1)
#combinations[['score', 'freq']] = score_freq(combinations.values) # vectorization?

print(combinations)

I was referring to this site, where they go over the fastest way to loop over a Pandas dataframe. I have been able to use the Pandas apply method, but I am not sure how to perform the vectorization method over the Pandas series. Other than that, do suggest any way in which I can improve my script to make it more efficient, thanks!

pandas python vectorization • 1.2k views

ADD COMMENT • link 5.5 years ago by Volka ▴ 180