Off topic: How can I utilize vectorization on my Pandas script for efficiency?
5.5 years ago
Volka ▴ 180

This is a continuation of my previous post, where I was looking for a faster, more efficient alternative to a standard Python loop that performs some summing and multiplication on the elements of each row.

I have two input files. The first lists all genotype combinations for a group of SNPs, for example, for 3 SNPs:

    AA   CC   TT
    AT   CC   TT
    TT   CC   TT
    AA   CG   TT
    AT   CG   TT
    TT   CG   TT
    AA   GG   TT
    AT   GG   TT
    TT   GG   TT
    AA   CC   TA
    AT   CC   TA
    TT   CC   TA
    AA   CG   TA
    AT   CG   TA
    TT   CG   TA
    AA   GG   TA
    AT   GG   TA
    TT   GG   TA
    AA   CC   AA
    AT   CC   AA
    TT   CC   AA
    AA   CG   AA
    AT   CG   AA
    TT   CG   AA
    AA   GG   AA
    AT   GG   AA
    TT   GG   AA
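
For reference, a listing like the one above can be generated rather than written by hand. This is only a minimal sketch, assuming each SNP has exactly the three genotypes shown and that the first column should cycle fastest, as it does above; the name `genotypes_per_snp` is made up for illustration.

```python
from itertools import product

# Genotypes per SNP, in the column order of the listing above (illustrative name).
genotypes_per_snp = [
    ["AA", "AT", "TT"],
    ["CC", "CG", "GG"],
    ["TT", "TA", "AA"],
]

# itertools.product varies the LAST factor fastest, so feed the SNPs in
# reverse order and flip each tuple back so the first column cycles fastest.
profiles = [combo[::-1] for combo in product(*reversed(genotypes_per_snp))]

for profile in profiles[:4]:
    print("   ".join(profile))
```

With 3 SNPs this gives 3^3 = 27 profiles, which is why the number of rows grows so quickly as SNPs are added.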

The second is a table with information on each SNP, notably its log(OR) for a disease and the frequency of the risk allele:

    SNP1   A   T   1.25   0.223143551314    0.97273
    SNP2   C   G   1.07   0.0676586484738   0.3
    SNP3   T   A   1.08   0.0769610411361   0.1136

Below is my main code, in which I calculate a 'score' and a 'frequency' for each 'profile'. The score is the sum of the log(OR)s for each risk allele present in the profile, while the frequency is the product of the per-SNP genotype frequencies, assuming Hardy-Weinberg equilibrium:

    import pandas as pd

    # table1 and table2 are the paths to the two input files described above
    numbers = pd.read_csv(table2, sep="\t", header=None)

    combinations = pd.read_csv(table1, sep=" ", header=None)

    def score_freq(line):
        score = 0
        freq = 1
        for j in range(len(line)):
            risk_allele = numbers.values[j][1]
            log_or = float(numbers.values[j][5])
            risk_freq = float(numbers.values[j][6])

            if line[j][1] != risk_allele:   # homozygous for ref
                freq *= (1 - risk_freq) * (1 - risk_freq)
            elif line[j][0] != risk_allele and line[j][1] == risk_allele:   # heterozygous
                score += log_or
                freq *= 2 * (1 - risk_freq) * risk_freq
            elif line[j][0] == risk_allele:   # homozygous for risk
                score += 2 * log_or
                freq *= risk_freq * risk_freq

            if freq < 1e-05:   # threshold to stop the loop early, in the interest of efficiency
                break

        return pd.Series([score, freq])

    combinations[['score', 'freq']] = combinations.apply(lambda row: score_freq(row), axis=1)
    #combinations[['score', 'freq']] = score_freq(combinations.values) # vectorization?

    print(combinations)
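
To make the calculation concrete, here is a small sketch of the score and frequency for a single profile. It assumes the fields of the SNP table shown above are, in order, SNP, risk allele, other allele, OR, log(OR) and risk-allele frequency; that is a reading of the table, not something the file states. The helper name `profile_score_freq` is hypothetical, and the early-exit threshold is left out. Counting risk alleles in each genotype avoids depending on the order of the two letters.

```python
# Per-SNP values copied from the table above. The meaning of each field
# (risk allele, log(OR), risk-allele frequency) is an ASSUMPTION about the layout.
snps = [
    ("A", 0.223143551314, 0.97273),
    ("C", 0.0676586484738, 0.3),
    ("T", 0.0769610411361, 0.1136),
]

def profile_score_freq(profile):
    """Return (score, freq) for one profile such as ("AT", "CG", "TA")."""
    score, freq = 0.0, 1.0
    for genotype, (risk, log_or, p) in zip(profile, snps):
        n_risk = genotype.count(risk)  # 0, 1 or 2 copies of the risk allele
        score += n_risk * log_or
        # Hardy-Weinberg genotype frequencies: (1-p)^2, 2p(1-p), p^2
        freq *= [(1 - p) ** 2, 2 * p * (1 - p), p ** 2][n_risk]
    return score, freq

score, freq = profile_score_freq(("AT", "CG", "TA"))
print(score, freq)
```

Under these assumptions, a profile that is heterozygous at every SNP scores the sum of the three log(OR)s, and its frequency is the product of the three 2p(1-p) terms.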

I was referring to this site, which goes over the fastest ways to loop over a Pandas dataframe. I have been able to use the Pandas apply method, but I am not sure how to apply the vectorization method to a Pandas series. Beyond that, please suggest any other way I can make my script more efficient. Thanks!
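
One possible shape for a vectorized version is sketched below; it is an illustration under stated assumptions, not a tested drop-in replacement. It assumes the SNP table has exactly the six fields shown above (the column names are invented here), reads both inputs as whitespace-separated, and uses small inline stand-ins for the two files. The idea is to turn every genotype into a risk-allele count (0, 1 or 2) and then compute all scores with one matrix product and all frequencies with one lookup and row-wise product.

```python
import io
import numpy as np
import pandas as pd

# Small inline stand-ins for the two input files (values copied from the post).
table1 = io.StringIO("AA   CC   TT\nAT   CG   TA\nTT   GG   AA\n")
table2 = io.StringIO(
    "SNP1 A T 1.25 0.223143551314 0.97273\n"
    "SNP2 C G 1.07 0.0676586484738 0.3\n"
    "SNP3 T A 1.08 0.0769610411361 0.1136\n"
)

# Column names are ASSUMPTIONS about what each field means.
numbers = pd.read_csv(table2, sep=r"\s+", header=None,
                      names=["snp", "risk", "other", "or", "log_or", "risk_freq"])
combinations = pd.read_csv(table1, sep=r"\s+", header=None)

log_or = numbers["log_or"].to_numpy()
p = numbers["risk_freq"].to_numpy()

# Risk-allele count (0, 1 or 2) for every genotype, one SNP column at a time.
dosage = np.column_stack([
    combinations[j].str.count(risk).to_numpy(dtype=int)
    for j, risk in enumerate(numbers["risk"])
])

# Hardy-Weinberg genotype frequencies per SNP, indexed by dosage: shape (3, n_snps).
hwe = np.vstack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])

combinations["score"] = dosage @ log_or
combinations["freq"] = hwe[dosage, np.arange(len(p))].prod(axis=1)
print(combinations)
```

The `freq < 1e-05` early exit has no direct equivalent in this form, since every row is computed in full; rows below the threshold could instead be filtered out afterwards, for example with `combinations[combinations["freq"] >= 1e-05]`.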

pandas python vectorization