More efficient way than zipping arrays for transposing a FinalReport table in Python?

I have been trying to transpose my FinalReport table of 2,000,000+ rows and 300+ columns on a cluster, but my Python script keeps getting killed because it runs out of memory. Does anyone have suggestions for a more memory-efficient way to hold the table data than the list of lists used in my code below?

import sys

separator = "\t"
m = []
# Read the file line by line instead of slurping it all with f.read(),
# which briefly holds both the raw string and the split lists in memory.
with open(sys.argv[1], 'r') as f:
    for line in f:
        m.append(line.rstrip("\n").split(separator))
# zip(*m) yields the columns of m, i.e. the rows of the transposed table.
for row in zip(*m):
    # Join the fields once per row; the original loop compared j to len(i)
    # instead of len(i) - 1, so it appended the separator after every field.
    print(separator.join(row))

I'm not able to finish storing the entire array; the script always gets killed towards the end of the 'm.append' step. Out of the 2379856 lines, the furthest I've gotten is 2321894 lines. I got those numbers by printing a running line count after the append call.

Thanks very much in advance!

python snp genome

Do you need to store it in memory? Can you not use the transpose function in numpy (assuming you're using numpy)?
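
For example, a rough sketch of that idea (assuming the table is tab-delimited and still fits in memory as a string array; the output path in sys.argv[2] is just a placeholder):

import sys
import numpy as np

# Load the table as a 2-D array of strings (dtype=str keeps the
# alphanumeric fields intact), transpose it, and write it back out.
table = np.loadtxt(sys.argv[1], dtype=str, delimiter="\t")
np.savetxt(sys.argv[2], table.T, fmt="%s", delimiter="\t")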

Oh, I'm not using numpy because my data is alphanumeric.

It's still faster and more memory efficient in numpy. Alternatively, if the file is much larger than the RAM you have available, do multiple passes over the file so you just process a column (or a few) at a time.
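
A rough sketch of the multiple-pass idea (the 10-column block size is arbitrary; tune it to the RAM you have):

import sys

infile = sys.argv[1]
block_size = 10  # columns handled per pass; adjust to fit in memory

# Count the columns from the first line.
with open(infile) as f:
    n_cols = len(f.readline().rstrip("\n").split("\t"))

# Each pass re-reads the file but keeps only block_size columns,
# then prints those columns as rows of the transposed table.
for start in range(0, n_cols, block_size):
    stop = min(start + block_size, n_cols)
    block = [[] for _ in range(stop - start)]
    with open(infile) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            for k in range(start, stop):
                block[k - start].append(fields[k])
    for column in block:
        print("\t".join(column))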

Have you tried CSVTK's transpose (here)? The tool works well on my tables, but they aren't that big.
