More efficient way than zipping arrays for transposing a FinalReport table in Python?

I have been trying to transpose my FinalReport table of 2,000,000+ rows and 300+ columns on a cluster, but my Python script keeps getting killed because it runs out of memory. Does anyone have suggestions for a more memory-efficient way to hold the table data than the list of lists used in my code below?

import sys

separator = "\t"
m = []
# Read the file line by line instead of slurping it all with f.read(),
# which briefly holds both the raw string and the split lists in memory.
with open(sys.argv[1], 'r') as f:
    for line in f:
        m.append(line.rstrip("\n").split(separator))
# zip(*m) yields the columns of m, i.e. the rows of the transposed table.
for row in zip(*m):
    # Join the fields once per row; the original loop compared j to len(i)
    # instead of len(i) - 1, so it appended the separator after every field.
    print(separator.join(row))

I'm not able to finish storing the entire array; the script always gets killed towards the end of the 'm.append' step. Out of the 2379856 lines, the furthest I've gotten is 2321894 lines. I got those numbers by printing a running line count after the append call.

Thanks very much in advance!

python snp genome

Do you need to store it in memory? Can you not use the transpose function in numpy (assuming you're using numpy)?
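
For example, a rough sketch of that idea (assuming the table is tab-delimited and still fits in memory as a string array; the output path in sys.argv[2] is just a placeholder):

import sys
import numpy as np

# Load the table as a 2-D array of strings (dtype=str keeps the
# alphanumeric fields intact), transpose it, and write it back out.
table = np.loadtxt(sys.argv[1], dtype=str, delimiter="\t")
np.savetxt(sys.argv[2], table.T, fmt="%s", delimiter="\t")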

Oh, I'm not using numpy because my data is alphanumeric.

It's still faster and more memory efficient in numpy. Alternatively, if the file is much larger than the RAM you have available, do multiple passes over the file so you just process a column (or a few) at a time.
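
A rough sketch of the multiple-pass idea (the 10-column block size is arbitrary; tune it to the RAM you have):

import sys

infile = sys.argv[1]
block_size = 10  # columns handled per pass; adjust to fit in memory

# Count the columns from the first line.
with open(infile) as f:
    n_cols = len(f.readline().rstrip("\n").split("\t"))

# Each pass re-reads the file but keeps only block_size columns,
# then prints those columns as rows of the transposed table.
for start in range(0, n_cols, block_size):
    stop = min(start + block_size, n_cols)
    block = [[] for _ in range(stop - start)]
    with open(infile) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            for k in range(start, stop):
                block[k - start].append(fields[k])
    for column in block:
        print("\t".join(column))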

Have you tried CSVTK's transpose (here)? The tool works well on my tables, but they aren't that big.
