Off topic:Dictionary to Matrix

4.9 years ago

flogin ▴ 280

I received a file with several results from DnaSP. Its content looks like this (only the first lines are shown; the original file has a huge number of lines):

     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
       Variable (polymorphic) sites: 0   (Total number of mutations: 0)
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,0000000000
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Number of pairwise comparisons: 0
     Number of significant pairwise comparisons by Fisher's exact test: 0
     Number of significant pairwise comparisons by chi-square test: 0
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00000
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
       Variable (polymorphic) sites: 11   (Total number of mutations: 11)
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00662
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
     Number of pairwise comparisons: 55
     Number of significant pairwise comparisons by Fisher's exact test: 51
     Number of significant pairwise comparisons by chi-square test: 51
     Value of ZnS (Kelly 1997): 0,4058
     Value of Za (Rozas et al. 2001): 0,5058
     Value of ZZ (Rozas et al. 2001): 0,1001
      r^2 values:  Y = 0,4200 - 0,0668X   (55 points)
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00662
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
        Nucleotide diversity, Pi: 0,00662
     Tajima's D: 2,27081     Statistical significance: *, P < 0.05
     Coding region: Tajima's D: 2,27081     *, P < 0.05
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
       Variable (polymorphic) sites: 20   (Total number of mutations: 20)
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00642
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
     Number of pairwise comparisons: 190
     Number of significant pairwise comparisons by Fisher's exact test: 1
     Number of significant pairwise comparisons by chi-square test: 145
     Value of ZnS (Kelly 1997): 0,6608
     Value of Za (Rozas et al. 2001): 0,7791
     Value of ZZ (Rozas et al. 2001): 0,1183
      r^2 values:  Y = 0,6736 - 0,0530X   (190 points)
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00642
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
        Nucleotide diversity, Pi: 0,00642
     Tajima's D: -1,83426     Statistical significance: *, P < 0.05
     Coding region: Tajima's D: -1,83426     *, P < 0.05
     NonSynonymous sites: Tajima's D(NonSyn): -1,74110     *, P < 0.05

As you can see, there are several redundant lines (Input Data File, for example), and the specific information always comes after a ":". To analyze these results, I want to make a table with several pieces of information: the name of the file, the number of sites, and the results of each evolutionary test (Tajima's D, Fu and Li's, and linkage disequilibrium).

I have a little experience with Python, and I think that parsing the file into a Python dictionary and converting it to a data frame could be a good way to solve my problem. So I wrote this script:

# -*- coding: utf-8 -*-
import pandas as pd
# opening txt file
file = open("GST71_OK.txt","r")
#creating output file
output = open('output.csv','w+')
# creating dictionary keys order
keys_order = ["Input Data File","Number of sites","Variable (polymorphic) sites", "Nucleotide diversity, Pi",
              "Value of ZnS", "Value of Za","Value of ZZ","r^2 values","Number of pairwise comparisons",
             "Fisher's exact test","chi-square test","Tajima's D","Tajima's D(Syn)","Tajima's D(NonSyn)",
              "Tajima's D(Sil)"]

# creating dictionary
dictio = dict()
# searching patterns by line
for line in file:
    for key in keys_order: # patterns are present in keys_order
        key, values = line.strip().split(":") # values are present after ':'
        dictio.setdefault(key, set()).update(values)
# converting dictionary to dataframe
dictio_df = pd.DataFrame.from_dict(dictio, orient='index', 
                           columns=['Input','Total Sites', 'Polymorphic Sites', 'Pi', 'ZnS', 'Za','ZZ', 'r^2','Pairwise Comparisons',
                                    'Fischer','Chi^2','Tajimas D','Syn','NonSyn','Sil'])
#writing output
with open("output.csv", "wt") as out:
    for line in dictio_df:
        print(line,file=out)
output.close()

Briefly: the script opens the file with the results, creates an output file, uses a list of keys (the pieces of information that I want) to search for each key in the input file and store it in a dictionary, converts the dictionary to a data frame, and saves that data frame to an output file.

But I'm stuck on this error:

ValueError                                Traceback (most recent call last)
<ipython-input-11-ae4db977bd1a> in <module>()
     16 for line in file:
     17     for key in keys_order: # patterns are present in keys_order
---> 18         key, values = line.strip().split(":") # values are present after ':'
     19         dictio.setdefault(key, set()).update(values)
     20 # converting dictionary to dataframe

ValueError: too many values to unpack (expected 2)
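
The error can be reproduced on a single line of the sample above. The following is a minimal sketch, assuming the cause is the second colon in the Windows path (`C:\...`): `split(":")` cuts at every colon and returns three pieces for such a line, which cannot be unpacked into two names. `str.partition(":")` cuts only at the first colon. The variable names are illustrative.

```python
# One line copied from the sample above; the path contains a second colon.
line = r"     Input Data File: C:\...\GSTE7_EXON.AB.07.fas"

# A plain split cuts at every colon, so there are three pieces, not two.
parts = line.strip().split(":")

# partition() cuts only at the first colon and always returns three items:
# the text before it, the separator itself, and everything after it.
key, sep, value = line.strip().partition(":")
key, value = key.strip(), value.strip()

print(len(parts))  # 3
print(key)         # Input Data File
print(value)       # C:\...\GSTE7_EXON.AB.07.fas
```

Other lines in the sample would probably fail the same way, for example `Selected region: 1-672   Number of sites: 672`, which holds two "label: value" pairs on one line.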

Can anyone help me?

Thanks
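
One possible way to build the table described above is to search each line for a few labelled values with regular expressions and collect them into one row per input file. This is only a sketch under several assumptions: the `FIELDS` patterns and the `parse_dnasp` helper are hypothetical names, only a subset of the wanted columns is covered, blocks that repeat the same `Input Data File` path are merged into one row (later values overwrite earlier ones), and decimal commas are converted to points. The sample text is a shortened copy of lines from the question.

```python
import csv
import io
import re

# Shortened sample in the same layout as the DnaSP output shown above.
sample = r"""
 Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
 Selected region: 1-672   Number of sites: 672
   Variable (polymorphic) sites: 11   (Total number of mutations: 11)
 Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
 Selected region: 1-672   Number of sites: 672
 Nucleotide diversity, Pi: 0,00662
 Value of ZnS (Kelly 1997): 0,4058
 Tajima's D: 2,27081     Statistical significance: *, P < 0.05
 Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
 Selected region: 1-672   Number of sites: 672
 Nucleotide diversity, Pi: 0,00642
 Tajima's D: -1,83426     Statistical significance: *, P < 0.05
"""

# Hypothetical column name -> pattern; group 1 captures the value.
FIELDS = {
    "Input": r"Input Data File:\s*(.+)",
    "Total Sites": r"Number of sites:\s*(\d+)",
    "Polymorphic Sites": r"Variable \(polymorphic\) sites:\s*(\d+)",
    "Pi": r"Nucleotide diversity, Pi:\s*(-?[\d,]+)",
    "ZnS": r"Value of ZnS[^:]*:\s*(-?[\d,]+)",
    # Anchored at the line start so "Coding region: Tajima's D" is skipped.
    "Tajimas D": r"^\s*Tajima's D:\s*(-?[\d,]+)",
}


def parse_dnasp(text):
    """Return one dict per input file, in order of first appearance."""
    rows = {}       # file path -> {column: value}
    current = None  # row of the block being read
    for line in text.splitlines():
        for column, pattern in FIELDS.items():
            match = re.search(pattern, line)
            if match is None:
                continue
            value = match.group(1).strip()
            if column == "Input":
                # Reuse the row if this file was already seen.
                current = rows.setdefault(value, {"Input": value})
            elif current is not None:
                # Decimal commas become points so the value parses as a float.
                current[column] = float(value.replace(",", "."))
    return list(rows.values())


rows = parse_dnasp(sample)

# Write the rows as CSV text; columns missing from a row are left empty.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(FIELDS))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

If a data frame is preferred, the same list of dictionaries can be passed to pandas with `pd.DataFrame(rows, columns=list(FIELDS))`, and `to_csv("output.csv", index=False)` would then write the file.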
