Off topic:Dictionary to Matrix
4.9 years ago
flogin ▴ 280

I have received a file with several results from DnaSP. Its content looks like this (only the first lines are shown; the original file has a huge number of lines):

     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
       Variable (polymorphic) sites: 0   (Total number of mutations: 0)
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,0000000000
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Number of pairwise comparisons: 0
     Number of significant pairwise comparisons by Fisher's exact test: 0
     Number of significant pairwise comparisons by chi-square test: 0
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00000
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
       Variable (polymorphic) sites: 11   (Total number of mutations: 11)
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00662
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
     Number of pairwise comparisons: 55
     Number of significant pairwise comparisons by Fisher's exact test: 51
     Number of significant pairwise comparisons by chi-square test: 51
     Value of ZnS (Kelly 1997): 0,4058
     Value of Za (Rozas et al. 2001): 0,5058
     Value of ZZ (Rozas et al. 2001): 0,1001
      r^2 values:  Y = 0,4200 - 0,0668X   (55 points)
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00662
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
        Nucleotide diversity, Pi: 0,00662
     Tajima's D: 2,27081     Statistical significance: *, P < 0.05
     Coding region: Tajima's D: 2,27081     *, P < 0.05
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
       Variable (polymorphic) sites: 20   (Total number of mutations: 20)
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00642
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
     Number of pairwise comparisons: 190
     Number of significant pairwise comparisons by Fisher's exact test: 1
     Number of significant pairwise comparisons by chi-square test: 145
     Value of ZnS (Kelly 1997): 0,6608
     Value of Za (Rozas et al. 2001): 0,7791
     Value of ZZ (Rozas et al. 2001): 0,1183
      r^2 values:  Y = 0,6736 - 0,0530X   (190 points)
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00642
     Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
     Selected region: 1-672   Number of sites: 672
        Nucleotide diversity, Pi: 0,00642
     Tajima's D: -1,83426     Statistical significance: *, P < 0.05
     Coding region: Tajima's D: -1,83426     *, P < 0.05
     NonSynonymous sites: Tajima's D(NonSyn): -1,74110     *, P < 0.05

As you can see, there are several redundant lines ("Input Data File", for example), and the specific information always comes after a ":". To analyze these results, I want to build a table with several pieces of information: the name of the file, the number of sites, and the results of each evolutionary test (Tajima's D, Fu and Li's tests, and linkage disequilibrium).

I have a little experience with Python, and I think that building a Python dictionary and converting it to a data frame could be a good way to solve my problem. So I wrote this script:

# -*- coding: utf-8 -*-
import pandas as pd
# opening txt file
file = open("GST71_OK.txt","r")
#creating output file
output = open('output.csv','w+')
# creating dictionary keys order
keys_order = ["Input Data File","Number of sites","Variable (polymorphic) sites", "Nucleotide diversity, Pi",
              "Value of ZnS", "Value of Za","Value of ZZ","r^2 values","Number of pairwise comparisons",
             "Fisher's exact test","chi-square test","Tajima's D","Tajima's D(Syn)","Tajima's D(NonSyn)",
              "Tajima's D(Sil)"]

# creating dictionary
dictio = dict()
# searching patterns by line
for line in file:
    for key in keys_order: # patterns are present in keys_order
        key, values = line.strip().split(":") # values are present after ':'
        dictio.setdefault(key, set()).update(values)
# converting dictionary to dataframe
dictio_df = pd.DataFrame.from_dict(dictio, orient='index', 
                           columns=['Input','Total Sites', 'Polymorphic Sites', 'Pi', 'ZnS', 'Za','ZZ', 'r^2','Pairwise Comparisons',
                                    'Fischer','Chi^2','Tajimas D','Syn','NonSyn','Sil'])
#writing output
with open("output.csv", "wt") as out:
    for line in dictio_df:
        print(line,file=out)
output.close()

Briefly: the script opens the results file, creates an output file, uses a list of keys (the pieces of information I want) to search the input file and store the matches in a dictionary, converts the dictionary to a data frame, and saves that data frame to the output file.

But I'm stuck on this error:

ValueError                                Traceback (most recent call last)
<ipython-input-11-ae4db977bd1a> in <module>()
     16 for line in file:
     17     for key in keys_order: # patterns are present in keys_order
---> 18         key, values = line.strip().split(":") # values are present after ':'
     19         dictio.setdefault(key, set()).update(values)
     20 # converting dictionary to dataframe

ValueError: too many values to unpack (expected 2)
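
A note on the traceback: `split(":")` cuts at every colon, and many lines in the sample contain more than one (the `C:` drive letter in the path, or two "label: value" pairs on the same line), so the two-name unpacking fails. The small sketch below uses three lines copied from the sample above; `split_label` is a hypothetical helper name, not part of the original script. It cuts at the first colon only, using `str.partition`.

```python
# Three lines copied from the DnaSP output shown above.
lines = [
    r"Input Data File: C:\...\GSTE7_EXON.AB.07.fas",
    "Selected region: 1-672   Number of sites: 672",
    "Nucleotide diversity, Pi: 0,00662",
]


def split_label(line):
    """Split on the first ':' only and return (label, value), both stripped.

    Hypothetical helper for illustration; not part of the original script.
    """
    label, _, value = line.strip().partition(":")
    return label.strip(), value.strip()


for line in lines:
    # Number of pieces produced by split(":"), then the first-colon split.
    print(len(line.split(":")), split_label(line))
```

Two other details in the original loop are worth checking. The name `key` from `for key in keys_order` is overwritten by the unpacking on the next line, so `keys_order` never filters anything. Also, `set.update` called with a string adds its individual characters rather than the whole value. Splitting at the first colon still leaves "Number of sites: 672" attached to the "Selected region" value, so that field needs its own pattern.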

Can anyone help me?

Tx
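
One possible way to build the table is sketched below. It assumes that every block starts with an "Input Data File:" line and that repeated blocks for the same file should be merged into a single row. The column names, the regular expressions, and the function names (`parse_dnasp`, `to_number`, `write_csv`) are illustrative choices based only on the sample shown in the question, not a DnaSP specification. The `r^2 values` line is left out because it holds a regression equation rather than a single number. The sketch uses only the standard library; when a value such as Pi appears several times for one file, the first one seen is kept.

```python
import csv
import re

NUM = r"(-?\d+(?:,\d+)?)"  # the sample writes decimals with a comma

# Column name -> regex with one capture group (labels taken from the sample).
PATTERNS = {
    "Number of sites": r"Number of sites:\s*" + NUM,
    "Polymorphic sites": r"Variable \(polymorphic\) sites:\s*" + NUM,
    "Pi": r"Nucleotide diversity, Pi:\s*" + NUM,
    "Pairwise comparisons": r"Number of pairwise comparisons:\s*" + NUM,
    "Fisher": r"Fisher's exact test:\s*" + NUM,
    "Chi-square": r"chi-square test:\s*" + NUM,
    "ZnS": r"Value of ZnS[^:]*:\s*" + NUM,
    "Za": r"Value of Za[^:]*:\s*" + NUM,
    "ZZ": r"Value of ZZ[^:]*:\s*" + NUM,
    "Tajima's D": r"^Tajima's D:\s*" + NUM,
    "Tajima's D(NonSyn)": r"Tajima's D\(NonSyn\):\s*" + NUM,
}
COMPILED = {name: re.compile(rx) for name, rx in PATTERNS.items()}


def to_number(text):
    """Convert '0,00662' to 0.00662 and '672' to 672."""
    return float(text.replace(",", ".")) if "," in text else int(text)


def parse_dnasp(lines):
    """Return one dict per input file, merging its repeated blocks."""
    records = {}    # file name -> row dict (insertion order is preserved)
    current = None  # row for the block currently being read
    for raw in lines:
        line = raw.strip()
        if line.startswith("Input Data File:"):
            name = line.partition(":")[2].strip()
            current = records.setdefault(name, {"Input Data File": name})
            continue
        if current is None:
            continue  # ignore anything before the first file header
        for column, rx in COMPILED.items():
            match = rx.search(line)
            if match:
                # setdefault keeps the first value seen for this file
                current.setdefault(column, to_number(match.group(1)))
    return list(records.values())


def write_csv(records, path):
    """Write the rows to a CSV file, leaving missing statistics empty."""
    columns = ["Input Data File", *PATTERNS]
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(records)


# Small demo on lines copied from the sample above.
demo = [
    r"     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt",
    "     Selected region: 1-672   Number of sites: 672",
    "     Nucleotide diversity, Pi: 0,00662",
    "     Tajima's D: 2,27081     Statistical significance: *, P < 0.05",
]
print(parse_dnasp(demo))
```

With the file from the question, opening `GST71_OK.txt`, passing the file object to `parse_dnasp`, and handing the result to `write_csv(..., "output.csv")` would give one row per input file. If a data frame is preferred, `pandas.DataFrame(records)` accepts the same list of dicts.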

python pandas dictionary table • 622 views