Question

Is there a way to extract information in tabular format from text with a regular pattern?

5.8 years ago

rimgubaev ▴ 330

Hello everyone! I have output from NsiteM: a text file with promoter analysis results. The text has a regular pattern:

           1. QUERY: Gene1
         Length of Query Sequence:        500 bp      | Nucleotide Frequencies:  A -  0.37   G -  0.15   T -  0.27   C -  0.21
         TFBS AC: RSP00204//OS: arabidopsis (Arabidopsis thaliana) /GENE: AtEm6/TFBS: ABRE/6.2 /BF: ABI5
                   Found in    5 (100.00 %) SEQs (out of    5)
         Motifs on "+" Strand: Mean Exp. Number   0.00232    Found  1
             393  GACACGTGtC      402 (Mism.= 1)
         TFBS AC: RSP00219//OS: arabidopsis (Arabidopsis thaliana) /GENE: RBCS-1A/TFBS: G box-1 /BF: HY5; Arabidopsis bZIP protein - transcriptionl factor
                   Found in    5 (100.00 %) SEQs (out of    5)
         Motifs on "+" Strand: Mean Exp. Number   0.00799    Found  1  
             392  gGACACGTGtCA      403 (Mism.= 2)

    Totally      2 motifs of     2 different TFBSs have been found
    ------------------------------------------------------------
       2. QUERY: Gene2
     Length of Query Sequence:        500 bp      | Nucleotide Frequencies:  A -  0.37   G -  0.17   T -  0.26   C -  0.19
     TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
               Found in    3 ( 60.00 %) SEQs (out of    5)
     Motifs on "+" Strand: Mean Exp. Number   0.00239    Found  1
         299  GCCACGTGGC      308 (Mism.= 0)
     Motifs on "-" Strand: Mean Exp. Number   0.00239    Found  1
         308  GCCACGTGGC      299 (Mism.= 0)
     TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
               Found in    3 ( 60.00 %) SEQs (out of    5)
     Motifs on "+" Strand: Mean Exp. Number   0.00258    Found  1
         300  CCACGTGGCa      309 (Mism.= 1)
     Motifs on "-" Strand: Mean Exp. Number   0.00221    Found  1
         307  CCACGTGGCa      298 (Mism.= 1)

     Totally      4 motifs of    2 different TFBSs have been found
    ------------------------------------------------------------
etc.

So my question is: how can I extract the data on cis-regulatory elements from the text above into the following table?

    QUERY     |  TFBS AC   |  Strand  |  Mean Exp. Number  | 
    --------- | ---------- | -------- | ------------------ |
    Gene1     |  RSP00204  |     +    |        0.00232     |
    Gene1     |  RSP00219  |     +    |        0.00799     |
    Gene2     |  RSP00073  |     +    |        0.00239     |
    Gene2     |  RSP00073  |     -    |        0.00239     |
    Gene2     |  RSP00153  |     +    |        0.00258     |
    Gene2     |  RSP00153  |     -    |        0.00221     |

Is it better to learn Python scripting to solve such a problem, or is it possible to solve it with sed, for example? Or are there perhaps ready-made solutions already?

Thank you in advance!

text mining promoter analysis NsiteM • 1.3k views

5.8 years ago

Joe 21k

I think you'll want a fairly heavy-duty Python script for this.

Here are two links to some of my own code for parsing files that have a similar periodic structure. You may be particularly interested in pandas.read_fwf(), which reads files that have fixed-width columns.

https://github.com/jrjhealey/bioinfo-tools/blob/master/tabulateHHpred.py#L37-L68

The second script shows (what I think is) a neat way to deal with files that are periodic in structure: chunk the file into several different sections, with a regex and a behaviour for each (links to the sections of most interest):

https://github.com/jrjhealey/MultiGeneBlastParser/blob/master/MGBparser.py#L264-L272

https://github.com/jrjhealey/MultiGeneBlastParser/blob/master/MGBparser.py#L280-L288

The latter script also puts everything into pandas DataFrames, which makes the results easy to work with afterwards and might be of interest.
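The chunking idea described above can be sketched with the standard library alone. This is only an illustration of the approach, not the code from the linked scripts: the function names `split_sections` and `parse_report` are hypothetical, and the embedded report is an abbreviated copy of the layout shown in the question.

```python
import re

# Abbreviated sample in the layout shown in the question (assumed representative).
REPORT = """\
   1. QUERY: Gene1
 Length of Query Sequence:        500 bp
 TFBS AC: RSP00204//OS: arabidopsis (Arabidopsis thaliana) /GENE: AtEm6/TFBS: ABRE/6.2 /BF: ABI5
 Motifs on "+" Strand: Mean Exp. Number   0.00232    Found  1
     393  GACACGTGtC      402 (Mism.= 1)
 Totally      1 motifs of     1 different TFBSs have been found
 ------------------------------------------------------------
   2. QUERY: Gene2
 Length of Query Sequence:        500 bp
 TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
 Motifs on "+" Strand: Mean Exp. Number   0.00239    Found  1
     299  GCCACGTGGC      308 (Mism.= 0)
 Motifs on "-" Strand: Mean Exp. Number   0.00239    Found  1
     308  GCCACGTGGC      299 (Mism.= 0)
"""

# One regex per level of the periodic structure.
QUERY_RE = re.compile(r'^.*QUERY:[ \t]*(\S+)', re.MULTILINE)
TFBS_RE = re.compile(r'^[ \t]*TFBS AC:[ \t]*([^/\s]+)', re.MULTILINE)
MOTIF_RE = re.compile(r'Motifs on "([+-])" Strand: Mean Exp\. Number\s+(\d+\.\d+)')


def split_sections(text, header_re):
    """Yield (header_match, body) for every section that starts at a header match."""
    matches = list(header_re.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        yield match, text[match.end():end]


def parse_report(text):
    """Return (query, tfbs_ac, strand, mean_exp) tuples, one per 'Motifs on' line."""
    rows = []
    for query, query_body in split_sections(text, QUERY_RE):
        for tfbs, tfbs_body in split_sections(query_body, TFBS_RE):
            for strand, mean_exp in MOTIF_RE.findall(tfbs_body):
                rows.append((query.group(1), tfbs.group(1), strand, mean_exp))
    return rows


rows = parse_report(REPORT)
for row in rows:
    print('\t'.join(row))
```

Splitting first by QUERY and then by TFBS AC means each inner section only ever sees its own lines, so no state has to be carried from one line to the next.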


score 3 · Accepted Answer · 2018-08-07

Solved this problem! I've written a small Python script which generates tab-separated text from Nsite output. Here is the code:

import re
import sys

in_path = sys.argv[1]   # input file name (first argument)
out_path = sys.argv[2]  # output file name (second argument)

with open(in_path) as inf, open(out_path, 'w') as ouf:
    # column names
    ouf.write('GeneID\tLength\tFactorID\tStrand\tMean_exp_value\tFound\n')
    for line in inf:
        line = line.strip()
        query = re.search(r'QUERY:\s*(.+)', line)
        if query:
            # gene name, kept until the next QUERY line;
            # searching (not slicing) tolerates a leading "1. " as in the sample
            gene_id = query.group(1)
        elif line.startswith('Length of Query Sequence:'):
            # promoter sequence length, e.g. "500 bp"
            length = re.search(r'\d+\sbp', line).group(0)
        elif line.startswith('TFBS AC:'):
            # TF binding site accession, kept until the next TFBS AC line
            factor_id = re.search(r'TFBS AC:\s*([^/\s]+)', line).group(1)
        elif line.startswith('Motifs on'):
            # strand sign between the double quotes: "+" or "-"
            strand = re.search(r'"([+-])" Strand:', line).group(1)
            # Mean Exp. Number
            mean_exp_value = re.search(r'\d+\.\d+', line).group(0)
            # number of TF binding sites found
            found = re.search(r'Found\s+(\d+)', line).group(1)
            # one tab-separated row per "Motifs on" line
            ouf.write('\t'.join([gene_id, length, factor_id, strand,
                                 mean_exp_value, found]) + '\n')

The script is also available here with example files.
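The script writes six tab-separated columns, while the question asked for a four-column table. A minimal sketch of that last step, assuming the column names from the script's header line, might look like this. The function name `to_pipe_table` and the sample rows are illustrative only; the values are copied from the question.

```python
import csv
import io

# Hypothetical sample of the script's tab-separated output (values from the question).
TSV = (
    "GeneID\tLength\tFactorID\tStrand\tMean_exp_value\tFound\n"
    "Gene1\t500 bp\tRSP00204\t+\t0.00232\t1\n"
    "Gene2\t500 bp\tRSP00073\t+\t0.00239\t1\n"
    "Gene2\t500 bp\tRSP00073\t-\t0.00239\t1\n"
)

# (title in the requested table, column name in the script's output)
COLUMNS = [
    ("QUERY", "GeneID"),
    ("TFBS AC", "FactorID"),
    ("Strand", "Strand"),
    ("Mean Exp. Number", "Mean_exp_value"),
]


def to_pipe_table(tsv_file):
    """Render the selected TSV columns as a pipe-delimited text table."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    records = [[rec[src] for _, src in COLUMNS] for rec in reader]
    titles = [title for title, _ in COLUMNS]
    # Each column is as wide as its longest cell, title included.
    widths = [
        max([len(title)] + [len(rec[i]) for rec in records])
        for i, title in enumerate(titles)
    ]

    def fmt(cells):
        padded = (cell.ljust(width) for cell, width in zip(cells, widths))
        return "| " + " | ".join(padded) + " |"

    lines = [fmt(titles), fmt(["-" * width for width in widths])]
    lines += [fmt(rec) for rec in records]
    return "\n".join(lines)


table = to_pipe_table(io.StringIO(TSV))
print(table)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `to_pipe_table(open("output.tsv", newline=""))`, where `output.tsv` is a placeholder name.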