Biostar Beta. Not for public use.
Python: grep the strings within "[ ]"
2.9 years ago
horsedog • 30

Hi, I'm using a bioinformatics tool to parse my sequences, and I'd like to extract some information from its output. There are thousands of query names corresponding to different sequences, like this:

  >lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]

What I need is "[location=(207..914)]". How can I achieve this? The name differs from sequence to sequence. I tried splitting on spaces and taking the fifth element, but in some cases "location" is not the fifth one, and sometimes there is no "location" at all, meaning there is no CDS in that sequence, so it can simply be skipped. I'm thinking of using "grep" or "re.search", but my attempt didn't work:

for line in open(file,"r").readlines():   
  if "location=" in line:  
    cds = grep “[location = *]” line  
  print(cds)

Does anyone have an idea?
Many thanks!

python • 575 views

If you want to stick with re then

import re

with open("test", "r") as handle:
    for line in handle:
        if "location" in line:
            # split the header on spaces and print the field that holds the location
            for m in re.split(r" ", line.strip()):
                if "location" in m:
                    print(m)
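A regex can also pull the bracketed field out directly, without splitting on spaces first. This is only a minimal sketch, assuming every location field looks like "[location=...]" with no "]" inside it; extract_location is a made-up helper name, not something from the original post:

```python
import re

# Matches "[location=" followed by anything up to the next "]".
LOCATION_RE = re.compile(r"\[location=[^\]]*\]")

def extract_location(header):
    """Return the '[location=...]' field of a FASTA header, or None if absent."""
    match = LOCATION_RE.search(header)
    return match.group(0) if match else None

header = (">lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] "
          "[protein=hypothetical protein] [protein_id=WP_092665365.1] "
          "[location=(207..914)] [gbkey=CDS]")
print(extract_location(header))  # [location=(207..914)]
```

Headers without a location give None, so they are easy to skip.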

Good one :). Shortening the code further:

with open("test.txt", "r") as handle:
    for line in handle:
        if "location" in line:
            # works for the test input below, where the location is at index 5
            print(line.split()[5])

output:

[location=(207..914)]
[location=(2070..9140)]
[location=(20700..91400)]

input:

$ cat test.txt
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(2070..9140)] [gbkey=CDS]
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(20700..91400)] [gbkey=CDS]
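Note that a fixed index only works while every header splits into the same number of fields. The protein name itself can contain spaces ("hypothetical protein" is two fields), so a different name shifts the position. A small sketch of the problem and one possible workaround; the header below is invented for illustration, location_field is a hypothetical helper, and it assumes the location field itself contains no spaces:

```python
def location_field(header):
    """Return the first space-separated field starting with '[location=', else None."""
    return next((f for f in header.split() if f.startswith("[location=")), None)

# Hypothetical header whose protein name is a single word.
header = (">seq1 [locus_tag=X_0001] [protein=transposase] "
          "[protein_id=WP_000000001.1] [location=(207..914)] [gbkey=CDS]")

print(header.split()[5])       # [gbkey=CDS]  (wrong field)
print(location_field(header))  # [location=(207..914)]
```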

I tried to use "split" by space to take the fifth element, but in some cases the "location" is not the fifth one

otherwise: grep -e 'location' myfile.fasta | cut -f 6 -d ' ' > locations.txt

20 months ago
st.ph.n ♦ 2.5k
Philadelphia, PA

Grep is not a Python command. If you're sticking with Python, and not bash commands, here's a quick script to get you started:

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
        for line in f:
                # find FASTA headers. 
                if line.startswith(">"):
                        # check if 'location' in header
                        if 'location' in line:
                                # split header by spaces into list
                                x = line.strip().split(' ')
                                # for each item in header check if 'location' is in that item
                                for i in x:
                                        if 'location' in i:
                                                print(i)

Prints:

[location=(207..914)]

Save as find_loc.py and run as: python find_loc.py myfile.fasta > locations.txt
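If you also need to know which sequence each location belongs to, and which sequences have none, one option is to collect the results in a dict. This is a rough sketch rather than a drop-in replacement: locations_by_id is a hypothetical name, the example headers are invented, and it assumes the sequence ID is the text between ">" and the first whitespace:

```python
import re

LOCATION_RE = re.compile(r"\[location=[^\]]*\]")

def locations_by_id(lines):
    """Map each FASTA header's sequence ID to its location field (None if missing)."""
    result = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        parts = line[1:].split(None, 1)
        seq_id = parts[0] if parts else ""  # text between '>' and the first whitespace
        match = LOCATION_RE.search(line)
        result[seq_id] = match.group(0) if match else None
    return result

example = [
    ">seqA [protein=hypothetical protein] [location=(207..914)] [gbkey=CDS]\n",
    "ATGAAA\n",
    ">seqB [protein=hypothetical protein] [gbkey=CDS]\n",
]
print(locations_by_id(example))
```

With a real file you could pass the open file handle instead of the example list, since the function only iterates over lines.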

18 months ago
India
>>> import re
>>> with open("test.txt", "r") as t:
...     f = t.read()
...
>>> pattern = re.compile(r'\[location=\([0-9]+\.\.[0-9]+\)\]')
>>> re.findall(pattern, f)

output:

===========================

['[location=(207..914)]',
 '[location=(2070..9140)]',
 '[location=(20700..91400)]',
 '[location=(207000..914000)]']

==============================

>>> print (f)

=====================================

output:

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(2070..9140)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(20700..91400)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207000..914000)] [gbkey=CDS]

======================================
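If the coordinates themselves are needed as numbers, capture groups can be added to the same kind of pattern. A minimal sketch, assuming locations always have the simple "(start..end)" form shown above; parse_simple_location is a hypothetical helper, and headers in any other shape (for instance a location wrapped in complement(...) or join(...), if your files contain them) simply return None here:

```python
import re

# Two capture groups: start and end of a simple "(start..end)" location.
SIMPLE_LOCATION_RE = re.compile(r"\[location=\((\d+)\.\.(\d+)\)\]")

def parse_simple_location(header):
    """Return (start, end) as ints, or None if no simple location is present."""
    match = SIMPLE_LOCATION_RE.search(header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

header = ">seq1 [protein=hypothetical protein] [location=(207..914)] [gbkey=CDS]"
start, end = parse_simple_location(header)
print(start, end, end - start + 1)  # 207 914 708
```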
