Question

Parse fasta header with regex in python

0

Entering edit mode

6.7 years ago

David ▴ 230

Hi, I would like to parse a fasta file and get all headers and seqs that match some strings (so called pattern below).

What happens is that all the headers match and the script returns all sequences so it´s not working as expected.

Can you help finding what´s wrong ??

# biopython                                                                                                                                                                          
from Bio import SeqIO

# regex library                                                                                                                                                                      
import re

# file with FASTA sequence                                                                                                                                                           
infile = "seq.fa"

# File looks like this
#>1238344 mouse
#aagctacgacatcagctaca
#>1238344 homo sapiens
#ttagcatcagcatcagctacta


# pattern to search for                                                                                                                                                              
#pattern = "sapiens|mouse"                                                                                                                                                           
pattern="Eukaryota|metagenome|[Homo Sapiens]|[Mus musculus]|[Rattus norvegicus]|Rhizobium|Gorilla|beringei|thaliana|[Oryza sativa]|Dictyostelium|mitochondria|Equus caballus|Plasmod\
ium falciparum|Drosophila melanogaster"
# look through each FASTA sequence in the file                                                                                                                                       
for seq_record in SeqIO.parse(infile, "fasta"):


    matches = re.findall(pattern, seq_record.description, re.I)
    if matches:
        #print "Matches = ", len(matches),"\n",                                                                                                                                      
        print ">",seq_record.id," ",seq_record.description, "\n",
        print seq_record.seq,"\n",

Thanks

biopython python fasta regex • 5.8k views

ADD COMMENT • link updated 6.7 years ago by Andrzej Zielezinski 11k • written 6.7 years ago by David ▴ 230

0

Entering edit mode

What happens is that all the headers match and the script returns all sequences so it´s not working as expected.

Providing some examples would help. And your code snippet is sketchy at best.

ADD REPLY • link 6.7 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Yes, there are definitely important parts missing from that code.

ADD REPLY • link 6.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Sorry guys my copy and paste didn´t work as expected. Hope it´s better now.

ADD REPLY • link 6.7 years ago by David ▴ 230

0

Entering edit mode

if matches in this case will always evaluate to True, therefore it will print everything in the input file.

ADD REPLY • link 6.7 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

If i add the following sequence which is not in my pattern list (sorry guys forgot to add it). It is also returned.

#>1238344 mouse
#aagctacgacatcagctaca

ADD REPLY • link 6.7 years ago by David ▴ 230

1

Entering edit mode

Running your above script, and adding a print line to print matches (before the if statement), yields this:

[' ', 'm', 'o', 'u', 's', 'e']
> 1238344   1238344 mouse
aagctacgacatcagctaca
[' ', 'h', 'o', 'm', 'o', ' ', 's', 'a', 'p', 'i', 'e', 'n', 's']
> 1238344   1238344 homo sapiens
ttagcatcagcatcagctacta

Don't use re.findall in this instance. Create a list, with each string you want to search for, then for each record in the input fasta, check if it is contained in the list.

ADD REPLY • link 6.7 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Thanks, Can you give a small example, don´t really know how to do that ?

ADD REPLY • link 6.7 years ago by David ▴ 230

score 7 · Accepted Answer · 2017-08-23

7

Entering edit mode

6.7 years ago

Andrzej Zielezinski 11k

This should do the work:

from Bio import SeqIO                                                                                                                                                         
infile = "seq.fa"

patterns = ["Eukaryota", "metagenome", "Homo Sapiens", 
  "Mus musculus", "Rattus norvegicus", "Rhizobium",
  "Gorilla", "beringei", "thaliana", "Oryza sativa",
  "Dictyosteliu", "mitochondria", "Equus caballus",
  "Plasmodium falciparum", "Drosophila melanogaster"
]

for seq_record in SeqIO.parse(infile, "fasta"):
    for p in patterns:
        if p.lower() in seq_record.description.lower():
            print(seq_record.format("fasta"))
            break

ADD COMMENT • link 6.7 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Great thanks for the answer

ADD REPLY • link 6.7 years ago by David ▴ 230

1

Entering edit mode

If this answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLY • link 6.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Not really important but this should be even better :

from Bio import SeqIO
infile = "./test.fasta"

patterns = ["Eukaryota", "metagenome", "Homo Sapiens",
  "Mus musculus", "Rattus norvegicus", "Rhizobium",
  "Gorilla", "beringei", "thaliana", "Oryza sativa",
  "Dictyosteliu", "mitochondria", "Equus caballus",
  "Plasmodium falciparum", "Drosophila melanogaster"
]

patterns = [pattern.lower() for pattern in patterns]
for seq_record in SeqIO.parse(infile, "fasta"):
    desc = seq_record.description.lower()
    if any(pattern in desc for pattern in patterns) :
        print(seq_record.format("fasta"))

ADD REPLY • link 6.7 years ago by jsgounot ▴ 170