Finding motifs in fasta file with Biopython
0
0
Entering edit mode
3.5 years ago
lorenzinip • 0

Hi I have a large FASTA file which looks like this

>EMBOSS_001
GTCATCACAGTTTTCCCCGCCCTGTATATGGCTAATAGGCCCTCGCAATCTCCGATAAAT
>EMBOSS_002
CTGATGCTAGTCCCGTGTCCCAAACACTTCCGCAGAAGATCGCCCCGGGGGGCGTGTACC
>EMBOSS_003
CGCGCATGGACTCCATCCGTGATCTTTTGAGGCCATGAGTCCAAGTTTACCTCGGATATA
>EMBOSS_004
CGACCCGCCATTCTCCATCGTAACTTAGTCACGACGACAGTCAGCTTGTTCGTTCGTTAT

I would like to find all the sequences which have a specific motif and eliminate them. For example if the motifs is TTTCCC, the expected output should be:

>EMBOSS_002 CTGATGCTAGTCCCGTGTCCCAAACACTTCCGCAGAAGATCGCCCCGGGGGGCGTGTACC
>EMBOSS_003 CGCGCATGGACTCCATCCGTGATCTTTTGAGGCCATGAGTCCAAGTTTACCTCGGATATA
>EMBOSS_004 CGACCCGCCATTCTCCATCGTAACTTAGTCACGACGACAGTCAGCTTGTTCGTTCGTTAT

I wrote a code with Biopython:

from Bio.Seq import Seq
import Bio.motifs as motifs
from Bio import SeqIO

instances = [Seq("TTTCCC")]
m = motifs.create(instances)

reads = list(SeqIO.parse("/Users/EMBOSS-6.6.0/emboss/genome.fa", "fasta"))

for i in range(len(reads)):
    for pos, seq in m.instances.search(reads[i].seq):
        print("%i %s" % (pos, seq))

However it returns me only the info of the position of the start of the motifs, 11 TTTCCC

I would like to return the info of also the sequence where it was found:

EMBOSS_001 11 TTTCCC

In addition, I would like the code to eliminate that sequence where the motif was found.

biopython • 1.5k views
ADD COMMENT
1
Entering edit mode

Unless you're doing something more complicated you haven't told us, you don't need to use Bio.motifs.

You can just string search:

motif = "TTTCCC"

for rec in SeqIO.parse("/path/to/file.fasta", "fasta"):
    if not motif in rec.seq:
        rec

You can change rec to a print or whatever you need to do with that specific record; calling the rec directly should print the summary. It's not clear to me if you need the indexes for the match since you've said you just intend to throw the record away, but if you do you can simply use index.

(Code untested).

ADD REPLY

Login before adding your answer.

Traffic: 1441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6