7.5 years ago
zebudavid
Hi, I am new to Biopython and I've been exploring the SeqIO module to iterate over a FASTA file, specifically for a regex (regular expression) task. What I need to do is find polyQ agglomerations in the human proteome. Here is what I have right now (using the latest proteome from the NCBI FTP site):
import re
from Bio import SeqIO

def reader():
    for seq_record in SeqIO.parse("Gnomon_prot_micro.fsa", "fasta"):
        sequence = (str(seq_record.seq))
        print(sequence)  # just to verify
        gene_name = (seq_record.id)
        print(gene_name)  # just to verify
        compiler = re.compile('QQQ+')
        while sequence:  # do I need to start a new cycle to iterate over sequence?
            read = re.finditer(compiler, sequence)
            for m in read:
                print(m.start(), m.group())  # need to get bulk position and how many Qs
Is there a better way to iterate over the file to obtain only the sequences? (I will associate every sequence with its respective gene name later.) Sorry for being a noob, and please correct me if anything is wrong.
It's better to compile the pattern for re outside of the for loop; you only need to do that once. I'm not sure why you use the while loop... Which output do you aim to obtain?
Hey, thanks for answering! The while loop is just there to sketch the idea. I aim to obtain a document that will display for every gene:
Since I do not want a 'code service', could you please just point out what your suggestions would be?
I think you should remove the while loop or adapt it. Since sequence is a non-empty string that never changes, it always evaluates as "True", so you effectively have an endless loop here. What about using m.span() to get both start and end (which then also gives you the length)? While looping over your iterator read, you will have to increment a counter to track how many matches you have. Am I making sense?
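To make the m.span() and counter suggestion concrete, here is a minimal sketch on a plain string. The function name find_polyq and the toy sequence are made up for illustration; the pattern is the one from the question.

```python
import re

# Runs of three or more Q, as in the question's pattern.
POLYQ = re.compile(r"QQQ+")

def find_polyq(sequence):
    """Return a list of (start, end, length) per polyQ run, plus the match count.

    find_polyq is a hypothetical helper name, not part of Biopython.
    """
    runs = []
    count = 0
    for m in POLYQ.finditer(sequence):
        start, end = m.span()  # end is exclusive, so length is end - start
        runs.append((start, end, end - start))
        count += 1  # counter tracking how many matches were seen
    return runs, count

# Toy sequence (not real data): one run of 4 Qs, a QQ that is too short, one run of 3 Qs.
runs, count = find_polyq("MAQQQQLTPQQSSQQQ")
print(runs, count)
```

No while loop is needed: finditer already walks the whole sequence once and yields each non-overlapping match.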
I found a working solution! :
Any further suggestions or comments to improve it?
I'm not sure why you add brackets around seq_record.id and str(seq_record.seq), but okay, it probably shouldn't matter. It's not crucial, but I would put the compiler = re.compile('QQ+') line just above the for loop, still inside the reader function (perhaps a more precise function name is a good idea, but that's a detail). You will now need to nicely format your output, but that's straightforward I guess.
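As a rough sketch of those last points (pattern compiled once, a more descriptive function name, tab-separated output), something like the following could work. The name polyq_report, the (gene_name, sequence) input pairs, and the column layout are assumptions for illustration, not the poster's actual solution; the 'QQ+' pattern is the one quoted above.

```python
import re

# Compiled once, outside any loop. 'QQ+' matches runs of two or more Q.
POLYQ = re.compile(r"QQ+")

def polyq_report(records):
    """Build one tab-separated line per polyQ run: gene name, start, length.

    records is any iterable of (gene_name, sequence) string pairs.
    polyq_report is a hypothetical name chosen for this sketch.
    """
    lines = []
    for gene_name, sequence in records:
        for m in POLYQ.finditer(sequence):
            start, end = m.span()
            lines.append(f"{gene_name}\t{start}\t{end - start}")
    return lines

# Toy records standing in for parsed FASTA entries (not real data).
toy_records = [("geneA", "MAQQQQL"), ("geneB", "MKT"), ("geneC", "QQAQQQ")]
report = polyq_report(toy_records)
print("\n".join(report))
```

With Biopython, the pairs could be supplied as ((rec.id, str(rec.seq)) for rec in SeqIO.parse("Gnomon_prot_micro.fsa", "fasta")), so the file is still read one record at a time.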
Thanks! Yeah, I'll format it.