Hello, I need to find a start codon and all in frame stop codons related to that primary start codon, i.e startCodon_AGGAAG_stopCodon(1)_GAAGGTAACAGCTCTG_stopCodon(2)_ATCAAGA. This is my code, which gives me the positions of all ORF. How can I modify it to achive positions of start codon and all in frame stop codons related to that primary start codon.
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
def getORF(sequence, treshold, start_codons, stop_codons):
orfs = []
for j in range(0, 3):
start_codon_index = 0
end_codon_index = 0
start_codon_found = False
for indx in range(j, len(sequence), 3):
current_codon = sequence[indx:indx+3]
if current_codon in start_codons and not start_codon_found:
start_codon_found = True
start_codon_index = indx
s=start_codon_index+1
if current_codon in stop_codons and start_codon_found:
end_codon_index = indx
e=end_codon_index+3
length = end_codon_index - start_codon_index + 3
if length >= treshold * 3:
orfs.append(s)
orfs.append(e)
start_codon_found = False
return orfs
f = open("myfile.fa","r")
o = open("out.txt","w")
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
for record in SeqIO.parse(f,'fasta'):
seq=record.seq
#compl_seq =record.seq.reverse_complement()
name=record.id
orfs = getORF(seq, 30, start, stop)
#complement_orfs = getComplementORF(compl_seq, 30, start, stop)
print >> o, name, '+', orfs, '-', complement_orfs
Hi User000,
I'll try to give you some assistance in this, but I'm a bit confused by your example:
startCodon_AGGAAG_stopCodon(1)_GAA**GGT**AAC**AGC**TCTG_stopCodon(2)_ATCAAGA
Why is the part between stopCodon(1) and stopCodon(2) not in-frame? Why is there still a stretch of nucleotides after your final stopCodon?Furthermore, how large is your input? That influences if we can do comfortably everything in memory or whether we have to think of writing memory-efficient scripts ;)
Hi, thanks a lot. These are my stop codons "TAA","TAG","TGA". I have a lot of artificial cds, (50K seq-s) and I would like to find a 1st start codon and all in frame stop codons related to that primary start codon..
Yes I see, but in your example, the second stop codon is not in frame...
So actually, you just have to search for the first start codon in the sequence. As soon as you found that, you can make in frame codons following that start codon. Then you just need to translate these codons and identify stop codons. Or am I oversimplifying things?
I explain better. I have CDS(those do not necessarily have a stop codon),so I add several bp artificially to my cds seq-s and want find all stop codons that are in frame with the 1st start codon. And i need positions of this start and stop codons..In the code above, it gives me start codon and stop codon (ORF), then it looks for the next start codon and stop codon. I dont know if it is explained better...
Might be caffeine deprivation of my brain, but it's not yet completely clear to me. Why do you add artificial nucleotides to the cds?
not artificial nucleotides :) I lengthen them for ie 100 bp obviously using coordinates from genome, in such a way creating kind of artificial CDS, so I am interested to see how far away are my stop codons, so I can evaluate whether to lenghen my cds o not...
So let me try to summarize: 0) input is fasta 1) first job: find the first start codon 2) then: find all stop codons in frame with this first start codon 3) report as: startCodon_AGGAAG_stopCodon(1)_GAAGGTAACAGCTCT_stopCodon(2)_blablabla_finalStopCodon
resport as name_id pos_start_fr1 pos_stop1 pos_stop2, pos_start_fr2 pos_stop1 post_stop2
So if I explain roughly the algorithm is like that: take a sequence and look for start codon in 3 frames (actually 6 cos I need also complementary),find 1st start codon write it,search for in frame stop codon,write it,search for the next in frame stop codon write it..smth like this, does it make sense?
Do you need per sequence just one start codon, or the first start codon for each of the 6 reading frames?
start codon for each of the 6 frames if there is. so I prefer the output to be like this (+1 pos_start codons pos_stop_codon1 pos_stop_codon2, +2 pos_start codons pos_stop_codon1 pos_stop_codon2 ),otherwise I dont need frames written, but if divided by a comma I can understand that these are different frames..
Hahha, this thread xD
I'll probably write some real code with some pseudocode tomorrow to get him started. First thing I thought yesterday was "not another ORF script"...
What happens if there is more than one ATG in the fasta entry? Is only the first considered the start?
In which case, could the output look something like:
I would consider only the first ATG as start and would like results as in the code (start1 end1, start1 end2) or (start1, end1, end2).