Getting in frame stop codons from fasta
1
0
Entering edit mode
7.5 years ago
User000 ▴ 690

Hello, I need to find a start codon and all in frame stop codons related to that primary start codon, i.e startCodon_AGGAAG_stopCodon(1)_GAAGGTAACAGCTCTG_stopCodon(2)_ATCAAGA. This is my code, which gives me the positions of all ORF. How can I modify it to achive positions of start codon and all in frame stop codons related to that primary start codon.

from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
def getORF(sequence, treshold, start_codons, stop_codons):
    orfs = []
    for j in range(0, 3):
        start_codon_index = 0
        end_codon_index = 0
        start_codon_found = False
        for indx in range(j, len(sequence), 3):
            current_codon = sequence[indx:indx+3]
            if current_codon in start_codons and not start_codon_found:
                start_codon_found = True
                start_codon_index = indx
                s=start_codon_index+1
            if current_codon in stop_codons and start_codon_found:
                end_codon_index = indx
                e=end_codon_index+3
                length = end_codon_index - start_codon_index + 3
                if length >= treshold * 3:
                    orfs.append(s)
                    orfs.append(e)
                start_codon_found = False
    return orfs
f = open("myfile.fa","r")
o = open("out.txt","w")
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
for record in SeqIO.parse(f,'fasta'):
    seq=record.seq
    #compl_seq =record.seq.reverse_complement()
    name=record.id
    orfs = getORF(seq, 30, start, stop)
    #complement_orfs = getComplementORF(compl_seq, 30, start, stop)
    print >> o, name, '+', orfs, '-', complement_orfs
python • 6.0k views
ADD COMMENT
1
Entering edit mode

Hi User000,

I'll try to give you some assistance in this, but I'm a bit confused by your example: startCodon_AGGAAG_stopCodon(1)_GAA**GGT**AAC**AGC**TCTG_stopCodon(2)_ATCAAGA Why is the part between stopCodon(1) and stopCodon(2) not in-frame? Why is there still a stretch of nucleotides after your final stopCodon?

Furthermore, how large is your input? That influences if we can do comfortably everything in memory or whether we have to think of writing memory-efficient scripts ;)

ADD REPLY
0
Entering edit mode

Hi, thanks a lot. These are my stop codons "TAA","TAG","TGA". I have a lot of artificial cds, (50K seq-s) and I would like to find a 1st start codon and all in frame stop codons related to that primary start codon..

ADD REPLY
0
Entering edit mode

Yes I see, but in your example, the second stop codon is not in frame...

So actually, you just have to search for the first start codon in the sequence. As soon as you found that, you can make in frame codons following that start codon. Then you just need to translate these codons and identify stop codons. Or am I oversimplifying things?

ADD REPLY
0
Entering edit mode

I explain better. I have CDS(those do not necessarily have a stop codon),so I add several bp artificially to my cds seq-s and want find all stop codons that are in frame with the 1st start codon. And i need positions of this start and stop codons..In the code above, it gives me start codon and stop codon (ORF), then it looks for the next start codon and stop codon. I dont know if it is explained better...

ADD REPLY
0
Entering edit mode

Might be caffeine deprivation of my brain, but it's not yet completely clear to me. Why do you add artificial nucleotides to the cds?

ADD REPLY
0
Entering edit mode

not artificial nucleotides :) I lengthen them for ie 100 bp obviously using coordinates from genome, in such a way creating kind of artificial CDS, so I am interested to see how far away are my stop codons, so I can evaluate whether to lenghen my cds o not...

ADD REPLY
0
Entering edit mode

So let me try to summarize: 0) input is fasta 1) first job: find the first start codon 2) then: find all stop codons in frame with this first start codon 3) report as: startCodon_AGGAAG_stopCodon(1)_GAAGGTAACAGCTCT_stopCodon(2)_blablabla_finalStopCodon

ADD REPLY
0
Entering edit mode

resport as name_id pos_start_fr1 pos_stop1 pos_stop2, pos_start_fr2 pos_stop1 post_stop2

ADD REPLY
0
Entering edit mode

So if I explain roughly the algorithm is like that: take a sequence and look for start codon in 3 frames (actually 6 cos I need also complementary),find 1st start codon write it,search for in frame stop codon,write it,search for the next in frame stop codon write it..smth like this, does it make sense?

ADD REPLY
0
Entering edit mode

Do you need per sequence just one start codon, or the first start codon for each of the 6 reading frames?

ADD REPLY
0
Entering edit mode

start codon for each of the 6 frames if there is. so I prefer the output to be like this (+1 pos_start codons pos_stop_codon1 pos_stop_codon2, +2 pos_start codons pos_stop_codon1 pos_stop_codon2 ),otherwise I dont need frames written, but if divided by a comma I can understand that these are different frames..

ADD REPLY
0
Entering edit mode

Hahha, this thread xD

ADD REPLY
2
Entering edit mode

I'll probably write some real code with some pseudocode tomorrow to get him started. First thing I thought yesterday was "not another ORF script"...

ADD REPLY
0
Entering edit mode

What happens if there is more than one ATG in the fasta entry? Is only the first considered the start?

In which case, could the output look something like:

ID13123
    ORF 1: None
    ORF 2:
        Start - 342
        Ends - 453,499,589
    ORF 3:
        Start - 32
        Ends - 94,399,909,4302
ID13124
    ....
ADD REPLY
0
Entering edit mode

I would consider only the first ATG as start and would like results as in the code (start1 end1, start1 end2) or (start1, end1, end2).

ADD REPLY
4
Entering edit mode
7.5 years ago

After the lovely discussion above it's time to start thinking about answers :) I wanted to write a few pointers to help you write the script, but got carried away while writing and produced it almost completely. Left a few pieces for you to complete ;)

If anything is unclear, let me know.

As you can see I splitted the function(s) in multiple very simple functions which only do one job. Because that makes writing, thinking and debugging so much easier. Code is far more readable. You start by writing just the frame work. First you need readingframes. Then you need startcodons, then you need stopcodons. Every function does his part of the chain. The function makeframes() is for you to write. If you have question about how to solve it or change the code to suit your needs, let me know!

ADD COMMENT
0
Entering edit mode

thank you so much! Looks super difficult for me, as I don't use a lot of things you used. About frames, may be I can create 3 frames and at the end use compl_seq =record.seq.reverse_complement() as in my code above? Also, while thinking how to do it, I discovered that there is a orf predictor that does almost what I want, I wish I could have it's script, but I guess it is written in perl.

ADD REPLY
1
Entering edit mode

Take your time to go through it and I'd be happy to explain parts and functions you don't understand. But with perl stuff I can't help you ;)

Yes, you can use the .reverse_complement() method BUT I already converted each sequence to a string in line 48 in the list comprehension (my favorite construction in python). So just using .reverse_complement() no longer works, because a string is not a seq object and doesn't have a .reverse_complement() method. But of course you can change line 48 so the object remains a seq object by removing the str() function. But you need to convert it to a string then in the makeframes() function, because the code needs strings downstream. This is because a seq object doesn't have a .index() method. Is that clear?

ADD REPLY
0
Entering edit mode

Yes, more or less! I might come back to you later with questions!THNAKS

ADD REPLY
0
Entering edit mode

Dear @Wouter, the "lovely discussion" you have mentioned was cool!

And explaining different parts of the valuable script would be a class for me, too

Thank you

ADD REPLY
1
Entering edit mode

For "educational purposes" I added comments to the code above explaining almost every line, let me know if something isn't clear.

ADD REPLY
0
Entering edit mode

Hi! So,I added makeframes function in the code above, smth like this:

for n in range(0,len(sequence),3):
    frames = sequence[n:n+3]
return([sequence])

It is giving me the result as seq-s, while I want ----- > a header, position_start_codon, pos_stop,pos_next_stop How to modify?thanks

ADD REPLY
0
Entering edit mode

That makeframes() function doesn't make sense. You are just returning sequence in a list, and your frames = sequence[n:n+3] is not doing what you think it does.

For getting the position you'll need to use the index, for example using enumerate.

ADD REPLY
0
Entering edit mode

for frame in range(0,3): return([sequence])

?

ADD REPLY
0
Entering edit mode

You only need 3 reading frames? Then why not just

def makeframes(sequence):
    return([sequence, sequence[1:], sequence[2:]])
ADD REPLY
0
Entering edit mode

I need 6, but I was thinking to use reserve_compl later on..

ADD REPLY
0
Entering edit mode

Possible, but I think it would make more sense to do it in this step and return 6 reading frames. However, you already decided that you need positions rather than sequences and you are free to write this script as you feel like :)

ADD REPLY
0
Entering edit mode

Ok, I try.. thanks you

ADD REPLY

Login before adding your answer.

Traffic: 2579 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6