Question

Finding protein- encoding genes in a sequence

0

Entering edit mode

5.6 years ago

Gina ▴ 10

Hi,

How can I find out how many protein-encoding genes I have in my sequence?

sequence blast ORF • 2.0k views

ADD COMMENT • link updated 5.6 years ago by marongiu.luigi ▴ 710 • written 5.6 years ago by Gina ▴ 10

1

Entering edit mode

This sounds like an assignment. If so you should state that clearly. Have you been asked to write a program or just do this using blast (since you included that as a tag)?

Problem description seems clear enough so you should be thinking of solving the questions on your own. Since you don't have access to your input data you have to go forward on your own.

ADD REPLY • link 5.6 years ago by GenoMax 141k

0

Entering edit mode

Depends on what organism the sequence is from. For Eukaryotes you may need to use a gene prediction tool like GENSCAN, EGPRED etc. For prokaryotes you could simply look for start/stop sites and/or use a gene prediction tool like GLIMMER. The predictions can then be followed by sequence searches to identify the genes. If you don't need to precisely predict gene models then a simple blast search may suffice.

Edit: I am going to keep my original comment since you edited the original post.

ADD REPLY • link 5.6 years ago by GenoMax 141k

score 1 · Answer 1 · 2018-09-27

1

Entering edit mode

5.6 years ago

marongiu.luigi ▴ 710

You could try with the EMBOSS command showseq. For instance:

echo atgtttcaggacccacaggagtaa | showseq -filter -threeletter y -format 4

         10        20        
----:----|----:----|----
atgtttcaggacccacaggagtaa
MetPheGlnAspProGlnGlu***

But you can get the all three/six reading frames and single code:

echo atgtttcaggacccacaggagtaa | showseq -filter -format 5

               10        20        
      ----:----|----:----|----
      atgtttcaggacccacaggagtaa

      M  F  Q  D  P  Q  E  *                                      
       C  F  R  T  H  R  S  X                                     
        V  S  G  P  T  G  V  X          
echo atgtttcaggacccacaggagtaa | showseq -filter -format 6


               10        20        
      ----:----|----:----|----
      atgtttcaggacccacaggagtaa

      M  F  Q  D  P  Q  E  *                                      
       C  F  R  T  H  R  S  X                                     
        V  S  G  P  T  G  V  X                                    
      ----:----|----:----|----
       X  N  *  S  G  C  S  Y                                     
      X  T  E  P  G  V  P  T                                      
        H  K  L  V  W  L  L  L

On Linux platforms, EMBOSS is installed with sudo apt install emboss.

ADD COMMENT • link 5.6 years ago by marongiu.luigi ▴ 710

2

Entering edit mode

Can you mention some of the caveats to using this, such as the fact that an open reading frame doesn't always correspond to a gene and that this ignores splicing?

ADD REPLY • link 5.6 years ago by Devon Ryan 104k

0

Entering edit mode

Why would I look at ORFs? This is invalid for model/known organisms and even for de novo sequencing, I'd much later BLAST to closely related organisms than use ORFs to generate theoretical peptide sequences. Not every ATG is going to be a Methionine, you know?

ADD REPLY • link 5.6 years ago by Ram 43k

0

Entering edit mode

but isn't looking for ORF the procedure for start characterizing a genome? Isn't the standard approach something like this figure (taken from Danos et al. EMBO J 1982;1(1):231)? enter image description here

Does the procedure give an exact number? No, it is (both in the figure than with showseq) a graphical representation, but I think is a beginning.

I assumed the sequence was given as fasta format; should one have a GenBank format, it would be possible to get the number of genes with something like (on a unix terminal):

grep 'gene.*[0-9]*\.\.[0-9]' <file>

For instance, saving the data for NC_009333.1 (Human herpesvirus 8) on hhv8.gb, I get:

$ grep 'gene.*[0-9]*\.\.[0-9]' hhv8.gb | wc -l
96

ADD REPLY • link 5.6 years ago by marongiu.luigi ▴ 710

1

Entering edit mode

For simple genomes (where one does not expect genes to be composed of exons/introns) a simplistic approach of looking for ORF is fine. Even then not every open reading frame would be expected to code for a protein. Which is the reason methods using HMM's were developed (even for prokaryotes) to predict coding regions.

ADD REPLY • link 5.6 years ago by GenoMax 141k