Hello everyone!
To test RATT (Rapid Annotation Transfer Tool), I need an EMBL file that contains only a subset of genes of my choice. RATT takes the annotation of a closely related species and transfers it to a species that has not been annotated yet, which gives an approximate annotation. I am using RATT following this tutorial: http://avrilomics.blogspot.fr/2013/04/training-augustus-gene-finding-software.html. Once again I am working on D. melanogaster (the close relative) and D. suzukii (the unstudied species).
For example, my EMBL files contain many identifiers such as FBgn0000120, then the translations of the genes along with many other features, and, at the end of the file, the whole nucleotide sequence of the chromosome. I want to keep only a few CDS features, for example a subset of 10 genes, and the result must still be in EMBL format.
Does anyone have an idea of how to do this? I did some searching but did not find anything useful.
Thanks for your help!
Roxane
If you are familiar with Perl, you can use BioPerl to parse the EMBL file; the Beginners How-To should help you write the script quickly.
Another quick and dirty option is to parse the file directly, setting Perl's input record separator so that the file is read record by record, with something like
$/ = "\n//";
and then output only the records you want.
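A minimal sketch of that record-by-record idea, using only core Perl. The function name filter_embl_records and the command-line layout are made up for illustration, and the sketch assumes that each entry starts with an ID line whose first field is the entry name. Note that a "record" here is a whole EMBL entry (everything up to the // line, typically one chromosome or scaffold), so this selects entries, not individual CDS features inside an entry; for feature-level subsetting a real parser such as BioPerl is probably the better fit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read EMBL text record by record and copy only the
# entries whose ID line names one of the wanted entries.
#   $in_fh  - input filehandle (EMBL flat file)
#   $out_fh - output filehandle
#   $wanted - hashref whose keys are the entry names to keep
# Returns the number of entries written.
sub filter_embl_records {
    my ($in_fh, $out_fh, $wanted) = @_;
    local $/ = "\n//";                 # each read stops at the "//" terminator line
    my $kept = 0;
    while (my $record = <$in_fh>) {
        $record =~ s/\A\s+//;          # drop the newline left after the previous "//"
        # Assumption: the entry name is the first field of the ID line.
        next unless $record =~ /\AID\s+([^\s;]+)/;
        next unless $wanted->{$1};
        chomp $record;                 # strips the trailing "\n//" (current value of $/)
        print {$out_fh} $record, "\n//\n";
        $kept++;
    }
    return $kept;
}

# Command-line use: script.pl file.embl ENTRY [ENTRY ...]
if (@ARGV >= 2) {
    my ($embl_file, @names) = @ARGV;
    my %wanted = map { $_ => 1 } @names;
    open my $in, '<', $embl_file or die "Cannot open $embl_file: $!";
    filter_embl_records($in, \*STDOUT, \%wanted);
    close $in;
}
```

Saved as, say, filter_embl.pl (a hypothetical name), it could be run as "perl filter_embl.pl input.embl 2L 3R > subset.embl", where 2L and 3R stand for whatever entry names appear on the ID lines of your file.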