I have a draft assembly. Does anyone know of scripts to retrieve all ORFs in protein format from a FASTA file of contigs?
Unless this is a prokaryote, getting all open reading frames from a draft assembly is not an informative analysis (splicing, low gene density), and even for bacteria it is of very limited use. Instead, look into gene prediction, e.g. on BioStar: gene-prediction
I work on microsporidia. They have very few introns (up to 20 genes with introns, some species none at all) and small genomes. They are eukaryotes.
Then you can use getorf as suggested by R@hul, but you should still attempt a proper gene prediction.
Please share a fully functional Perl script to translate cDNA to ORFs (protein), selecting only the longest one. I have ActivePerl installed.
From the EMBOSS toolkit:
getorf -sequence genome.fasta -outseq genome.ORFs -minsize 180 -find 1 &
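For the "longest ORF only" request in the comments, here is a minimal sketch in Python rather than Perl. It is not getorf and will not reproduce its output exactly. It assumes the standard genetic code, defines an ORF as running from the first ATG in a stop-free stretch to the next in-frame stop codon, scans all six frames, and ignores ORFs that run off the end of the sequence. The function names and the script name are made up for illustration.

```python
import sys
from itertools import product

# Standard genetic code (assumption: NCBI table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string (non-ACGT is kept as-is)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_orf(seq):
    """Return the longest ATG-to-stop protein over all six reading frames."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            protein = translate(strand[frame:])
            # Drop the last piece: it is not terminated by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for name, seq in read_fasta(handle):
            orf = longest_orf(seq)
            if orf:
                print(">" + name)
                print(orf)
```

Saved as, say, `longest_orf.py`, it would be run as `python longest_orf.py cdna.fasta > longest_orfs.faa`. If you want a minimum length like the `-minsize 180` above, filter on `len(orf)` before printing (note that getorf counts nucleotides, while this counts amino acids).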