Hi Biostars people,
Some information:
I am in the middle of a proteomics experiment. I have my .raw LC-MS/MS files, my software is set up, and everything is ready to go. The genome of the organism I am working with has not been sequenced yet, but the genome of a closely related species has.
My idea is to search the MS/MS data against a gene-prediction database (e.g. with Mascot). I would like to use gene predictions from the published genome of the related species. Furthermore, redundant entries (at least 90% sequence similarity) would be removed and a common contaminants database would be added (https://maxquant.org/contaminants.zip). I could then compare the protein sequences against the NCBI nr database using the NCBI Basic Local Alignment Search Tool (BLAST), e.g. via the R package Bio3d.
My question (although very broad and hopefully very easy to answer): How do I generate a gene-prediction database from publicly available genomic data on NCBI?
Have mercy on me; I am new to most of this, but I have a rough grasp of the jargon and the concepts.
Thanks for your time
The first step is to predict the genes: https://en.wikipedia.org/wiki/List_of_gene_prediction_software
After that, you would probably want to post a new question about the output. I used AUGUSTUS once a while ago, and it was easy to use. There is also an online version: http://bioinf.uni-greifswald.de/augustus/submission.php
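Once a predictor has produced protein sequences, the database assembly described in the question could be sketched roughly as follows. This is only a minimal Python sketch under several assumptions: the predicted proteins and the contaminants are both available as plain FASTA text, the function names (`parse_fasta`, `build_search_db`, `to_fasta`) are made up for illustration, and the `CON__` header prefix is just one possible way to flag contaminant entries. It only drops exact duplicate sequences; removing entries at a 90% similarity threshold needs a dedicated clustering tool (CD-HIT is one commonly used option) and is not attempted here.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_search_db(predicted: Iterable[Record],
                    contaminants: Iterable[Record],
                    contaminant_prefix: str = "CON__") -> List[Record]:
    """Merge both sets and keep the first record of each distinct sequence.

    Contaminants go first, so a predicted protein that is identical to a
    contaminant is represented by the flagged contaminant entry.
    The prefix is an assumption, not a requirement of any search engine.
    """
    tagged = [(contaminant_prefix + h, s) for h, s in contaminants]
    tagged += list(predicted)
    seen = set()
    merged: List[Record] = []
    for header, seq in tagged:
        # Normalise case and drop a trailing stop-codon asterisk, if any.
        seq = seq.upper().rstrip("*")
        if not seq or seq in seen:
            continue
        seen.add(seq)
        merged.append((header, seq))
    return merged


def to_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Format records as FASTA text with fixed-width sequence lines."""
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

With real files, the two inputs would come from something like `parse_fasta(open("predicted_proteins.fasta"))` and `parse_fasta(open("contaminants.fasta"))` (both file names are placeholders), and the result of `to_fasta(...)` would be written to the FASTA file that the search engine is pointed at.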
Which genome? In general, NCBI genomes have already been annotated with protein predictions.
This is the genome I am talking about: https://www.ncbi.nlm.nih.gov/genome/17773
There is an old annotation here: https://i5k.nal.usda.gov/data/Arthropoda/limlun-%28Limnephilus_lunatus%29/Current%20Genome%20Assembly/
The NCBI genome, which is more recent and contains more data, has not been annotated, though.