I am trying to compile a list of known peptide hormones and neuropeptides from sequences deposited in NCBI's GenBank and Nucleotide databases. For each gene, I want to extract the peptide sequence of the entire coding region, as well as the sequences of the signal peptide and the mature peptide(s).
I could do this manually for the few dozen genes I have at the moment; however, I know there are more than 100, possibly as many as ~500, genes of interest in total. If I compile a list of genes, is there already a program or workflow to batch-extract this information from NCBI? Are there resources you can link so I can get started?
Here's an example. GenBank contains the complete coding sequence (mRNA) for the INS gene, which encodes proinsulin, the precursor to insulin (the active hormone). That GenBank page lists a number of "features" of the sequence. The CDS is the full product of the INS gene, composed of the signal peptide and the mature peptide. I would like to extract these three features.
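To make the request concrete, here is a minimal sketch of the kind of extraction I mean, using only the Python standard library. It assumes the record is fetched as GenBank flat-file text through NCBI's E-utilities `efetch` endpoint, and the function names (`fetch_genbank`, `translate`, `extract_peptides`) are my own, not part of any NCBI tool. It only handles forward-strand features with simple or `join(...)` locations, ignores the `/codon_start` qualifier, and uses the standard genetic code, so treat it as an illustration rather than a finished workflow.

```python
import re
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
WANTED_KEYS = ("CDS", "sig_peptide", "mat_peptide")

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}

# A feature line is 5 spaces, the key, then the location; a long location
# may continue on lines indented 21 spaces that do not start with "/".
FEATURE_RE = re.compile(r"^ {5}(\S+) +(\S+(?:\n {21}[^/\s]\S*)*)", re.M)
SPAN_RE = re.compile(r"[<>]?(\d+)\.\.[<>]?(\d+)")


def fetch_genbank(accession):
    """Download one nucleotide record as GenBank flat-file text."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{query}") as response:
        return response.read().decode("utf-8")


def translate(dna):
    """Translate DNA in frame 1; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def extract_peptides(gb_text):
    """Return (feature key, location, peptide) for each wanted feature."""
    header, _, origin = gb_text.partition("\nORIGIN")
    features = header.partition("\nFEATURES")[2]
    # Drop the rest of the ORIGIN line, stop at "//", keep only the letters.
    body = origin.partition("\n")[2].split("//")[0]
    sequence = re.sub(r"[^A-Za-z]", "", body)

    results = []
    for match in FEATURE_RE.finditer(features):
        key = match.group(1)
        location = re.sub(r"\s+", "", match.group(2))
        if key not in WANTED_KEYS or "complement" in location:
            continue  # sketch only: reverse-strand features are skipped
        # Locations are 1-based and inclusive; join(...) spans are concatenated.
        dna = "".join(
            sequence[int(start) - 1:int(end)]
            for start, end in SPAN_RE.findall(location)
        )
        # Trailing stop codon (present in the CDS) is trimmed from the peptide.
        results.append((key, location, translate(dna).rstrip("*")))
    return results
```

For a list of accessions, the idea would be to call `extract_peptides(fetch_genbank(acc))` once per accession and pause between requests, since NCBI limits clients without an API key to about three requests per second. A dedicated parser such as Biopython's GenBank reader would likely cope better with the edge cases this sketch skips.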
I added an example, and your comment was instructive. I did not know much about the UniProt database. I was able to get most of the information I needed from this search.
I edited my reply to add an NCBI solution as well.
Thanks for the edit. I've accepted your answer. :)