I'm hoping to get some advice on the best way to build a gene set from the UniProtKB dataset. My set needs to include both genes and proteins, which UniProtKB provides nicely, but the only identifier it offers for a gene is a plain-text primary name, and sometimes only an ORF name or ordered locus name. I could use these names as keys for my set of genes, but I suspect that approach will cause problems.
For example:
- What if, within one organism, two distinct genes share the same primary name, ordered locus name, ORF name, etc.?
- What if names change drastically from one release to another?
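To make the first concern concrete, here is a rough sketch of the check I have in mind. It assumes I have already pulled (accession, organism ID, gene name) rows out of UniProtKB. The function name `find_name_collisions`, the accessions, and the gene names are all made up for illustration, and upper-casing the names before comparing is my own assumption, not something UniProtKB prescribes. A hit only flags a candidate: several entries sharing a name may still describe the same gene.

```python
from collections import defaultdict


def find_name_collisions(records):
    """Return {(organism_id, gene_name): [accessions]} for names used by
    more than one entry within the same organism."""
    groups = defaultdict(set)
    for accession, organism_id, gene_name in records:
        if not gene_name:
            # Entries without any gene name cannot be keyed this way.
            continue
        # Assumption: names are compared case-insensitively.
        key = (organism_id, gene_name.strip().upper())
        groups[key].add(accession)
    return {key: sorted(accs) for key, accs in groups.items() if len(accs) > 1}


# Hypothetical rows: (accession, organism taxonomy ID, gene name)
records = [
    ("ACC0001", 9606, "GENEA"),
    ("ACC0002", 9606, "GeneA"),
    ("ACC0003", 9606, "GENEB"),
    ("ACC0004", 10090, "GENEA"),
    ("ACC0005", 9606, ""),
]

collisions = find_name_collisions(records)
print(collisions)
```

Running the same check against two releases and comparing the keys would give a rough idea of how much the names shift between releases.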
I know that many of the genes link to other databases, such as Entrez Gene, which has a stable set of "gene identifiers", but definitely not all of them do. Also, I'd prefer to stick with UniProtKB alone if possible rather than mix and match multiple resources, but if anyone has experience with this, I'm wide open to suggestions. What have you guys done in similar situations?