HGNC cross-references in UniProt
14 months ago
cdsouthan ♦ 1.8k

There are 19035 protein-coding rows in the HGNC download, but the UniProt column for those 19035 rows collapses to 18883 unique entries, implying 432 one-to-many Swiss-Prot > HGNC.

However, when I query UniProt with database:(type:hgnc) AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" I get 19960 of the 20,168, implying 905 for the same 1:many, but I can only find 152 duplicates in the column.
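For anyone repeating the count, note that "duplicates in the column" can mean several different numbers: rows minus distinct values, the number of accessions that repeat, or the number of rows involved in a repeat. For reference, 19035 minus 18883 is 152. Below is a minimal Python sketch that reports all three for a column of accessions. The function name and the toy values are made up for illustration; they are not taken from the HGNC download.

```python
from collections import Counter


def column_duplicate_summary(values):
    """Summarise repeats in a column of accessions (blank cells are ignored)."""
    filled = [v.strip() for v in values if v and v.strip()]
    counts = Counter(filled)
    repeated = {acc: n for acc, n in counts.items() if n > 1}
    return {
        "filled_rows": len(filled),               # non-blank cells
        "distinct": len(counts),                  # unique accessions
        "repeated_accessions": len(repeated),     # accessions seen more than once
        "rows_in_repeats": sum(repeated.values()),  # rows carrying a repeated accession
        "excess_rows": len(filled) - len(counts),   # rows minus distinct values
    }


# Toy column (hypothetical values): P1 appears three times, P2 twice, one blank cell.
toy = ["P1", "P1", "P1", "P2", "P2", "P3", ""]
summary = column_duplicate_summary(toy)
print(summary)
```

Comparing "excess_rows" with "repeated_accessions" on the real column should show which definition each of the quoted figures corresponds to.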

Can anyone who's been doing something similar help out here? (Note that it falls between two help desks.)


After some hours of head scratching, cross-checking and making Venn intersects (see Twitter), I think I have an explanation. So no one needs to dive into this if they have better things to do, but I will hold off on my conclusions for a time, just to see if anyone wants to come up with an independent, corroborating explanation (which I actually think is important for the domain of protein annotation).


Thanks for all the comments. I managed the review in the end: "Last rolls of the yoyo: Assessing the human canonical protein count [version 1; referees: awaiting peer review]" https://f1000research.com/articles/6-448/v1 Feedback welcome.

14 months ago
me • 690
Switzerland

In UniProt release 2017_02 there are 171 UniProtKB/Swiss-Prot entries with more than one HGNC link, while 52 HGNC links point to more than one UniProtKB/Swiss-Prot entry. The first query below lists the entries with more than one HGNC link.

Unfortunately HGNC does not offer a SPARQL endpoint, so there is no nice way to do this kind of analysis on the HGNC side.

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
SELECT 
    ?protein 
    (GROUP_CONCAT(SUBSTR(STR(?db),30);separator=',') AS ?hgncs)
WHERE
{
   ?protein a up:Protein .
   ?protein up:reviewed true .
   ?protein rdfs:seeAlso ?db .
   ?db up:database <http://purl.uniprot.org/database/HGNC>
} GROUP BY ?protein HAVING (COUNT(DISTINCT(?db)) >1)

The inverse query asks for HGNC links that are present in more than one UniProtKB/Swiss-Prot entry:

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
SELECT 
    ?db 
    (GROUP_CONCAT(SUBSTR(STR(?protein),33);separator=',') AS ?proteins)
WHERE
{
  ?protein a up:Protein .
  ?protein up:reviewed true .
  ?protein rdfs:seeAlso ?db .
  ?db up:database <http://purl.uniprot.org/database/HGNC>
} GROUP BY ?db HAVING (COUNT(DISTINCT(?protein)) >1)
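
If you have the cross-references as a flat list of (UniProt accession, HGNC id) pairs, for example exported from either resource, the same GROUP BY / HAVING logic can be reproduced locally. This is only a rough Python sketch of that idea, not a replacement for the queries above; the function name and the identifiers in the toy data are hypothetical.

```python
from collections import defaultdict


def one_to_many(pairs):
    """From (protein, hgnc) pairs, return proteins with >1 HGNC id and HGNC ids with >1 protein."""
    hgnc_by_protein = defaultdict(set)
    protein_by_hgnc = defaultdict(set)
    for protein, hgnc in pairs:
        hgnc_by_protein[protein].add(hgnc)
        protein_by_hgnc[hgnc].add(protein)
    # Equivalent of HAVING (COUNT(DISTINCT ...) > 1) in each direction.
    multi_hgnc = {p: sorted(h) for p, h in hgnc_by_protein.items() if len(h) > 1}
    multi_protein = {h: sorted(p) for h, p in protein_by_hgnc.items() if len(p) > 1}
    return multi_hgnc, multi_protein


# Made-up identifiers, purely for illustration.
toy_pairs = [
    ("ACC_A", "HGNC:1"), ("ACC_A", "HGNC:2"),  # one protein, two HGNC ids
    ("ACC_B", "HGNC:3"), ("ACC_C", "HGNC:3"),  # one HGNC id, two proteins
    ("ACC_D", "HGNC:4"),                       # plain one-to-one
]
multi_hgnc, multi_protein = one_to_many(toy_pairs)
print(len(multi_hgnc), len(multi_protein))
```

Using sets means a pair listed twice in the input is counted once, which mirrors the DISTINCT in the HAVING clauses.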

OK, thanks, but the biological/curation issue behind the numbers above is as follows:

It looks like Swiss-Prot has included a large number of proteins (on the order of 500-800) that HGNC does not classify as protein-coding. The largest categories, I think (by manual inspection of matches from segments of the Venn I put on Twitter), are endogenous retroviruses, long non-coding RNAs and odour receptor pseudogenes. This is numerically dominant over the relatively small number of one-to-many cases (SP <> HGNC in both directions, as Jerv shows), which I think both resources agree on as proteins.
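
The Venn segments mentioned here come down to set differences between two lists of HGNC ids: those cross-referenced from Swiss-Prot human entries and those HGNC labels as protein-coding. A minimal Python sketch under that assumption follows; the function name and the toy ids are hypothetical, and the real inputs would be the two downloaded id lists.

```python
def venn_segments(swissprot_hgnc_ids, hgnc_protein_coding_ids):
    """Split two collections of HGNC ids into the three segments of a two-set Venn."""
    sp = set(swissprot_hgnc_ids)
    pc = set(hgnc_protein_coding_ids)
    return {
        "swissprot_only": sorted(sp - pc),  # in Swiss-Prot, not protein-coding in HGNC
        "both": sorted(sp & pc),            # agreed by both resources
        "hgnc_only": sorted(pc - sp),       # protein-coding in HGNC, no Swiss-Prot link
    }


# Toy ids, made up for illustration.
segments = venn_segments(
    ["HGNC:1", "HGNC:2", "HGNC:3", "HGNC:9"],
    ["HGNC:1", "HGNC:2", "HGNC:3", "HGNC:7"],
)
print({name: len(ids) for name, ids in segments.items()})
```

The "swissprot_only" segment is the one to inspect manually for categories such as those listed above.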
