
HGNC cross-references in UniProt
4.4 years ago
cdsouthan ★ 1.8k

There are 19035 protein-coding rows in the HGNC download, but the UniProt column for those 19035 rows collapses to 18883 unique values, implying 432 one-to-many Swiss-Prot > HGNC.

However, when I query UniProt with database:(type:hgnc) AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" I get 19960 of the 20,168, implying 905 for the same 1:many, but I can only find 152 duplicates in the column.

Can anyone who's been doing something similar help out here? (Note that it falls between two help desks.)
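For anyone who wants to repeat the column count, here is a minimal Python sketch. It assumes a tab-separated HGNC download with columns named `locus_group` and `uniprot_ids`, and `|` as the separator inside a cell; those names, the helper `summarise_uniprot_column` and the sample rows are placeholders for illustration, not real data. The point it makes is that "rows minus unique values" may mix three different things: empty cells, cells holding several accessions, and accessions shared between rows.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for the HGNC download (column names are assumptions).
SAMPLE_TSV = (
    "hgnc_id\tsymbol\tlocus_group\tuniprot_ids\n"
    "HGNC:1\tGENE1\tprotein-coding gene\tACC1\n"
    "HGNC:2\tGENE2\tprotein-coding gene\tACC2\n"
    "HGNC:3\tGENE3\tprotein-coding gene\tACC2\n"
    "HGNC:4\tGENE4\tprotein-coding gene\tACC3|ACC4\n"
    "HGNC:5\tGENE5\tprotein-coding gene\t\n"
    "HGNC:6\tGENE6\tpseudogene\tACC5\n"
)


def summarise_uniprot_column(handle, column="uniprot_ids", sep="|"):
    """Count empty, multi-valued and shared accessions for protein-coding rows."""
    rows = empty = multi = 0
    per_accession = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["locus_group"] != "protein-coding gene":
            continue
        rows += 1
        accessions = [a.strip() for a in row[column].split(sep) if a.strip()]
        if not accessions:
            empty += 1
        elif len(accessions) > 1:
            multi += 1
        # Count each accession once per row.
        per_accession.update(set(accessions))
    shared = {acc: n for acc, n in per_accession.items() if n > 1}
    return {
        "rows": rows,
        "rows_without_accession": empty,
        "rows_with_several_accessions": multi,
        "distinct_accessions": len(per_accession),
        "accessions_on_several_rows": shared,
    }


summary = summarise_uniprot_column(io.StringIO(SAMPLE_TSV))
print(summary)
```

To run it on a real file, replace `io.StringIO(SAMPLE_TSV)` with `open(path, newline="")` and adjust the column names to whatever the download actually uses.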

HGNC human proteins uniprot • 1.1k views

After some hours of head scratching, cross-checking and making Venn intersects (see Twitter), I think I have an explanation. So no one needs to dive into this if they have better things to do, but I will hold off on my conclusions for a while, just to see if anyone wants to come up with an independent, corroborating explanation (which I actually think is important for the domain of protein annotation).


Thanks for all the comments. I managed the review in the end: "Last rolls of the yoyo: Assessing the human canonical protein count [version 1; referees: awaiting peer review]" https://f1000research.com/articles/6-448/v1 Feedback welcome.

4.4 years ago
me ▴ 740

In UniProt release 2017_02 there are 171 UniProtKB/Swiss-Prot entries with more than one HGNC link, while 52 HGNC links point to more than one UniProtKB/Swiss-Prot entry.

On the HGNC side there is unfortunately no SPARQL endpoint, so there is no nice way to do this kind of analysis there. The query for UniProtKB/Swiss-Prot entries with more than one HGNC link:

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Reviewed (Swiss-Prot) entries that link to more than one HGNC record.
# SUBSTR(..., 30) drops the leading 29 characters of the HGNC link URI,
# keeping only the identifier.
SELECT
  ?protein
  (GROUP_CONCAT(SUBSTR(STR(?db), 30); separator=',') AS ?hgncs)
WHERE
{
  ?protein a up:Protein .
  ?protein up:reviewed true .
  ?protein rdfs:seeAlso ?db .
  ?db up:database <http://purl.uniprot.org/database/HGNC>
}
GROUP BY ?protein
HAVING (COUNT(DISTINCT(?db)) > 1)


The inverse query asks for HGNC links present in more than one UniProtKB/Swiss-Prot entry:

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# HGNC records linked from more than one reviewed (Swiss-Prot) entry.
# SUBSTR(..., 33) drops the leading 32 characters of the protein URI,
# keeping only the accession.
SELECT
  ?db
  (GROUP_CONCAT(SUBSTR(STR(?protein), 33); separator=',') AS ?proteins)
WHERE
{
  ?protein a up:Protein .
  ?protein up:reviewed true .
  ?protein rdfs:seeAlso ?db .
  ?db up:database <http://purl.uniprot.org/database/HGNC>
}
GROUP BY ?db
HAVING (COUNT(DISTINCT(?protein)) > 1)
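If you would rather script this than paste the queries into the web form, a rough Python sketch follows. It assumes the public endpoint lives at `https://sparql.uniprot.org/sparql` and returns the standard SPARQL JSON results format when asked for `application/sparql-results+json`; the function names `run_query` and `links_per_protein` and the sample payload are made up for illustration. The parser expects the `?protein` and `?hgncs` variables of the first query.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL; check it before relying on it.
ENDPOINT = "https://sparql.uniprot.org/sparql"


def run_query(query, endpoint=ENDPOINT):
    """POST a SPARQL query and return the decoded JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def links_per_protein(payload):
    """Map each accession to its list of HGNC identifiers."""
    result = {}
    for binding in payload["results"]["bindings"]:
        # Keep only the last path segment of the protein URI.
        accession = binding["protein"]["value"].rsplit("/", 1)[-1]
        result[accession] = binding["hgncs"]["value"].split(",")
    return result


# Hand-written payload in the SPARQL JSON results shape (placeholder values).
SAMPLE_PAYLOAD = {
    "head": {"vars": ["protein", "hgncs"]},
    "results": {
        "bindings": [
            {
                "protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/ACC1"},
                "hgncs": {"type": "literal", "value": "111,222"},
            }
        ]
    },
}

print(links_per_protein(SAMPLE_PAYLOAD))
```

With network access, `links_per_protein(run_query(query_text))` would give the same mapping for the live data, where `query_text` holds the first query as a string.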


OK, thanks, but the biological/curation issue behind the numbers above is as follows:

It looks like Swiss-Prot has included a large number of proteins (on the order of 500-800) that HGNC does not classify as protein-coding. The largest categories, I think (from manual inspection of matches in segments of the Venn I put on Twitter), are endogenous retroviruses, long non-coding RNAs and odour receptor pseudogenes. This group numerically dominates the relatively small number of one-to-many mappings (Swiss-Prot <> HGNC in both directions, as Jerv shows), which I think both resources agree are proteins.
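The Venn comparison itself is just set arithmetic. A small Python sketch, assuming you already have the Swiss-Prot accessions that carry an HGNC link and the accessions from the HGNC protein-coding rows; the function names, accessions and locus-type labels here are placeholders, not real records:

```python
from collections import Counter


def venn_segments(swissprot_accessions, hgnc_coding_accessions):
    """Split two accession collections into the three Venn segments."""
    sp = set(swissprot_accessions)
    hg = set(hgnc_coding_accessions)
    return {
        "both": sp & hg,
        "swissprot_only": sp - hg,
        "hgnc_only": hg - sp,
    }


def tally_locus_types(accessions, locus_type_by_accession):
    """Count the HGNC locus type recorded for each accession."""
    return Counter(
        locus_type_by_accession.get(acc, "unknown") for acc in accessions
    )


# Placeholder inputs for illustration only.
swissprot = ["ACC1", "ACC2", "ACC3", "ACC4"]
hgnc_coding = ["ACC1", "ACC2", "ACC5"]
locus_types = {"ACC3": "pseudogene", "ACC4": "RNA, long non-coding"}

segments = venn_segments(swissprot, hgnc_coding)
print(tally_locus_types(segments["swissprot_only"], locus_types))
```

Tallying the "swissprot_only" segment by locus type is one way to check which categories dominate, instead of inspecting the matches by hand.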