Question

Best practice for handling BioGRID PPI interactor aliases?

6.1 years ago

sam237337 ▴ 70

I need to use the PPI data from BioGRID, given by the OFFICIAL_SYMBOL_A and OFFICIAL_SYMBOL_B columns in the source files. In this project's most recent publication, it is reported that "BioGRID interaction annotation is based on Entrez Gene identifiers for genes and proteins."

Beyond the two columns of data cited above, each source file also contains two corresponding columns of aliases for each main interactor, which are described in the documentation as 'lists of common names' for these genes. However, I haven't been able to determine on which platform(s) these aliases are defined, and am wondering whether I should include these aliases in my PPI analyses, or whether using the single interactor names in Entrez format will be sufficient to effectively capture these interactions.

If anyone knows whether it is good practice to include all of these aliases for each interactor (and whether their platform is Entrez and/or other ontologies), or whether using the single Entrez name for each interactor will suffice, I would appreciate your insights.

BioGRID PPI aliases disambiguation • 1.7k views

I don't really understand what you're trying to do. BioGRID reports pairwise interactions of various types at the gene level, and genes are defined there by their Entrez Gene identifiers. If you're trying to find whether two genes interact in BioGRID, all you need is their Entrez Gene IDs. Aliases are simply names or other types of identifiers that BioGRID has mapped to Entrez Gene IDs. Whether you want to rely on them to identify interactions is up to you; it depends on what you call a gene. Some people consider all entities that share the same name to be one gene. I personally work with the Ensembl definition of a gene, roughly a locus that produces related transcripts/proteins, and in this case names can be misleading because they can be shared by multiple genes. What a gene is in Entrez has never been clear to me. It seems to be defined as whatever is annotated as a gene in RefSeq, but how this annotation is done is unclear to me.
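A minimal sketch of the ID-based lookup described above, assuming the interaction table has been read as tab-separated text. The column names `ENTREZ_A` and `ENTREZ_B` are placeholders (check the header of the actual download), and the IDs and symbols in the toy table are made up for illustration. Pairs are stored as unordered sets so that A-B and B-A count as the same interaction.

```python
import csv
import io

# Toy tab-separated table standing in for a BioGRID export.
# ENTREZ_A / ENTREZ_B are placeholder column names; all values are invented.
TOY_TABLE = (
    "ENTREZ_A\tENTREZ_B\tOFFICIAL_SYMBOL_A\tOFFICIAL_SYMBOL_B\n"
    "1001\t1002\tGENE1\tGENE2\n"
    "1002\t1003\tGENE2\tGENE3\n"
    "1003\t1001\tGENE3\tGENE1\n"
)


def load_interaction_pairs(handle):
    """Return the set of unordered Entrez ID pairs found in the table."""
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = set()
    for row in reader:
        # frozenset ignores order, so (A, B) and (B, A) collapse to one entry.
        pairs.add(frozenset((row["ENTREZ_A"], row["ENTREZ_B"])))
    return pairs


def interacts(pairs, id_a, id_b):
    """True if the two Entrez IDs are reported as an interacting pair."""
    return frozenset((id_a, id_b)) in pairs


pairs = load_interaction_pairs(io.StringIO(TOY_TABLE))
print(interacts(pairs, "1002", "1001"))
print(interacts(pairs, "1001", "1004"))
```

With this approach the symbol and alias columns are never consulted; they only matter if the starting point is a name instead of an ID.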

6.1 years ago by Jean-Karim Heriche 27k

Thanks for sharing your input, Jean-Karim. My plan is to report PPIs using Entrez gene symbols, and I was just wondering whether [some or all of] the aliases provided by BioGRID might also be defined within the Entrez platform, and/or whether those aliases might originate from other sources. You had advised, "Whether you want to rely on [BioGRID aliases] to identify interactions is up to you." I am hoping that those with experience in PPI analysis might advise whether it is standard practice when using the BioGRID data to (a) include such aliases, or (b) just rely on the "OFFICIAL_SYMBOL" data for each interaction. I appreciate your recommendation to use Ensembl definitions of genes, although I'm not sure if that is actionable in this case of needing to use BioGRID as my resource.

6.1 years ago by sam237337 ▴ 70

When I use BioGRID, I use the official gene symbol to identify the corresponding gene in Ensembl. For human, the official gene symbol is defined by the HGNC and so should match between resources if they used the same HGNC version. Again, I don't understand what you mean by "include the aliases": include where? For what? The way to work is to pick a reference genome/gene set and map all data to it. Mixing gene definitions from different resources is bound to create trouble: not all genes map one-to-one between resources or even exist in all resources. So if you work with Entrez genes as your reference gene set, then map everything to Entrez Gene IDs. If your reference gene set is the set of genes with HGNC symbols, then map all your data to HGNC gene symbols. If you do the latter, you're basically calling a gene all genomic loci that share the same symbol. For example, two genes in Ensembl could result from a duplication and be annotated with the same symbol, and you would then consider them as one gene. What makes sense depends on the context. For some bioinformatics work, it may make more sense to have a gene definition tied to a genomic locus, and at other times it may make more sense to have a name-based definition. My approach is to use Ensembl genes and summarize biological conclusions at the gene symbol level (i.e. considering duplications as one gene).
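The "pick a reference gene set and map everything to it" step could be sketched roughly as follows. The reference annotation here is a hypothetical list of (gene ID, symbol) pairs with invented values; in practice it would come from whichever resource is chosen as the reference. The point of the sketch is that symbols mapping to more than one reference gene, or to none, are reported separately instead of being silently merged or dropped.

```python
from collections import defaultdict

# Hypothetical reference annotation: (reference gene ID, symbol) pairs.
# REF0002 and REF0003 mimic a duplicated locus annotated with one symbol.
REFERENCE_GENES = [
    ("REF0001", "GENE1"),
    ("REF0002", "GENE2"),
    ("REF0003", "GENE2"),
]


def build_symbol_index(reference_genes):
    """Map each symbol to the set of reference gene IDs that carry it."""
    index = defaultdict(set)
    for gene_id, symbol in reference_genes:
        index[symbol].add(gene_id)
    return index


def map_symbols(symbols, index):
    """Split symbols into uniquely mapped, ambiguous, and unmapped."""
    unique, ambiguous, unmapped = {}, {}, []
    for symbol in symbols:
        ids = index.get(symbol, set())
        if len(ids) == 1:
            unique[symbol] = next(iter(ids))
        elif ids:
            ambiguous[symbol] = sorted(ids)
        else:
            unmapped.append(symbol)
    return unique, ambiguous, unmapped


index = build_symbol_index(REFERENCE_GENES)
unique, ambiguous, unmapped = map_symbols(["GENE1", "GENE2", "GENE9"], index)
print(unique)
print(ambiguous)
print(unmapped)
```

How the ambiguous and unmapped cases are then handled (collapsed to the symbol level, kept per locus, or excluded) is the context-dependent choice discussed above.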

6.1 years ago by Jean-Karim Heriche 27k

Thanks for your clarifications, Jean-Karim; that is informative, and I'll consider this advice when deciding how to use the BioGRID data.

6.1 years ago by sam237337 ▴ 70