MSigDB for Multiple Organisms in a Tidy Data Format
22 months ago
igor 7.7k
United States

There are a lot of R-based pathway analysis tools. There are also supporting data packages for the actual pathways from GO, KEGG, or Reactome. However, support within the R ecosystem for the Molecular Signatures Database (MSigDB) from GSEA is fairly limited. You have to import GMT files, re-structure the resulting objects, and potentially convert genes from human to other species. Each of these steps is relatively trivial, but they add up. As Hadley Wickham said: "you should consider writing a function whenever you've copied and pasted a block of code more than twice". Functions are easy to share, but datasets are trickier. So I made an R package that includes both: msigdbr.

With msigdbr, you can retrieve MSigDB gene sets:

  • in an R-friendly format (a "tidy" data frame with one gene per row that works well with the tidyverse packages)
  • as both gene symbols and Entrez Gene IDs
  • for multiple frequently studied organisms (not everyone works exclusively with human data, and it's easy to run into problems retrieving gene orthologs)
  • that can be used and shared in a single script (without requiring additional files or an active internet connection)
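
For comparison, the manual route (reading a GMT file and reshaping it to one gene per row) can be sketched roughly as follows. This is a language-agnostic illustration in Python, not msigdbr code: the function name `gmt_to_tidy` and the example lines are made up, and it only assumes the usual GMT layout of tab-separated fields (set name, description, then the genes).

```python
def gmt_to_tidy(lines):
    """Turn GMT lines into (gene_set, gene) rows, one gene per row."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and sets with no genes
        set_name, _description, *genes = fields
        rows.extend((set_name, gene) for gene in genes if gene)
    return rows


# Made-up example lines (hypothetical set and gene names, not MSigDB content)
example = [
    "SET_A\thttp://example.org/SET_A\tGENE1\tGENE2",
    "SET_B\thttp://example.org/SET_B\tGENE2\tGENE3\tGENE4",
]

for row in gmt_to_tidy(example):
    print(row)
```

Each gene set ends up spread over as many rows as it has genes, which is the "tidy" shape described above.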

There are a few other similar existing solutions, but I couldn't find any that addressed all of my pain points. I also just wanted to make an R package and this seemed like a good idea that was simple enough to start with. This probably doesn't need to be explicitly stated, but any feedback is welcome, which is why it's good to post here.


Hey Igor, great package!

I was hoping to use it for some fly analysis but had two quick questions about this if you have time; in the vignette you have:

'Can’t I just convert human genes to any organism myself? Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.'

Could you maybe clarify a little more how you are making these connections in your package?

I see at the bottom:

'Gene homologs are provided by HUGO Gene Nomenclature Committee at the European Bioinformatics Institute which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.'

So that suggests to me that you are taking the human genes and finding their most likely orthologs (in each given species) based on the overlap between multiple databases. What happens in cases where the closest homolog isn't that "great" of a match (even with overlaps across different DBs)? Is there any specific scoring threshold (AA similarity etc...) or is it just based on database annotations?

Just as an example looking at the major ping-pong piRNA Drosophila protein genes we have: AGO3 and aub

The homologs for humans from the package are:


aub: PIWIL1

The current literature seems to suggest that AGO3 is slightly more similar to PIWIL2. Obviously I wouldn't expect someone to go in and manually check each entry. Furthermore, for computational reproducibility I think it makes sense to keep the standard association methods universal. So please don't take this as criticism; I'm just wondering how this would affect analysis.

Also how has this package been working for you? Do researchers/collaborators seem to agree with the results in general?

EDIT: Also, a quick suggestion. It might be nice to include the human-readable "brief description" for each of the annotation sets, maybe as an extra $description column in the m_df object.


CSR_EARLY_UP.V1_UP == Genes up-regulated in early serum response of CRL 2091 cells (foreskin fibroblasts).

It was pretty easy for me to do myself by downloading the card from MSigDB, but I thought it might be good to include it already in an "all-in-one" package such as this one, especially if less "computationally experienced" people want to use it.


Thanks for trying the package!

I only keep the homologs that are listed in multiple databases and keep the one that is confirmed by most (some genes can return dozens of homologs across different resources). I don't expect every gene to be correct, but using gene sets should hopefully balance that out.
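
A rough sketch of that selection rule, written in Python for illustration rather than taken from the msigdbr code: count how many databases support each human-to-ortholog pair, drop pairs seen in only one database, and keep the best-supported ortholog per human gene. The function name `pick_best_orthologs`, the `min_support` parameter, the tie handling (first pair seen wins), and all gene and database names below are assumptions.

```python
from collections import defaultdict


def pick_best_orthologs(assertions, min_support=2):
    """Pick one ortholog per human gene by database support.

    assertions: iterable of (human_gene, ortholog, database) tuples.
    Returns {human_gene: (ortholog, number_of_supporting_databases)}.
    """
    # Collect the distinct databases that assert each pair
    support = defaultdict(set)
    for human_gene, ortholog, database in assertions:
        support[(human_gene, ortholog)].add(database)

    best = {}
    for (human_gene, ortholog), databases in support.items():
        n = len(databases)
        if n < min_support:
            continue  # not confirmed by multiple databases
        current = best.get(human_gene)
        # Strictly greater: on a tie, the first pair seen is kept (assumption)
        if current is None or n > current[1]:
            best[human_gene] = (ortholog, n)
    return best


# Hypothetical assertions; none of these names are real genes or databases
assertions = [
    ("HGENE_A", "ortho_1", "db1"),
    ("HGENE_A", "ortho_1", "db2"),
    ("HGENE_A", "ortho_1", "db3"),
    ("HGENE_A", "ortho_2", "db1"),
    ("HGENE_A", "ortho_2", "db4"),
    ("HGENE_B", "ortho_3", "db1"),
]

print(pick_best_orthologs(assertions))
```

Note that nothing here looks at sequence similarity. The choice is driven purely by how many resources agree, so a weak but widely annotated match can still be selected.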

I work mostly with mammalian genomes, so the homologs are usually very obvious and the pathways should be similar. I can't really comment on more distant species. For some context, some of the MSigDB gene sets are based on zebrafish data, so cross-species comparisons are acceptable to some degree.

