Biostar Beta. Not for public use.
Question: R's GO.db and Python's go_db gives different ancestors, which is correct?
2
Entering edit mode

This is a long post; I am trying to find out why R's GO.db gives different results than Python's go_db https://github.com/endrebak/go_db.

go_db seems to give different results than the R package GO.db.

go_db computes ancestors by looking upwards in the tree; so that C is an ancestor of A if

A is a C, A is part of C or C has part A or any such relation exists transitively.

Since go_db is a dinky script I threw together quickly and GO.db is the seemingly canonical Gene Ontology package I am guessing I am in the wrong, but if anyone can explain why, I would appreciate it. It is probably my usage of GO.db that is incorrect, I'd wager.

To run the scripts below you need to install wide_diapeR and go_db with

pip install -U go_db wide_diaper

and have the R package GO.db installed

‚Äč
from __future__ import print_function
from widediaper import R
from go_db import get_offspring

r = R()
r.load_library("GO.db")
r("cc_offspring = as.data.frame(GOCCOFFSPRING)")
r_cc_offspring = r.get("cc_offspring")
r_cc_offspring.columns = ["Offspring", "Parent"]

py_cc_offspring = get_offspring("CC")

py_go_0001533 = py_cc_offspring.loc[py_cc_offspring.Offspring == "GO:0001533"]
r_go_0001533 = r_cc_offspring.loc[r_cc_offspring.Offspring == "GO:0001533"]

# Parents in GO.db, but not go_db
set(r_go_0001533.Parent) - set(py_go_0001533.Parent)
# {'GO:0005622',
#  'GO:0005856',
#  'GO:0043226',
#  'GO:0043228',
#  'GO:0043229',
#  'GO:0043232',
#  'GO:0044424'}

# Parents in go_db, but not GO.db
set(py_go_0001533.Parent) - set(r_go_0001533.Parent)
# {'GO:0005886',
#  'GO:0012505',
#  'GO:0016020',
#  'GO:0030313',
#  'GO:0031975',
#  'GO:0071944',
#  'GO:0097653'}

As you can see above, there are great differences in what nodes the two packages consider to be ancestors of "GO:0001533".

Let us look at the go.obo file and see if we can make sense of this. The script below prints out all the ancestors of a node (including the node itself):

# download obo file from http://purl.obolibrary.org/obo/go.obo
obo_file = '/Users/labsenter/.go_db/go.obo' # Use the path to whereever you are storing your .obo file
obo_records = "".join(open(obo_file).readlines()).split("\n\n[Term]")[1:-1]

py_ancestors = py_cc_offspring.loc[py_cc_offspring.Offspring == "GO:0001533"].Parent
py_ancestors.append("GO:0001533")
py_ancestor_search_terms = ["id: " + ancestor for ancestor in py_ancestors]


r_ancestors = list(py_cc_offspring.loc[py_cc_offspring.Offspring == "GO:0001533"].Parent)
r_ancestors.append("GO:0001533")
r_ancestor_search_terms = ["id: " + ancestor for ancestor in r_ancestors]

def _print_records(search_terms):
    for record in obo_records:
        for search_term in search_terms:
            if search_term in record:
                for line in record.split("\n"):
                    for term in ["id:", "is_a:", "relationship:", "intersection_of:"]:
                        if term in line:
                            print(line)
                print()

print("### Py ancestors\n")
_print_records(py_ancestor_search_terms)

print("### R ancestors\n")
_print_records(r_ancestor_search_terms)

The script above gives the following results:

### Py ancestors

id: GO:0001533
is_a: GO:0005886 ! plasma membrane

id: GO:0005575
alt_id: GO:0008372

id: GO:0005623
is_a: GO:0005575 ! cellular_component

id: GO:0005886
alt_id: GO:0005904
is_a: GO:0016020 ! membrane
is_a: GO:0044464 ! cell part
relationship: part_of GO:0071944 ! cell periphery

id: GO:0012505
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005773 ! vacuole
relationship: has_part GO:0005886 ! plasma membrane

id: GO:0016020
is_a: GO:0005575 ! cellular_component

id: GO:0030313
is_a: GO:0031975 ! envelope
relationship: has_part GO:0005886 ! plasma membrane

id: GO:0031975
is_a: GO:0044464 ! cell part

id: GO:0044464
is_a: GO:0005575 ! cellular_component
intersection_of: GO:0005575 ! cellular_component
intersection_of: part_of GO:0005623 ! cell
relationship: part_of GO:0005623 ! cell

id: GO:0071944
is_a: GO:0044464 ! cell part

id: GO:0097653
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005622 ! intracellular
relationship: has_part GO:0005886 ! plasma membrane

### R ancestors

id: GO:0001533
is_a: GO:0005886 ! plasma membrane

id: GO:0005575
alt_id: GO:0008372

id: GO:0005623
is_a: GO:0005575 ! cellular_component

id: GO:0005886
alt_id: GO:0005904
is_a: GO:0016020 ! membrane
is_a: GO:0044464 ! cell part
relationship: part_of GO:0071944 ! cell periphery

id: GO:0012505
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005773 ! vacuole
relationship: has_part GO:0005886 ! plasma membrane

id: GO:0016020
is_a: GO:0005575 ! cellular_component

id: GO:0030313
is_a: GO:0031975 ! envelope
relationship: has_part GO:0005886 ! plasma membrane

id: GO:0031975
is_a: GO:0044464 ! cell part

id: GO:0044464
is_a: GO:0005575 ! cellular_component
intersection_of: GO:0005575 ! cellular_component
intersection_of: part_of GO:0005623 ! cell
relationship: part_of GO:0005623 ! cell

id: GO:0071944
is_a: GO:0044464 ! cell part

id: GO:0097653
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005622 ! intracellular
relationship: has_part GO:0005886 ! plasma membrane

Since you can see that GO:0001533 is a GO:0005886 which again is a GO:0016020, GO:0016020should be considered an ancestor of GO:0001533. Unfortunately, that is not the case in GO.db. If anyone knows why, please explain!

ADD COMMENTlink 4.5 years ago Endre Bakken Stovner • 880 • updated 2.0 years ago Biostar 20
1
Entering edit mode

It seems like GO.db only looks at the relationships "is_a" and "part_of" to compute ancestors.

> library(GO.db)
> cc_offspring = as.data.frame(GOCCOFFSPRING)
> colnames(cc_offspring) = c("Offspring", "Parent")
> cc_offspring[cc_offspring["Offspring"] == "GO:0001533",]
       Offspring     Parent
1     GO:0001533 GO:0005886
3404  GO:0001533 GO:0005575
7659  GO:0001533 GO:0016020
9800  GO:0001533 GO:0005623
45693 GO:0001533 GO:0044464
49577 GO:0001533 GO:0071944

If you only use these in go_db, you get the same result:

> from go_db import get_offspring
> cco = get_offspring("CC", ["is_a", "part_of"])
> cco.loc[cco.Offspring == "GO:0001533"]
        Offspring      Parent
722    GO:0001533  GO:0005886
5913   GO:0001533  GO:0016020
6019   GO:0001533  GO:0044464
6136   GO:0001533  GO:0071944
29758  GO:0001533  GO:0005575
30078  GO:0001533  GO:0005623

Will update the docs.

ADD COMMENTlink 4.5 years ago Endre Bakken Stovner • 880

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0