Biostar Beta. Not for public use.
Question: problem with KEGG pathway gene extraction
3
Entering edit mode

Hi,

I have been struggling to extract the set of genes related to each pathway from KEGG. I read multiple topics on biostar on how to do this but apparently there is something wrong. I downloaded the bret hierarchy file in order to get genes on the pathway level that I want. However, there seem to be multiple methods which some of them are working and the others are not.

In this topic, (https://www.biostars.org/p/69336/) a nice method has been mentioned by Pierre on how to do this using the following url: http://togows.dbcls.jp/entry/pathway/hsa05200/genes.json. However, when I try this for other pathways, the json list is emtpy (e.g., http://togows.dbcls.jp/entry/pathway/hsa01200/genes.json).

Another method which I found in this answer (https://www.biostars.org/p/71737/), was using the REST API. However, also for some of the pathways, there are no genes (e.g., http://rest.kegg.jp/get/hsa01200).

Surprisingly, if I take the same pathway, and I go to: http://www.genome.jp/dbget-bin/get_linkdb?-t+genes+path:hsa01200 I'm able to find the list of genes that I'm looking for.

Can someone tell me what is the difference between these 3 methods and what is the recommended method to do this.

ADD COMMENTlink 4.3 years ago Abdullah • 100 • updated 4.3 years ago Kamil ♦ 1.9k
Entering edit mode
1

The first method only has 401 lines for hsa05200, the latter method has close to 2000. My library uses the latter method. See https://github.com/endrebak/kg

My guess is that the first method is from the outdated kegg (it went unfunded for a while and the public data was not updated AFAIK), and the second is from the second run of KEGG, 2014-2016.

ADD REPLYlink 4.3 years ago
Endre Bakken Stovner
• 880
3
Entering edit mode

I'm able to get a list of genes by using the kg tool as shown below. Perhaps this is enough to meet your needs. I would also like to be able to specify the type of gene identifier, such as Entrez ID or Ensembl ID (link to issue), but this has not been implemented yet as far as I can tell.

Command:

kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo 01200)

Output: (trimmed)

Cache path is: /Users/slowikow/.kegg/ (Time:  Thu, 29 Oct 2015 10:10:24 )
Get KEGG gene to external gene map. (Time:  Thu, 29 Oct 2015 10:10:24 )
Get KEGG path to gene map. (Time:  Thu, 29 Oct 2015 10:10:25 )
Get KEGG pathway to definition map. (Time:  Thu, 29 Oct 2015 10:10:25 )
Connect KEGG gene map and KEGG pathway map. (Time:  Thu, 29 Oct 2015 10:10:26 )
/Users/slowikow/anaconda/lib/python2.7/site-packages/ebs/merge_cols.py:74: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  in_df = in_df.sort(pre_merge_sort_order_col)
Attaching pathway definitions. (Time:  Thu, 29 Oct 2015 10:10:26 )
01200   ME3     Carbon metabolism
01200   PGAMA   Carbon metabolism
01200   PGAM-B  Carbon metabolism
01200   HEL-S-35        Carbon metabolism
01200   PGAM1   Carbon metabolism
01200   PFKF    Carbon metabolism
ADD COMMENTlink 4.3 years ago Kamil ♦ 1.9k
Entering edit mode
1
kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo 01200) | biomartian -d hsap -i external_gene_name -o entrezgene --noheader -c 1 -

0    1    2    entrezgene
01200    ME3    Carbon metabolism    10873
01200    PGAMA    Carbon metabolism
01200    PGAM-B    Carbon metabolism
01200    HEL-S-35    Carbon metabolism
01200    PGAM1    Carbon metabolism    5223
01200    PFKF    Carbon metabolism
01200    PFK-P    Carbon metabolism

As you can see, not all external_gene_names have associated entrezgene ids.

ADD REPLYlink 4.3 years ago
Endre Bakken Stovner
• 880
Entering edit mode
0

Unfortunately, the truth is more complicated than one might expect. Indeed, biomartian does not retrieve the Entrez IDs for all genes, but the genes do in fact have associated Entrez IDs.

PGAMA is 5223, but it is also known as PGAM1; PGAM-B; HEL-S-35
PGAM-B is 441531, but is also known as PGAM4; PGAM1; PGAM3; dJ1000K24.1
HEL-S-35 is probably 5223
PFKF is 5214, but is also known as PFKP; PFK-C; PFK-P; ATP-PFK
PFK-P is probably 5214

I prefer to use identifiers like Entrez or Ensembl for data analysis to avoid the ambiguity of gene names. On the other hand, I use HGNC names for presentation. Endre, you might like to visit MyGene.info if you have not already. I think the developers have done an admirable job of dealing with name ambiguity.

ADD REPLYlink 4.3 years ago
Kamil
♦ 1.9k
Entering edit mode
0

Thanks for the info, I should switch to a different gene name format by default then. I think KEGGREST offers them, but the service is devoid of documentation.

ADD REPLYlink 4.3 years ago
Endre Bakken Stovner
• 880
Entering edit mode
0

Thank you. However, I'm not able to run this command from within python, probably it is a silly error but i'm not able to find what is it.

I'm using:

pathwayID = 01200

cmd= **" /usr/packages/kg-master/bin/kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo {})"**.format(pathwayID)

tmp = os.popen(cmd).read()

and I'm getting this error:

sh: -c: line 0: syntax error near unexpected token `('

sh: -c: line 0: ` **/usr/packages/kg-master/bin/kg** --mergecol=0 --noheader --genes --definition --species=hsa  <(echo 05200)'
ADD REPLYlink 4.3 years ago
Abdullah
• 100
Entering edit mode
0

Why do you want to run it from within Python? Anyways, check out the kg/lib.py methods - I'll write some docs for them eventually.

ADD REPLYlink 4.3 years ago
Endre Bakken Stovner
• 880
Entering edit mode
0

thanks for your suggestion, I was looking for a fast way to run it. I managed to do this using:

cmd= **" /usr/packages/kg-master/bin/kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo '{}')"**.format(pathwayID)

process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,stderr=subprocess.PIPE, executable= **" /bin/bash"**)
out, err = process.communicate()
ADD REPLYlink 4.3 years ago
Abdullah
• 100
Entering edit mode
0

I have updated kg and added an example of how to use it from within python: https://github.com/endrebak/kg

pip install kg==0.0.7

ADD REPLYlink 4.3 years ago
Endre Bakken Stovner
• 880
Entering edit mode
0

Thanks. When I run the API example, I get this error:

>>> df = get_kegg_data("rno", pathway_definitions)

>>> Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 483, in __call__

    return self._cached_call(args, kwargs)[0]

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 430, in _cached_call

    out, metadata = self.call(*args, **kwargs)

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 675, in call

    output = self.func(*args, **kwargs)

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kegg/lib.py", line 32, in get_kegg_data

    "kegg_pathway")

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ebs/merge_cols.py", line 35, in attach_data

    merged_df = _post_merge_cleanup(merged_df, new_col_name)

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ebs/merge_cols.py", line 74, in _post_merge_cleanup

    in_df = in_df.sort_values_by(pre_merge_sort_order_col)

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 2246, in __getattr__

    (type(self).__name__, name))

AttributeError: 'DataFrame' object has no attribute 'sort_values_by'
ADD REPLYlink 4.3 years ago
Abdullah
• 100
Entering edit mode
0

I'll fix it later today, thanks.

ADD REPLYlink 4.3 years ago
Endre Bakken Stovner
• 880
Entering edit mode
0

Now it should work. pip install kg==0.0.8

ADD REPLYlink 4.3 years ago
Endre Bakken Stovner
• 880
Entering edit mode
0

Thanks. The API example is working now. However, the command line tool still throw this error. It seems that the git is not updated?

ADD REPLYlink 4.3 years ago
Abdullah
• 100
Entering edit mode
0

That is something on your end. The CLI and API uses the same code. It probably stems from you having multiple versions of Python installed and different versions of the library installed for each. If you encounter more problems please tell me at https://github.com/endrebak/kg/issues

ADD REPLYlink 4.3 years ago
Endre Bakken Stovner
• 880
Entering edit mode
0

The current version in git seems to be working now. Thanks ;)

ADD REPLYlink 4.3 years ago
Abdullah
• 100

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0