Question

Tips to build a protein network

4

7.1 years ago

benoahb ▴ 40

Ok, so I have been lurking for days to find a way to analyse my data, often ending up here, but I haven't found a satisfying answer yet. Many of the answers that pop up are 5+ years old. I am still part of the newbie gang, so here I come with a bunch of questions.

I have a few sets of ~100 proteins each, identified by MS as up- or down-regulated under a few conditions. I am looking for an easy workflow to obtain a scientifically accurate network: stringent enough to stay away from false positives, but not TOO stringent, so that I can still get some information out of my diverse network. And a pretty figure at the end to show off with :) I have UniProt IDs, a fold change and a p-value. I would prefer free apps (OS X) or web-based tools.

My idea is to end up in Cytoscape to finalise some neat protein networks.

Based on direct physical interactions > common pathway/function > genetic interactions (which seem to dilute the essential information?). Experimental rather than predicted.
With a color code for up/down-regulated, scaled according to my fold change.
Maybe a shape code for molecular function/protein class.
A node size scaled to the number of edges (already mastering that one).
With a clear identification of functional groups (biological process/pathway).
And several types of connections according to the source.
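
As an illustration of the colour code, here is a minimal Python sketch of mapping a fold change onto a blue-white-red scale. In Cytoscape itself this would normally be done with a continuous mapping in the Style panel rather than in code; the function name and the `max_abs_log2` cap of 3 are assumptions chosen for the example, not values from any tool.

```python
import math

def fold_change_to_hex(fold_change, max_abs_log2=3.0):
    """Map a fold change (a ratio > 0) to a blue-white-red hex colour."""
    log2_fc = math.log2(fold_change)
    # Clamp to [-1, 1] so extreme values do not wash out the rest of the scale.
    t = max(-1.0, min(1.0, log2_fc / max_abs_log2))
    fade = round(255 * (1 - abs(t)))
    if t >= 0:
        r, g, b = 255, fade, fade   # up-regulated: white towards red
    else:
        r, g, b = fade, fade, 255   # down-regulated: white towards blue
    return f"#{r:02x}{g:02x}{b:02x}"

print(fold_change_to_hex(1.0))    # unchanged protein
print(fold_change_to_hex(8.0))    # strongly up
print(fold_change_to_hex(0.125))  # strongly down
```

Working on the log2 scale keeps a 2-fold increase and a 2-fold decrease at the same distance from white, which a linear scale on the raw ratio would not.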

To give you an example of what I have in mind: http://www.nature.com/nbt/journal/v30/n10/full/nbt.2356.html (They use MetaCore which I don't have access to...)

First of all, UniProt IDs... Somehow there are always some proteins left behind because their IDs are not recognised. I have checked and corrected them where necessary, and my (human) IDs are up to date on UniProt.org. How come? Do I just have to live with that?

Ingenuity, iPathway, Reactome don't seem to answer my needs.

Best tool to build my network:

STRING: I was quite happy with it at first, but it smells like a trap. How do I avoid false positives? Which sources? Minimum score (0.7?), max number of interactors to show (1st, 2nd shell?)
GeneMANIA is quite seductive, all pretty and simple. I can actually select the kind of sources I am looking for... but I get way more hits than with STRING. Is it lacking a score threshold? I have no control over it.
ConsensusPathDB: from what I've read, it seems to be THE ONE, but it crashes on my computer with over 9000 listed interactions, and from what I've seen it's pretty basic?
Cytoscape apps: STRING, GeneMANIA, Reactome... none seems to work as well as the web-based tools.

Cytoscape:

How do I import external data (fold change, GO annotations, ...) into a network coming from STRING or GeneMANIA, let's say? I have one file for the network and one file with FC values etc.
I've read about GOlorize, BiNGO, iRegulon, ClusterViz, EnrichmentMap... where to go?

I don't feel like I am asking for the moon. But we're in 2017 and I can't be the first one with this kind of request. How come it is so difficult, diluted among so many options, with no easy way to go? I don't mind "getting my hands dirty" but it seems endless here...

So, any solid workflow to follow?

proteomic network workflow • 4.5k views

ADD COMMENT • link updated 4.7 years ago by Biostar 20 • written 7.1 years ago by benoahb ▴ 40

0

First, regarding the ID problem, you should be aware that not all resources are synchronized or even use the same reference genome annotations. Different resources annotate the genome differently, and annotations change over time. So the first thing to do is decide which reference genome you're going to use and stick with it. Any gene/protein not in that reference doesn't exist for your purpose, i.e. if an ID doesn't map to something in the reference you can ignore it. The second step is to understand the data in the different databases and how it has been derived. Then you need to figure out what it is you want to do/show in relation to the biology you're interested in. Should the edges of the graph represent documented physical interactions or, more generically, functional relationships? Finally, if you want to use Cytoscape, start by reading the documentation. It will give you ideas on how to import data.
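
To make the "stick with one reference" step concrete, here is a minimal Python sketch that splits a query list into IDs found in a chosen reference set and IDs that are not. The function name is hypothetical, the accessions are only illustrative, and reducing isoform accessions (such as `P04637-2`) to the canonical entry is an assumption about how you want isoforms handled.

```python
def split_by_reference(query_ids, reference_ids):
    """Return (mapped, unmapped) query IDs relative to a reference set."""
    reference = {acc.strip().upper() for acc in reference_ids}
    mapped, unmapped = [], []
    for raw in query_ids:
        acc = raw.strip().upper()
        # Assumption: isoform suffixes (-2, -3, ...) map to the canonical entry.
        canonical = acc.split("-")[0]
        (mapped if canonical in reference else unmapped).append(raw)
    return mapped, unmapped

# Illustrative accessions only; a real reference would be loaded from a file.
reference = ["P04637", "P38398", "Q9Y6K9"]
query = ["P04637-2", "p38398 ", "A0A024R161"]

mapped, unmapped = split_by_reference(query, reference)
print("mapped:", mapped)
print("unmapped:", unmapped)
```

Running the same check against each resource's own ID list shows which proteins each tool silently drops, so the losses can at least be reported.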

ADD REPLY • link 7.1 years ago by Jean-Karim Heriche 27k

0

Thank you for your answer!

First: that's what I was afraid of. I get many more "misses" with GeneMANIA than with STRING. So one point for STRING.

Second: your comment is quite pertinent, as the sources of STRING are not as clear to me as GeneMANIA's. One point for GeneMANIA. I would like 2 types of edges: bold and shiny for physical interactions, thin and faded for functional relationships.

Finally, I'm reading and learning... there's such a huge amount of documentation... but got it.

ADD REPLY • link 7.1 years ago by benoahb ▴ 40

0

Regarding STRING, I would pick the following parameters:

Sources: text-mining, experiments, databases, co-expression
Minimum score: 0.7
1st shell: query proteins only
2nd shell: 5-10 proteins... and kick out some if similar proteins sit in the same close network (e.g. polymerase subunits)
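
The 0.7 cutoff and the source selection can also be applied after export, which makes it easy to compare settings. Below is a minimal Python sketch that filters a tab-separated edge table. The column names (`node1`, `node2`, `experiments`, `databases`, `textmining`, `combined_score`) and the 0-1 score scale are assumptions for the example; check the header of your own export, as it may differ.

```python
import csv
import io

def filter_edges(tsv_text, min_score=0.7, channels=("experiments", "databases")):
    """Keep edges that pass the score cutoff and have non-zero evidence
    in at least one of the chosen channels."""
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if float(row["combined_score"]) < min_score:
            continue
        if not any(float(row[ch]) > 0 for ch in channels):
            continue
        kept.append((row["node1"], row["node2"], float(row["combined_score"])))
    return kept

# Made-up rows, only to show the behaviour of the two filters.
EXAMPLE = (
    "node1\tnode2\texperiments\tdatabases\ttextmining\tcombined_score\n"
    "A\tB\t0.80\t0.00\t0.30\t0.85\n"   # kept: experimental evidence, high score
    "A\tC\t0.00\t0.00\t0.90\t0.90\n"   # dropped: text-mining only
    "B\tC\t0.40\t0.50\t0.10\t0.65\n"   # dropped: below the cutoff
)
print(filter_edges(EXAMPLE))
```

Restricting `channels` to experiments and databases is one way to lean towards "experimental rather than predicted"; adding text-mining back is a one-word change if the network becomes too sparse.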

Any feedback? What do you think about the clustering functions (k-means, MCL)?

ADD REPLY • link 7.1 years ago by benoahb ▴ 40

0

Please do not create answers to reply to comments; use the 'add reply' button for this. This keeps the discussion organized.

ADD REPLY • link 7.1 years ago by Jean-Karim Heriche 27k

0

Depending on what you want the edges of the graph to mean and on what you want to do with the graph, you should select different things. Also, the functionally relevant information is concentrated in different data types for different organisms. For example, for human, most of the information on biological processes is in the physical protein interaction graph, whereas for some model organisms it is in the genetic interactions. In my experience, co-expression data is useless for inferring gene function. The clustering algorithm to choose also depends on your data representation, e.g. standard k-means operates on vectors, not on graphs. Given some query proteins, extracting a relevant subgraph from the whole graph is not an easy problem, but I have found that the limited k-walk approach works well in many cases (I have made it available in my Graph.pm Perl module).
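
To illustrate the difference in data representation, here is a minimal Python sketch of a clustering method that works directly on a graph: a simplified Markov clustering (MCL), the other option raised in the question. This is not the limited k-walk approach and not the reference MCL implementation; it handles only small, unweighted, undirected graphs, and the inflation value of 2.0 and the fixed iteration count are assumptions.

```python
def mcl(edges, inflation=2.0, iterations=50):
    """Simplified Markov clustering on an undirected, unweighted graph."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)

    # Adjacency matrix with self-loops added.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        m[index[a]][index[b]] = 1.0
        m[index[b]][index[a]] = 1.0

    def normalize_columns(mat):
        # Make every column sum to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (longer random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize the columns.
        m = normalize_columns([[v ** inflation for v in row] for row in m])

    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)

# Toy graph: two separate triangles.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F")]
print(mcl(edges))
```

The input here is an edge list, whereas k-means would first need each protein turned into a feature vector. For real networks, an established MCL implementation is a safer choice than this sketch.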

ADD REPLY • link 7.1 years ago by Jean-Karim Heriche 27k

0

If you decide to use STRING for this purpose, which should depend on what kind of network you are interested in, take a look at the new STRING app for Cytoscape. It makes importing a STRING network for a proteomics dataset into Cytoscape much less painful.

ADD REPLY • link 7.1 years ago by Lars Juhl Jensen 11k

0

I have tried the app but it doesn't work as easily as the web app for me. Also, having a file before I import the network in Cytoscape allows me to add some parameters!

ADD REPLY • link 7.1 years ago by benoahb ▴ 40

1

There is always a tradeoff between "easy" and "features". Sure, the web interface of STRING is the easiest way to access STRING. However, part of making it so easy is to leave out features. So if you need more features, e.g. the ability to map your own data onto a network, you have to use an interface that is not quite as easy. I'm not sure what you mean by "add some parameters" - if it is external parameters about the proteins, the normal workflow would be to use the "import table" functionality of Cytoscape.
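
As a rough illustration of preparing a file for that table import, here is a minimal Python sketch that builds a tab-separated node table keyed by protein ID. The column names (`id`, `log2_fc`, `p_value`, `measured`), the function name and the example values are assumptions; the key column must match whatever identifier column your network uses in Cytoscape.

```python
import csv
import io
import math

def build_node_table(network_nodes, measurements):
    """Return TSV text with one row per network node.

    measurements maps an ID to (fold_change, p_value). Nodes without a
    measurement (e.g. interactors added by the database) get empty cells.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "log2_fc", "p_value", "measured"])
    for node in sorted(network_nodes):
        if node in measurements:
            fc, p = measurements[node]
            writer.writerow([node, f"{math.log2(fc):.3f}", p, "yes"])
        else:
            writer.writerow([node, "", "", "no"])
    return out.getvalue()

# Illustrative IDs and values only.
table = build_node_table(
    {"P04637", "P38398", "Q00987"},
    {"P04637": (4.0, 0.001), "P38398": (0.5, 0.03)},
)
print(table)
```

The `measured` column makes it possible to style query proteins and added interactors differently once the table is attached to the nodes.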

ADD REPLY • link 7.1 years ago by Lars Juhl Jensen 11k

0

I don't know if someone already recommended this, but I have the same problem. I can't even properly install the programs I want... but there's a pretty awesome and easy way to make an SSN (sequence similarity network):

The "EFI - Enzyme Similarity Tool" http://efi.igb.illinois.edu/efi-est/

This is a web tool that allows you to either upload your own FASTA sequences or retrieve them by UniProt or NCBI IDs, and it offers a bunch of other options for your edge and node attributes. You can even make a Genome Neighbourhood Network, for enzyme pathways or whatever. You can then adjust your edge thresholds to your liking, along with some other options to make your network look good. Hope this helps someone!

ADD REPLY • link 6.2 years ago by andplaia • 0

score 1 · Answer 1 · 2017-04-07

You can find a rather outdated list and comparison of tools for interaction data integration & analysis in Table 2 of this article, which presents BIANA, a tool for biological interaction and network analysis. (Disclaimer: I contributed to the development of BIANA.)

There is a web server version of BIANA, where you can query using UniProt IDs. You can specify which types of interactions to include while building your network, for example only physical protein interactions produced by certain experimental techniques.

Omictools also has several relevant categories (see ppis and ppi-prediction) where you have a more up-to-date list of available tools.

Hope this helps.

score 1 · Answer 2 · 2017-04-07

This is a great question, and one that I have personally tackled recently.

Bear in mind that you are limited by the knowledge and reliability of the databases you are getting your interaction data from. The great thing about STRING is that you can easily adjust both the stringency and the source of the interactions (not to mention that it is curated by some pretty awesome groups in the STRING consortium - if they don't know what they are doing... then god help the rest of us).

I've done something similar, identifying motifs and a regulatory network in a tissue we are studying, beginning with RNA-seq data.

Pipeline:

1) In-house tissue specificity analysis to get a tissue-specific gene set
2) Input the gene set to STRING
3) Import the STRING results into iRegulon for motif analysis and network construction

Alternatives:

1) You can choose to use Cytoscape/iRegulon locally on your computer, or use i-cisTarget, the online tool from the same authors.

2) If your need is high-throughput (i.e. you are creating a pipeline and so can't use any GUIs), then use RcisTarget, again from the same authors. It is the same approach implemented as an R package that can be added to your pipeline.

However, the online version, i-cisTarget, might meet your needs if this is a one-off analysis.