What is the control level of activation in Paradigm?
9.1 years ago
fbrundu ▴ 350

I am trying to use Paradigm for pathway analysis on mRNA expression data. In

Sample-Specific Cancer Pathway Analysis Using PARADIGM

it is written that

Each entity can take on one of three states corresponding to activated, nominal, or deactivated relative to a control level (e.g. as measured in normal tissue) and encoded as 1, 0, or -1 respectively. The states may be interpreted differently depending on the type of entity (e.g. gene, protein, etc). For example, an activated mRNA entity represents overexpression, while an activated genomic copy entity represents more than two copies are present in the genome

What is the control level here? The median of each variable, or the median over all variables?

Moreover, in the online help it is written that

Currently underlying PARADIGM is a discrete, three-state model. That means that each value in your evidence file will be converted internally to one of three states: down, no-change, or up. The conversion happens when the evidence file is loaded, and the discretization bounds are set using the "disc" parameter in the config form. These discretization bounds refer to the percentages of data that will fall within each bin, so the default option sets the lowest 33% of the data to "down" and the highest 33% of the data to "up".

I am familiar with the discrete, three-state model presented. However, I cannot tell whether the discretization is computed along each variable separately or over all the data at once. That is, is each variable's dynamic range discretized into three bins (down, no-change, up), or does the discretization split the overall dynamic range into three bins (so that one variable's values could all fall into a single bin)? The sketch below illustrates the two readings.
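To make the two readings concrete, here is a minimal numpy sketch (my own illustration, not PARADIGM code), assuming the default 33/33/33 bins:

```python
import numpy as np

def discretize_per_gene(expr):
    """Bin each gene (row) into -1/0/1 using that gene's own terciles."""
    states = np.zeros(expr.shape, dtype=int)
    for i, row in enumerate(expr):
        lo, hi = np.percentile(row, [33.3, 66.7])
        states[i] = np.where(row <= lo, -1, np.where(row >= hi, 1, 0))
    return states

def discretize_global(expr):
    """Bin every value into -1/0/1 using terciles of the whole matrix."""
    lo, hi = np.percentile(expr, [33.3, 66.7])
    return np.where(expr <= lo, -1, np.where(expr >= hi, 1, 0))

# A uniformly high gene and a uniformly low gene show the difference:
# per-gene, each row gets a mix of states; globally, the high gene is
# mostly "up" and the low gene mostly "down".
expr = np.array([[9.0, 9.5, 10.0, 10.5],
                 [1.0, 2.0, 3.0, 4.0]])
print(discretize_per_gene(expr))
print(discretize_global(expr))
```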

Thanks

paradigm gene-expression pathway mrna
8.9 years ago
ethan.kaufman ▴ 380

I also want to try out Paradigm but am confused by this same issue. You posted your question 11 weeks ago; are you any closer to deciphering how Paradigm's discretization works? Here are a few quotes from the original paper (Vaske et al. 2010) that might shed light on this issue:

These variables represent the differential state of each entity in comparison with a 'control' or normal level rather than the direct concentrations of the molecular entities. This representation allows us to model many high-throughput datasets, such as gene expression detected with DNA microarrays, that often either directly measure the differential state of a gene or convert direct measurements to measurements relative to matched controls.

This paragraph suggests to me that Paradigm assumes the input expression matrix has already been processed to represent a differential state (e.g. a fold-change or a P-value). If so, I think that is a real weakness of the method, since suitable controls (matched or not) do not exist for patient gene expression data in cancer. The closest substitute is probably what cBioPortal does for TCGA expression data: report a Z-score for each gene-sample pair by standardizing each gene to its own mean across samples, which I think is the best that can be done.
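For concreteness, the per-gene standardization I mean looks like this (a sketch in the spirit of cBioPortal's expression Z-scores, not their actual pipeline):

```python
import numpy as np

def gene_zscores(expr):
    """expr: genes x samples. Standardize each gene against its own
    mean and standard deviation across samples, so values express
    deviation from that gene's typical level, not absolute abundance."""
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, ddof=1, keepdims=True)
    return (expr - mu) / sd
```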

There are multiple allusions in the original paper and elsewhere that support the notion that Paradigm is discretizing over all the variables. In the paper they analyze a breast cancer microarray dataset (Naderi et al.) for which no matched normal expression is available, and report the following normalization approach in the methods:

All data were non-parametrically normalized using a ranking procedure including all sample-probe values and each gene-sample pair was given a signed P-value based on the rank. A maximal P-value of 0.05 was used to determine gene-samples pairs that were significantly altered.

From the above, my guess is that the P-values are used to discretize the data into activated (0 < signed P ≤ 0.05), deactivated (−0.05 ≤ signed P < 0), and unchanged (otherwise); see the sketch below. Of course this is problematic, since genes with high ranks are not necessarily activated, just highly expressed.
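Read as code, my guess amounts to the following (an assumption on my part, not the published implementation):

```python
import numpy as np

def states_from_signed_p(signed_p, alpha=0.05):
    """Map signed P-values to PARADIGM states: 1 (activated) when
    0 < p <= alpha, -1 (deactivated) when -alpha <= p < 0, and
    0 (no change) otherwise."""
    up = (signed_p > 0) & (signed_p <= alpha)
    down = (signed_p < 0) & (signed_p >= -alpha)
    return np.where(up, 1, np.where(down, -1, 0))
```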

They also analyze a glioblastoma cohort from TCGA, in which they take advantage of a small number of "normal" samples from tissue adjacent to the tumour:

The glioblastoma data from TCGA (2008) was obtained from the TCGA data portal providing gene expression for 230 patient samples and 10 adjacent normal tissues on the Affymetrix U133A platform. The probes for the patient samples were normalized to the normal tissue by subtracting the median normal value of each probe. ... [The dataset was] non-parametrically normalized using the same procedure as the breast cancer data
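In code, that per-probe centering is simply (my own rendering of the quoted methods):

```python
import numpy as np

def subtract_median_normal(tumor, normal):
    """tumor: probes x tumour samples; normal: probes x normal samples.
    Center each probe on the median of the pooled adjacent normals,
    as the TCGA glioblastoma methods describe."""
    return tumor - np.median(normal, axis=1, keepdims=True)
```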

Trying to obtain differential information for a tumour sample by pooling a set of adjacent normals is flawed (or at least controversial): the cell-type composition of the surrounding normal tissue differs strongly from that of the tumour, which is thought to be composed largely of undifferentiated, dividing progenitors that are present at low concentration in normal tissue. The change in expression between tumour and normal is therefore confounded by the differing expression profiles of stem cells and differentiated cells.

Finally, there is this pretty clear statement on the Paradigm help page:

Internally the system will attempt to remove platform biases by performing a non-parametric normalization technique. This is accomplished by rank-ordering the entire datafile and assigning to each point the new value rank / total. That would make the lowest value in a datafile with 10 samples and 10 genes 1/100, and the highest value 100/100. Understanding the discrete values each gene/sample is going to be assigned is useful for interpreting the results, and changing the discrete cutoffs can cause the pathway results to vary widely.
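Taken literally, that procedure would look something like this (my sketch of the quoted description, not the PARADIGM source):

```python
import numpy as np
from scipy.stats import rankdata

def global_rank_discretize(expr, disc=(0.33, 0.67)):
    """expr: genes x samples. Rank every value in the whole matrix,
    rescale to rank/total (lowest -> 1/N, highest -> N/N), then apply
    one pair of "disc" cutoffs to the pooled ranks."""
    frac = rankdata(expr, method="average").reshape(expr.shape) / expr.size
    lo, hi = disc
    return np.where(frac <= lo, -1, np.where(frac > hi, 1, 0))
```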

So it appears that the whole file is used for discretization. That means that unless the data have already been processed to represent differential states (a dicey proposition for patient tumour samples), the inference will be based not on "activated" and "deactivated" genes but on highly and lowly expressed ones. I would expect this to limit the algorithm's overall ability to detect consistent perturbations across pathways and between patients.
