Question

Dissimilarity measure for computing similarity between categorical variables in R

0

6.1 years ago

svlachavas ▴ 790

Dear Community,

Based on a data table in R (read from a txt file), I would like to implement a distance metric, or a similar approach, to compute the dissimilarity between categorical variables, which in this case are drugs. My ultimate goal is to use the gene symbols associated with these drugs (the total overlap of the common.up and common.down genes below) to find the most "dissimilar" pairs of drugs, i.e. the pairs with the smallest percentage of overlap.

A snapshot of my data table for the first two rows is the following:

head(drugs,2)
      experiment_id     Score cell_line   chemical hours
19999        LJP005 0.2688172      HT29  PD-184352   24H
19980        LJP005 0.2365591      HT29 trametinib   24H
                                                                                                                              common.up
19999 c(NOP56, PAICS, COL1A1, COL3A1, DKC1, TPX2, BOP1, IARS, MCM2, MCM4, MCM7, MYC, NME1, PPAT, NAT10, NHP2, AURKA, CAD, EEF1E1, CDK1)
19980      c(EMG1, NOP56, PAICS, DKC1, TPX2, BOP1, HPRT1, IARS, MCM2, MCM4, MCM7, MYC, NME1, PPAT, NHP2, RAN, AURKA, CAD, EEF1E1, CDK1)
                            common.down dosage..uM.
19999 c(TXNIP, FGFR2, ANK3, HMGCL, CAT)       10.00
19980                   c(FGFR2, HMGCL)        1.11

Any ideas or suggestions about which metrics or approaches would be robust for this purpose?

R clustering similarity categorical features • 2.9k views

6.1 years ago by svlachavas ▴ 790

1

Probably Jaccard similarity.
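
For illustration, here is a minimal sketch of the Jaccard similarity applied to the two common.up gene lists shown in the question. The helper name `jaccard` is hypothetical (it is not taken from any package), and the gene vectors are simply retyped from the printed rows.

```r
# Up-regulated gene sets retyped from the two rows printed in the question.
pd184352_up <- c("NOP56", "PAICS", "COL1A1", "COL3A1", "DKC1", "TPX2", "BOP1",
                 "IARS", "MCM2", "MCM4", "MCM7", "MYC", "NME1", "PPAT", "NAT10",
                 "NHP2", "AURKA", "CAD", "EEF1E1", "CDK1")
trametinib_up <- c("EMG1", "NOP56", "PAICS", "DKC1", "TPX2", "BOP1", "HPRT1",
                   "IARS", "MCM2", "MCM4", "MCM7", "MYC", "NME1", "PPAT", "NHP2",
                   "RAN", "AURKA", "CAD", "EEF1E1", "CDK1")

# Hypothetical helper: |A intersect B| / |A union B| for two character vectors.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  n_union <- length(union(a, b))
  if (n_union == 0) return(NA_real_)  # undefined for two empty sets
  length(intersect(a, b)) / n_union
}

sim <- jaccard(pd184352_up, trametinib_up)  # 17 shared genes out of 23 in total
dissim <- 1 - sim                           # Jaccard distance
```

For these two drugs the sets share 17 of 23 distinct genes, so the similarity is about 0.74 and the distance about 0.26.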

6.1 years ago by h.mon 35k

0

Dear h.mon,

Thank you for your answer. However, I have already used the Jaccard coefficient to rank these resulting experiments (the Score column above), in a way similar to the overrepresentation analysis described in a previous post (https://www.biostars.org/p/299820/#300014).

So I would like to use a different method or approach. For example, would cosine similarity fit my goal, in your opinion?
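
To make the comparison concrete, here is a small sketch of cosine similarity on binary membership vectors, using the two common.down gene lists from the question. For 0/1 vectors the cosine reduces to the shared count divided by the geometric mean of the two set sizes (often called the Ochiai coefficient); the variable names are illustrative only.

```r
# Down-regulated gene sets retyped from the two rows printed in the question.
down_a <- c("TXNIP", "FGFR2", "ANK3", "HMGCL", "CAT")  # PD-184352
down_b <- c("FGFR2", "HMGCL")                          # trametinib

# Encode each set as a 0/1 membership vector over the union of genes.
universe <- union(down_a, down_b)
va <- as.numeric(universe %in% down_a)
vb <- as.numeric(universe %in% down_b)

# Cosine similarity of the two binary vectors.
cosine <- sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))

# Same quantity written directly on the sets.
n_shared <- length(intersect(down_a, down_b))
ochiai <- n_shared / sqrt(length(down_a) * length(down_b))

# Jaccard similarity on the same sets, for comparison.
jaccard_sim <- n_shared / length(union(down_a, down_b))
```

Here the cosine is 2 / sqrt(10), about 0.63, while Jaccard is 2 / 5 = 0.4. Both are driven by the same overlap and differ only in how they normalise for set size.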

6.1 years ago by svlachavas ▴ 790

2

I don't know a proper answer for your question, for two reasons:

1) I am not an expert on similarity measures,

2) you do not say what you think is important and should be captured by the similarity measure - that is, you do not define similarity for your problem.

While you have no power to improve my knowledge on similarity indexes, you certainly do have power to think about the similarity you want to capture. Do non-common genes matter? Or only shared genes matter? Or even instead of looking at a subset of the genes (applying a cut-off and discarding "non-significant" genes), why not measure the similarity of changes in expression of the whole set of genes?

As you described the problem, I think Jaccard is the most appropriate. Why are you unsatisfied with it?

As a side note, if you have already used Jaccard similarity and are interested in alternatives or improvements, state that in your question. It avoids wasting time, both yours and ours (the people potentially writing answers).

6.1 years ago by h.mon 35k

0

Dear h.mon,

Thank you for your answer, and please excuse me if I was not clear or did not provide enough information about my approach. Two quick comments on this matter:

1) The initial Jaccard similarity I mentioned is used to rank the gene sets from a drug-gene database (L1000) against my input DE genes, like an overrepresentation analysis.

2) My next goal, based on these ranked experiments/drugs, is to identify the most dissimilar pairs of drugs/experiments, i.e. those that share the fewest identified genes from my initial signature. That is why I also asked for alternative metrics.

Thus, in your opinion, would the Jaccard coefficient (or another similarity measure) be enough in this context to find the most dissimilar pairs of experiments from above, based on their total annotated genes (both up and down)?

6.1 years ago by svlachavas ▴ 790

2

Given your question:

"My ultimate goal is to use the gene symbols associated with these drugs (the total overlap of the common.up and common.down genes below) to find the most "dissimilar" pairs of drugs, i.e. the pairs with the smallest percentage of overlap."

then the answer by @h.mon is relevant. As a general rule, you should choose a measure that captures the relevant properties of similarity between the items. If only the percentage of overlap is relevant (i.e. you want to ignore the sizes of the sets), then use it as the similarity measure. There are plenty of other measures of similarity between sets (also known as binary similarity measures). Check the R package proxy or this survey of binary similarity measures. If this is not what you want, then please clarify what your goal is. It looks like another case of the XY problem.
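
As a rough sketch of how a binary similarity measure could be used to rank drug pairs, the example below builds a drug-by-gene membership matrix and uses base R's `dist(..., method = "binary")`, which gives the Jaccard distance for 0/1 data. It assumes up and down genes are pooled per drug, as the question describes. The first two gene sets are retyped from the question; `toy_drug` and its `GENE_X`/`GENE_Y` entries are hypothetical and only there so that more than one pair exists. Base R is used rather than the proxy package to avoid assuming anything about that package's interface.

```r
# Pooled up + down gene sets per drug. The first two come from the question;
# toy_drug is a HYPOTHETICAL third entry added purely for illustration.
gene_sets <- list(
  `PD-184352` = c("NOP56", "PAICS", "COL1A1", "COL3A1", "DKC1", "TPX2", "BOP1",
                  "IARS", "MCM2", "MCM4", "MCM7", "MYC", "NME1", "PPAT", "NAT10",
                  "NHP2", "AURKA", "CAD", "EEF1E1", "CDK1",
                  "TXNIP", "FGFR2", "ANK3", "HMGCL", "CAT"),
  trametinib  = c("EMG1", "NOP56", "PAICS", "DKC1", "TPX2", "BOP1", "HPRT1",
                  "IARS", "MCM2", "MCM4", "MCM7", "MYC", "NME1", "PPAT", "NHP2",
                  "RAN", "AURKA", "CAD", "EEF1E1", "CDK1",
                  "FGFR2", "HMGCL"),
  toy_drug    = c("MYC", "CDK1", "TXNIP", "GENE_X", "GENE_Y")  # hypothetical
)

# Binary membership matrix: one row per drug, one column per gene.
genes <- sort(unique(unlist(gene_sets)))
m <- t(sapply(gene_sets, function(s) as.numeric(genes %in% s)))
colnames(m) <- genes

# "binary" distance in base R is the Jaccard distance (1 - Jaccard similarity).
dm <- as.matrix(dist(m, method = "binary"))

# List each unordered pair once and sort from most to least dissimilar.
idx <- which(upper.tri(dm), arr.ind = TRUE)
pairs <- data.frame(
  drug_a = rownames(dm)[idx[, "row"]],
  drug_b = colnames(dm)[idx[, "col"]],
  jaccard_dist = dm[idx],
  stringsAsFactors = FALSE
)
pairs <- pairs[order(-pairs$jaccard_dist), ]
```

The top rows of `pairs` are then the candidate "most dissimilar" drug pairs. Swapping in another binary measure only changes how `dm` is computed; the ranking step stays the same.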

6.1 years ago by Jean-Karim Heriche 27k

0

Dear Jean-Karim,

Thank you also for your answer and suggestions. I will look into the R package proxy and inspect the various measures.

6.1 years ago by svlachavas ▴ 790