Question

Selecting Gene Subsets Such That No Two Genes Occur In The Same Set More Than "T" Times

0

Entering edit mode

10.9 years ago

learnerforever ▴ 520

In one of my gene sets, I am interested in selecting gene subsets such that no two genes occur in the same set more than "t" times. Mathematically, given n objects and selecting k objects without replacement, how many combinations exist such that no two objects are in the same group more than "t" times?

E.g, If we have five genes A, B, C, D, E and pick 5 choose 3 = 10 combinations with t =2: But gene combination A B is present three times and hence out of three possibilities

A B C

A B D

A B E

I will select only 2 (=t) combinations randomly.

Is there a closed form solution for finding out how many combinations exist and listing them out for say n = 40, k=5, and t=3?

gene • 2.0k views

ADD COMMENT • link updated 10.9 years ago by KCC ★ 4.1k • written 10.9 years ago by learnerforever ▴ 520

score 0 · Answer 1 · 2013-05-13

0

Entering edit mode

10.9 years ago

Leandro Lima ▴ 970

I think that is C(40,5) - C(38,3) + t

ADD COMMENT • link 10.9 years ago by Leandro Lima ▴ 970

0

Entering edit mode

Thanks, but I would also appreciate some explanation of your solution. How did you arrive at this?

ADD REPLY • link 10.9 years ago by learnerforever ▴ 520

0

Entering edit mode

Sorry. I thought a bit more and realized that it is not that simple. Do you really need a closed form?

Maybe it is easier to simulate and count.

ADD REPLY • link 10.9 years ago by Leandro Lima ▴ 970

0

Entering edit mode

My previous formula is correct only if you have a fixed pair.

ADD REPLY • link 10.9 years ago by Leandro Lima ▴ 970

score 0 · Answer 2 · 2013-05-13

I think mathematically, one tackles a problem like this in this way: Expand this (1+x+x^2+...+x^t)^n and then looking for the x^k component in the final expanded product. Let me know if you need more explanation than the following. (It's a little challenging to express mathematics without LaTeX.)

Essentially, each copy of (1+x+ ... + x^t) represents each gene. Let me explain the encoding. In your example with 5 genes:

If we have five genes A, B, C, D, E and pick 5 choose 3 = 10 combinations with t=2".

There would be five copies of (1+x+x^2). The selection "ABD" is encoded by picking x in the first copy of (1+x+x^2), then x in the next, then 1, then x, then 1. Note that xx1x1 = x^3.

You can get a general formula by expressing your multinomials in terms of combinations using the closed forms found on this page: http://en.wikipedia.org/wiki/Multinomial_theorem

I recently found an approach similar to what I was discussing in this question, that's better explained.

Note that they use Wolfram alpha to get the coefficient they are looking for. You could probably use either a recursive algorithm or dynamic programming to design a program that evaluates a minimal number of products in order to get the coefficient you are interested in.