Biostar Beta. Not for public use.
Question: probability of 2 sequences with defined distance in fully sequenced genomes
0
Entering edit mode

I have some transposons identified using their terminal DNA sequences, lets say 10nt each.

These have been identified based on sequence similarity to a set of pre-defined terminal DNA sequences, so discovery is based on match to a profile / HMM / PSSM if you will.

Note that the lengths of these terminals are not strictly 10nt always, but vary a little. Also, the length of the element, i.e. separation between the terminals is quite variable. Just throwing in these details to provide a more complete picture of the problem. Which is:

I want to calculate a statistic that conveys how non-random the combination of these two terminal DNA sequences is, given the whole genome sequence, their lengths, their composition, their length of separation etc.

For example, I envision that when transposon #1 is flanked by terminal DNA sequences that occur way more often in the genome generally speaking, than for transposon #2, this statistic will convey the higher confidence in transposon #2 than for #1.

Question is what should this statistic be, and how would you advice me to go about calculating it?

Some authors have approached this by randomly shuffling the genome and reporting false discovery rate as the ratio of loci discovered in intact versus shuffled genome. I am not too convinced with this approach. I am not against it however, just want advice on a different statistic to describe the chance or alternatively non-random likelihood of these elements.

I am comfortable with Perl coding, and some R coding. If I have multiple options for calculating / reporting a suitable statistic, I request you keep my skill-set in mind. Thanks folks!

0
Entering edit mode

This is what I think you're trying to say:

• You have some profile for the short terminal sequences at the end of some transposon (from Rfam?), which I'll call "profile1" and "profile2".
• You have scanned the genome on both strands (`N = 2*length_genome`) and produced some "hits" for each, i.e. positions for sequences that score above some threshold or each of these profiles.
• You identified `k` hits for profile1 for which there is also a hit for profile2 occurring within `dmax` base-pairs downstream, on the same strand. Let's call each of these a "success".
• You want to know the expected number of successes if the positions of the hits for profile1 and profile2 were completely independent.

Let's say `n1` and `n2` are the number of hits for profile1 and profile2, respectively. For each profile1 hit, there are `dmax` positions that could be a profile2 hit. If we ignore the fact that consecutive samples are not truly independent and that there is some negligible probability of seeing >1 profile2 hits within `dmax` basepairs of a profile1 hit, the expected number of successes is approximated by `m = n1*dmax*p2`, where `p2 = n2/N`.

This means that the FDR is `k/m`.

You can also calculate a p-value using the Poisson distribution with lambda = m. In R, to calculate the probability P(x>=k) = 1 - P(x<=k-1), you can use:

`ppois(k-1, lambda = m, lower.tail = F)`

Entering edit mode
0

THANK YOU! It makes total sense. Just 2 follow-up questions:

1. I wonder if I should count n for each matched sequence in genome, rather than total match counts to the profile. In other words, if profile1 yields 20 different sequence matches, cumulatively with 100 occurrences, then I think you are saying n1=100. But I am asking if n1 should be split into counts for each of 20 sequences matching profile 1 (in my made-up example). Likewise, sub-categories for matches to profile p2. Which quickly becomes factorial, but still.... is that better or worse or something else?

2. Is there any way (or need) to factor in variable scores of the matching sequences identified by these profiles?

Anand Rao
• 210
Entering edit mode
0

I don't quite understand what #1 would tell you, so you'd have to come up with a good reason why the particular sequences are important. What's the null hypothesis? It's already exceedingly rare to observe any pair of 10bp sequences in close proximity in the genome, but the relevant question, I presume, is whether the number of hits is surprising, not whether their particular sequences are rare.

As for #2, you could get specific FDRs for any pair of variable scores by changing `n1` and `n2` above to the number of hits for each profile that have a score greater than or equal to the scores of a given pair, then recalculating the E-vale (m) and the number of pairs in the new, smaller set (k). This would tell you what the E-value would be if you used the specific scores for that pair as a threshold.