Question

Rating The Affinity Of A Transcription Factor For A Region Of DNA

1

10.8 years ago

KCC ★ 4.1k

I am looking for a way to estimate, genome-wide, the affinity that a transcription factor would have for a particular DNA sequence. The raw data I have is a position weight matrix (PWM). I am seeking output across the whole genome in wiggle or bedGraph format. Assuming the PWM describes a k bp motif, I would want a wiggle file that gives the affinity of the k-mer centered on each position in the genome. Is there a tool for this kind of task?

I found this suggestion for rolling my own solution on biology stackexchange, http://bioinformatica.upf.edu/T12/MakeProfile.html

Basically, I use the PWM as a lookup table and sum the values I look up. (It shouldn't take more than a few minutes to write this up in Python.) That said, I would like to avoid reinventing the wheel if I don't have to.
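A minimal sketch of the lookup-and-sum idea, assuming the PWM is already in log-odds form and stored as one dict per motif position. The matrix values, the `TOY_PWM` name, and the function names are hypothetical and chosen only for illustration. Only the forward strand is scanned, and each score is assigned to the center base of its window.

```python
# Hypothetical 3 bp log-odds PWM (made-up values): the motif "ACG" scores highest.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]

def score_window(pwm, kmer):
    """Sum the PWM entry for each base; return None if a base (e.g. N) is not in the matrix."""
    total = 0.0
    for column, base in zip(pwm, kmer):
        if base not in column:
            return None
        total += column[base]
    return total

def scan(pwm, seq):
    """Yield (0-based window start, score) for every scorable k-mer on the forward strand."""
    k = len(pwm)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        score = score_window(pwm, seq[start:start + k])
        if score is not None:
            yield start, score

def to_bedgraph(chrom, pwm, seq):
    """Return bedGraph lines (0-based, half-open), one per window, placed at the window center."""
    k = len(pwm)
    lines = []
    for start, score in scan(pwm, seq):
        center = start + k // 2
        lines.append(f"{chrom}\t{center}\t{center + 1}\t{score:.3f}")
    return lines

if __name__ == "__main__":
    for line in to_bedgraph("chr1", TOY_PWM, "TACGT"):
        print(line)
```

A real scan would also score the reverse complement and stream the genome chromosome by chromosome rather than hold it in one string; both are left out here to keep the sketch short.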

I would also like to know how to score a region, such as a peak or a promoter, for the affinity a transcription factor would have for it, given a known PWM. Should one just take some average of the per-position scores (mode, median, arithmetic mean or geometric mean)?
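As a rough illustration of the options, here is a sketch that collapses the per-window scores of one region into a few candidate summaries. The function name and the choice of summaries are assumptions, not an established recommendation. The log-sum-exp entry assumes natural-log scores and is included only as one way to let several moderate sites add up. A geometric mean is left out because log-odds scores can be negative.

```python
import math
import statistics

def summarize_region(scores):
    """Collapse the window scores of one region into several candidate summaries."""
    scores = list(scores)
    if not scores:
        raise ValueError("region contains no scorable windows")
    top = max(scores)
    # Numerically stable log(sum(exp(s))): factor out the maximum before exponentiating.
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return {
        "max": top,                            # best single site in the region
        "mean": statistics.mean(scores),       # arithmetic mean over all windows
        "median": statistics.median(scores),   # robust to a few extreme windows
        "log_sum_exp": log_sum_exp,            # dominated by strong sites, rewards multiple sites
    }

if __name__ == "__main__":
    print(summarize_region([-3.0, 3.0, -3.0]))
```

Because most windows in a region score poorly, the mean and median mostly reflect background sequence, whereas the maximum and the log-sum-exp are driven by the best-matching sites.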

motif • 3.8k views

ADD COMMENT • link updated 10.8 years ago by Pierre Lindenbaum 161k • written 10.8 years ago by KCC ★ 4.1k

0

It should be noted that the score that a PSSM or PWM returns for a given DNA sequence is only very rarely a readout of affinity to that region.

Usually it is simply a metric for how well the DNA seq agrees with the model derived from the known instances of binding sequences. Even if the score is PERFECT, it has very little predictive power for whether your TF will find a nice tightly binding home there. The only way that the type of analysis you are describing would have a reasonable chance of approaching usefulness at genome-scale prediction is if you incorporate more information like conservation of a high scoring seq in multiple orthologous regions, or scoring 2 or more TFs that are known to cooperate as a module (even that is asking a LOT).

If you can define a set of foreground and background regions, you could use hypergeometric statistics to derive a p-value for the enrichment of hits in your foreground regions as compared to all regions.
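A minimal sketch of that hypergeometric test using only the standard library. The function and parameter names are hypothetical. It assumes each region is simply classified as containing a hit or not, and that the foreground is a subset of all regions.

```python
from math import comb

def hypergeom_enrichment_p(total_regions, regions_with_hit,
                           foreground_size, foreground_with_hit):
    """Upper-tail p-value: chance of seeing at least `foreground_with_hit` hit-containing
    regions when `foreground_size` regions are drawn at random, without replacement,
    from `total_regions` regions of which `regions_with_hit` contain a hit."""
    upper = min(regions_with_hit, foreground_size)
    denominator = comb(total_regions, foreground_size)
    tail = sum(
        comb(regions_with_hit, i)
        * comb(total_regions - regions_with_hit, foreground_size - i)
        for i in range(foreground_with_hit, upper + 1)
    )
    return tail / denominator

if __name__ == "__main__":
    # Made-up counts: 1000 regions, 100 with a hit, 50 in the foreground, 15 of those with a hit.
    print(hypergeom_enrichment_p(1000, 100, 50, 15))
```

If SciPy is available, `scipy.stats.hypergeom.sf` computes the same tail probability; the pure-Python version is shown so the counting is explicit.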

Sorry to be a downer but I think you need to rethink your approach to this. Or at least better describe your goals here. There is no reason (or as far as I can tell even ANY utility) to plotting this out on a genomic scale. Eyeballing this stuff is an exercise in pareidolia.

ADD REPLY • link 10.8 years ago by wadunn83 ▴ 90

0

I do not completely agree here. An attractive feature of PWMs is that they can produce scores that are correlated with the energetic binding affinity between a protein and a DNA sequence (Probabilistic Code for DNA Recognition by Proteins of the EGR Family. Journal of Molecular Biology; SAMIE: Statistical algorithm for modeling interaction energies. In: Pacific Symposium on Biocomputing; Additivity in protein-DNA interactions: how good an approximation is it? Nucleic acids research). I have been working with PWMs a lot, and they usually correlate extremely well with experimental binding affinity data coming from MITOMI or PBM, for instance (Evaluation of methods for modeling transcription factor sequence specificity. Nature biotechnology).

If you scan the genome with a PWM, you may expect a high rate of false positives, in the sense that you might capture sequences that could be bound in vitro but will not be bound in vivo because of epigenetic modifications, chromatin accessibility, etc.

ADD REPLY • link 10.8 years ago by Anthony Mathelier ▴ 910

0

I am afraid that your examples are almost exactly what I was talking about. Note that I said rarely, not never.

I will admit that I have not read and digested all of your papers in their entirety, but I have read their abstracts. They all either use SELEX or some other affinity-sensitive method of PWM assessment (data that is absent for the vast majority of PWMs), and they focus on Zn finger binders (some of the best understood binding interactions, but by no means the only motif used in binding DNA).

The majority of TF PWMs are derived from the frequency of bases at known or predicted binding locations. Frequency not only does not directly imply affinity, but it theoretically should not be able to. Take for example the pattern-forming TFs in Dmel (things like dorsal and the Hox family). They set up gradients of TF concentration from pole to pole or dorsal to ventral. TFBS locations at the concentrated pole are low-affinity binding locations (kinetics allows this because of the high concentration of TF). The opposing pole has the lowest concentration, and the TFBSs responsible for driving gene expression there have the HIGHEST affinities because of it. The majority of bound locations lie in between and display intermediate binding affinities. If one were to build a PWM from these data, the positions with the highest base frequencies would NOT be the ones responsible for producing the tightest binding sites.
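To make concrete what a frequency-derived PWM encodes, here is a sketch that turns per-position base counts from a set of aligned sites into log-odds scores. The counts, the pseudocount of 1, and the uniform background are all assumptions chosen for illustration. Nothing in the calculation uses a measured binding energy: the score only reflects how often each base was seen in the input sites.

```python
import math

BASES = "ACGT"

def counts_to_log_odds(counts, background=None, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores against a background.

    counts: list of dicts mapping each base to how often it occurs at that position
    in the aligned binding sites (hypothetical input format).
    """
    background = background or {b: 0.25 for b in BASES}
    pwm = []
    for column in counts:
        # Add the pseudocount to every base so unseen bases do not give log(0).
        total = sum(column[b] for b in BASES) + len(BASES) * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm

if __name__ == "__main__":
    # Made-up alignment of 8 sites: position 1 is always A, position 2 is mixed.
    toy_counts = [
        {"A": 8, "C": 0, "G": 0, "T": 0},
        {"A": 2, "C": 2, "G": 2, "T": 2},
    ]
    for column in counts_to_log_odds(toy_counts):
        print({b: round(v, 3) for b, v in column.items()})
```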

The majority of PWMs in databases like TRANSFAC and JASPAR are not calibrated to emit binding affinities but to emit probabilities that the tested site agrees with the described set of locations. If you have SELEX or actual affinity-based PWMs this does not apply to you, but (1) you are in a fortunate minority, and (2) you still can't expect to distinguish real from false positives at the genome scale because of multiple hypothesis testing and other biological issues, such as site availability due to chromatin structure (as you rightly pointed out) and possible competitive binding with other TFs.

The OP would be better served by coming up with a region of interest using good biological reasons then look THERE and only there. And even that means very little without experimental confirmation.

Believe me when I say I wish it were different. I have been beating my head against these issues for 3 years.

ADD REPLY • link 10.8 years ago by wadunn83 ▴ 90

0

I will not argue forever on this. I agree that PWMs are approximations usually derived from frequencies, but they correlate remarkably well with binding affinities. You can look at energy-based PWMs derived from PBM data, or at the correlation with MITOMI data. I was amazed to get such high correlations overall (there is of course noise, but the trend is there). I agree that genome-scale predictions made without adding in biology do not make sense. But within regions of interest, a PWM can be a good approximation (which needs validation of course, like every prediction made using bioinformatics).

ADD REPLY • link 10.8 years ago by Anthony Mathelier ▴ 910

0

Hi, I am wondering: if I use a PWM derived from SELEX, does it still retain the strong-binding-site property after I transform the data into a PWM?

As far as I know, SELEX can detect strong binding sites, while ChIP-seq can detect both strong and weak binding sites.

ADD REPLY • link 9.8 years ago by michaelchen33 ▴ 20

score 1 · Answer 1 · 2013-06-24

1

10.8 years ago

Pierre Lindenbaum 161k

As far as I understand, HaploReg uses PWMs to "annotate variants by their effect on regulatory motifs".

"PWMs were then scored for instances that passed a threshold of p < 4^-7 (see Touzet et al.)."
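For readers wondering how a p-value threshold like the one quoted maps onto a PWM score cutoff, here is a simplified sketch. It is not the algorithm from the cited paper and not HaploReg's code. It assumes an independent, uniform background, rounds scores to a fixed precision, and builds the exact score distribution of a random k-mer one motif position at a time. The `TOY_PWM` values and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical 3 bp log-odds PWM (made-up values): the motif "ACG" scores highest.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]

def score_distribution(pwm, background=None, precision=10):
    """Return {integer score: probability} for a random k-mer under the background.

    Scores are rounded to 1/precision units so that partial sums can be merged.
    """
    background = background or {b: 0.25 for b in "ACGT"}
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for partial, prob in dist.items():
            for base, bg_prob in background.items():
                nxt[partial + round(column[base] * precision)] += prob * bg_prob
        dist = dict(nxt)
    return dist

def threshold_for_pvalue(pwm, p, background=None, precision=10):
    """Lowest score t with P(score >= t) <= p, or None if no score is that rare."""
    dist = score_distribution(pwm, background, precision)
    tail = 0.0
    best = None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p:
            break
        best = score
    return None if best is None else best / precision

if __name__ == "__main__":
    print(threshold_for_pvalue(TOY_PWM, 0.02))
```

With a motif this short, even the best possible match occurs by chance once every 64 positions, so stringent thresholds are unreachable and the function returns `None`; longer motifs are needed before small p-values correspond to any achievable score.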

ADD COMMENT • link 10.8 years ago by Pierre Lindenbaum 161k