How can I re-format my DNA motifs' position weight matrices (PWMs)?
10.1 years ago
a1ultima ▴ 840

I am working with a set of DNA motifs that are predicted as potential regulatory motifs (e.g. transcription factor binding sites). The motifs belong to several species, and I wanted to cluster these motifs via their Position Weight Matrices (PWMs) (also known as PSSMs) to collapse similar motifs together into groups.

A tool called MATLIGN (website here) does what I need, but its required PWM format differs from what I have. The documentation states:

"Matrices must be in the frequency matrix format (only integer numbers are acceptable)"

The problem is that my PWMs contain decimals rather than integers, e.g.:

     A        C        G        T
1    0.000000 1.000000 0.000000 0.000000
2    1.000000 0.000000 0.000000 0.000000
3    0.000000 0.000000 1.000000 0.000000
4    0.000000 0.421755 0.000000 0.578245
5    0.289407 0.000000 0.282556 0.428038

In other words, instead of the decimal values I have in my matrix I need to have integer counts. Could anybody suggest what I can do? Would I need to create artificial counts?


That looks a lot like a position frequency matrix (PFM) where the counts were divided by the row total. Unless you know that a background nucleotide frequency was taken into account, you can probably just multiply everything by a constant and round to turn the values into 'counts'. Alternatively, you can use a tool like TOMTOM, which doesn't require integers.
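
One quick way to test the "counts divided by the row total" guess is to check that every row sums to roughly 1. Below is a minimal sketch using the example matrix from the question; the variable names and the tolerance of 1e-4 are my own assumptions, chosen to absorb the six-decimal rounding in the file.

```python
import math

# Rows are positions 1-5; columns are A, C, G, T (copied from the question).
pwm = [
    [0.000000, 1.000000, 0.000000, 0.000000],
    [1.000000, 0.000000, 0.000000, 0.000000],
    [0.000000, 0.000000, 1.000000, 0.000000],
    [0.000000, 0.421755, 0.000000, 0.578245],
    [0.289407, 0.000000, 0.282556, 0.428038],
]

# If each row sums to ~1, the values are most likely plain per-position
# proportions rather than log-odds scores against a background.
row_sums = [sum(row) for row in pwm]
is_row_normalized = all(math.isclose(s, 1.0, abs_tol=1e-4) for s in row_sums)

print(row_sums)
print("rows sum to 1:", is_row_normalized)
```

If the check fails (for example, negative values or rows that sum to something far from 1), the matrix probably holds scores rather than proportions, and simple scaling would not give meaningful counts.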


@UnivStudent: I have actually used TOMTOM before, but unfortunately it only does pairwise motif comparisons. I was hoping to use a more advanced method that also carries out clustering.


A PWM contains less information than the actual counts. Where did you obtain the PWMs from? Try to find the counts or the actual sequences as well.

10.1 years ago
a1ultima ▴ 840

After looking more closely at my data, I noticed that with each of the PWMs I had there was a value called nSites.

It turns out that nSites refers to the number of DNA sequence regions, or sites, that were used to originally generate the PWMs.

Solution:

With this I was able to convert my PWMs into integer counts by multiplying the proportions by the nSites value.
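
For anyone who wants to do the same, here is a rough sketch of the conversion in Python. The function name `pwm_to_counts` and the nSites value of 100 are hypothetical, picked only for illustration; use the nSites value that comes with each of your own motifs. The matrix is the example from the question.

```python
def pwm_to_counts(pwm_rows, n_sites):
    """Scale each row of proportions by n_sites and round to the nearest integer."""
    return [[int(round(p * n_sites)) for p in row] for row in pwm_rows]


# Rows are positions 1-5; columns are A, C, G, T.
pwm = [
    [0.000000, 1.000000, 0.000000, 0.000000],
    [1.000000, 0.000000, 0.000000, 0.000000],
    [0.000000, 0.000000, 1.000000, 0.000000],
    [0.000000, 0.421755, 0.000000, 0.578245],
    [0.289407, 0.000000, 0.282556, 0.428038],
]

n_sites = 100  # hypothetical value; read the real one from your motif file

counts = pwm_to_counts(pwm, n_sites)

print("A C G T")
for position, row in enumerate(counts, start=1):
    print(position, *row)
```

Because each cell is rounded independently, a row total can occasionally differ from nSites by one. If the downstream tool is strict about that, it may be worth checking the row sums after conversion.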
