Remove rows in a matrix containing Hamming distance less than a set value
1
0
Entering edit mode
8.3 years ago
confusedious ▴ 470

Hello everyone,

Using the dist.hamming function of 'phangorn' in R I have created a matrix of raw Hamming distance scores for a nucleotide alignment. Let's call this object 'matrix'.

I would like to remove any rows in this matrix object that contain a Hamming distance equal to or less than a given value, in any column.

I am aware this may read as a fairly basic R question, but I have not found a way to do this.

For example, I found this suggestion:

matrixless <- matrix[rowSums(matrix>=100)==ncol(matrix),]

Other ways that come to mind involve looping, and I'm sure it doesn't need to be that complex.

But this seems to return zero rows regardless of the value I use.

I would appreciate any help you may have.

R alignment • 6.1k views
ADD COMMENT
1
Entering edit mode

Can you be more specific?

So do you want to remove any row that has a value less than or equal to any values in their corresponding column?? Can you give a small toy example?

ADD REPLY
0
Entering edit mode

Sure.

Let us say we have an alignment of 3 sequences as a small example, with Hamming distances given as follows in a matrix:

     [,1] [,2] [,3] 
[1,]    0    1    3 
[2,]    1    0    7 
[3,]    3    7    0

Let us say that I think that a Hamming distance of 1 is too small. I would like to remove any column with a Hamming distance equal to or less than 1 (I am aware that it cannot be lower than one in this example, but just go along with me here). Ideally, I'd like row one to be gone.

In reality, I have an alignment of around 6000 sequences, and I am trying to isolate a more manageable subset of these that is most representative of the diversity in the alignment.

ADD REPLY
1
Entering edit mode

This is unlikely to be a correct example, at least the diagonal should be 0! And Hamming distance should be symmetric.

ADD REPLY
1
Entering edit mode

Btw, there's a problem here, you have created a distance matrix, it should be symmetric shouldn't it: dist(A,B) = dist(B,A) ?

Then you cannot only remove rows, but you need to remove the corresponding columns as well, to keep the matrix symmetric!

ADD REPLY
0
Entering edit mode

Apologies, yes it was a clumsy example, and I have edited what I had first added - of course the real matrix is symmetrical and I would need to remove both rows and columns.

With this in mind, how do I do this?

ADD REPLY
3
Entering edit mode
8.3 years ago
Michael 54k
#Edit: if the matrix is holding pairwise distances, keep it symmetric
stopifnot(isSymmetric(my.matrix)) ## check if it is symmetric in the first place
diag(my.matrix) <- my.threshold + 1
ind <- apply(my.matrix, 1, min) > my.threshold # it is symm. so it doesn't matter if we 
#use rows or cols
diag(my.matrix) <- 0 # put the diag back to 0
# make a distance matrix from it, this is useful for most applications like clustering:
my.dist <- as.dist(my.matrix[ind,ind])

# 1)
matrixless <- my.matrix[apply(my.matrix, 1, min) > my.threshold, ] 
# filter by row minimum 
# using my.threshold as a threshold

# 2)
# another way to express the same thing, might or might not be faster
matrixless <- my.matrix[!apply(m, 1, function (x) {any(x<=my.threshold)}), ]
ADD COMMENT
0
Entering edit mode

This looks like a good prospect, thank you.

What does the '1' in

my.matrix[apply(my.matrix, 1, min) > my.threshold, ]

Stand for?

ADD REPLY
1
Entering edit mode

1 means, apply function min to rows, 2 means apply to columns, see ?apply. We have to use apply, because there is no function rowMins in R (nlike rowSums).

E.g. if you wanted to filter the columns instead of rows you could:

my.matrix[ ,apply(my.matrix, 2, min) > my.threshold] # filter columns
ADD REPLY
0
Entering edit mode

I have tried using this function using c(1, 2) instead of 1 or 2 in order to have it affect both rows and columns as would be needed for a symmetrical matrix (as per the documentation for the function 'apply'), but it is returning a subscript too long error.

Sorry again for the clumsy initial question. Would you have any insight as to what is happening here?

ADD REPLY
1
Entering edit mode

Check the edited answer please.

ADD REPLY
0
Entering edit mode

Thank you for the edit. This looks like a step forward, but I just want to clarify the outcomes of using this.

Just to give some information on my actual dataset, it is has 187 rows/columns, with values ranging from 0 (diagonal) to 1796, with a mean of around 800. The first quartile for most samples is around 100, so for my practice runs I set the cutoff at 100 to see if I only retain comparisons where the minimum is greater than 100. When I run the following:

my.threshold <- 100
stopifnot(isSymmetric(my.matrix))
ind <- apply(my.matrix, 1, min) > my.threshold

It seems to return a Boolean dataset, with all values being FALSE, and the same happens if I lower my.threshold to 10. Is this because every row/column contains a zero in the form of the diagonal and thus no row/column is retained?

When I then run:

my.dist <- as.dist(my.matrix[ind,ind])

The result when I examine the my.dist object is:

dist(0)
ADD REPLY
1
Entering edit mode

It seems to return a Boolean dataset, with all values being FALSE, and the same happens if I lower my.threshold to 10. Is this because every row/column contains a zero in the form of the diagonal and thus no row/column is retained?

Yes, you are right, that's in fact something I should have thought about. A quick fix would be to simply set the diagonal to the threshold, see another edit. That's why filtering distance matrix is a bit more tricky, if you'd filter the opposite way with a maximum distance threshold, you would retain all rows/columns if you forget about the diagonal.

One cannot simply set all 0 entries to threshold because there might be other entries apart from the diagonale that have low distance.

ADD REPLY
0
Entering edit mode

Thank you again for this - this seems to be working quite well.

I have a larger matrix made from an alignment of around 7000 sequences. I will try this out with that when I am back in the lab tomorrow and see how it goes.

ADD REPLY
1
Entering edit mode

You're welcome, if you run into memory problems or need more advanced matrix operations or sparse matrices, try the Matrix package in R.

ADD REPLY
0
Entering edit mode

I have this working well with my larger alignment - no memory issues at this point as I have a lot on tap (it uses about 7gb at the most and I have 32gb). I will refine it further if I need to wheel the script out again.

Thank you again for all of the help.

ADD REPLY

Login before adding your answer.

Traffic: 2382 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6