DAPC different clustering results
5.9 years ago
User000 ▴ 690

Dear all,

I am applying find.clusters and DAPC to my SNP data, and I am interested in K = 2 to 20. However, the clustering results differ every time I re-run the code on the same dataset. I guess this is due to the k-means algorithm that find.clusters uses. Do you know if it is possible to find optimal centers so that the results are stable, or how else to improve this?

grp <- find.clusters(obj1, n.clust = 20, n.pca = 500, stat = "BIC", n.iter = 100000, n.start = 1000)
dapcc <- dapc(obj1, grp$grp, n.pca = 50, n.da = 7)
dapc R find.clusters k-means • 2.9k views
ADD COMMENT

How different are the clusters? Do they fluctuate a lot between runs?

To optimize your clustering, I would run find.clusters several times, say 50, and pick the optimal number of clusters using stat = "BIC", a statistical measure of goodness of fit. To make your results reproducible, call set.seed(x) before running it.

ADD REPLY

No, I get two different versions, and I am interested in one of them. The problem is that I want consistent results across all runs from K = 2 to K = 20. I already know that I want to stop at K = 20 and not go further. So do I literally have to re-run the same command 50 times? Isn't n.start = 1000 already doing this? Could you please explain in more detail? Also, which number should I pass to set.seed(x)? The one that gave the better result? Thank you very much!

ADD REPLY

Update: for n.clust = 20 I used set.seed(20), and it gave the same result 3 times. For n.clust = 19, should I also use set.seed(20)?

ADD REPLY

For each run, use a different value in set.seed(); it can be any integer. If you set it to 20, you will always get the same results whenever you run the same code. Just use different seeds for different runs.
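The effect of the seed can be checked with a minimal sketch. It uses base R kmeans on simulated data as a stand-in for the k-means step inside find.clusters; the toy matrix and the helper run_kmeans are hypothetical and not part of adegenet.

```r
# Toy matrix standing in for PCA scores of the SNP data
# (hypothetical: 60 samples, 5 axes, three shifted groups)
set.seed(1)
scores <- rbind(matrix(rnorm(100, mean = 0), ncol = 5),
                matrix(rnorm(100, mean = 3), ncol = 5),
                matrix(rnorm(100, mean = 6), ncol = 5))

# Hypothetical helper: fix the seed, then cluster
run_kmeans <- function(seed, k = 3) {
  set.seed(seed)  # fixes the random choice of starting centres
  kmeans(scores, centers = k, nstart = 10, iter.max = 100)$cluster
}

a <- run_kmeans(20)
b <- run_kmeans(20)

# Same seed and same code give the same assignments
same_seed_same_result <- identical(a, b)
print(same_seed_same_result)
```

The seed value itself carries no meaning; what matters is recording it so the run can be repeated.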

ADD REPLY

Yes, but I want consistent results, similar to ADMIXTURE: going from K = 2 to K = 20, I want to see the species separate consistently until they end up in 20 different clusters. I hope that explains what I mean.

ADD REPLY

To run K = 2 to 20 you should use max.n.clust instead of n.clust; otherwise you are only running k-means once, for a single value of K. From the documentation:

n.clust an optional integer indicating the number of clusters to be sought. If provided, the function will only run K-means once, for this number of clusters. If left as NULL, several K-means are run for a range of k.

max.n.clust an integer indicating the maximum number of clusters to be tried. Values of 'k' will be picked up between 1 and max.n.clust.
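As a rough sketch of a scan over K with max.n.clust, assuming adegenet is installed: the simulated data frame x stands in for obj1, and the values of n.pca, n.start, max.n.clust and the "min" criterion are illustrative choices, not recommendations. Setting choose.n.clust = FALSE avoids the interactive prompt.

```r
library(adegenet)

# Toy data (hypothetical): 60 samples x 20 variables, three shifted groups
set.seed(1)
x <- as.data.frame(rbind(matrix(rnorm(400, mean = 0), ncol = 20),
                         matrix(rnorm(400, mean = 2), ncol = 20),
                         matrix(rnorm(400, mean = 4), ncol = 20)))

set.seed(20)  # makes the scan reproducible
k_scan <- find.clusters(x, max.n.clust = 10, n.pca = 10, stat = "BIC",
                        choose.n.clust = FALSE, criterion = "min",
                        n.iter = 1e5, n.start = 50)

k_scan$Kstat        # BIC for each K that was tried
table(k_scan$grp)   # group sizes for the single K that was retained
```

Note that the returned grp holds assignments only for the retained K, not for every K in the range.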

ADD REPLY

So, if I run with max.n.clust = 20, how can I then get the membership values for K = 2 to 20?

ADD REPLY

When you run find.clusters this way, the optimal number of clusters to retain is assessed from the BIC values, and the group assignments for that retained K are stored in grp (as grp$grp), the same way as before.
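If memberships are needed for every K rather than only the retained one, one possible approach (a sketch, not necessarily what the answer above had in mind) is to loop over K with n.clust and a fixed seed, then run dapc on each grouping and keep its posterior membership probabilities. The simulated data frame x stands in for obj1, and k_range, n.pca, n.start and n.da = k - 1 are illustrative assumptions.

```r
library(adegenet)

# Toy data (hypothetical): 60 samples x 20 variables, three shifted groups
set.seed(1)
x <- as.data.frame(rbind(matrix(rnorm(400, mean = 0), ncol = 20),
                         matrix(rnorm(400, mean = 2), ncol = 20),
                         matrix(rnorm(400, mean = 4), ncol = 20)))

k_range <- 2:6   # would be 2:20 on the real data
set.seed(20)     # set once, so the whole loop is reproducible

results <- lapply(k_range, function(k) {
  # k-means for exactly k clusters
  grp <- find.clusters(x, n.clust = k, n.pca = 10, n.iter = 1e5, n.start = 50)
  # DAPC on that grouping; posterior has one column per cluster
  fit <- dapc(x, grp$grp, n.pca = 10, n.da = k - 1)
  list(grp = grp$grp, posterior = fit$posterior)
})
names(results) <- paste0("K", k_range)

dim(results$K4$posterior)   # samples x clusters for K = 4
```

Cluster labels are arbitrary in k-means, so cluster 1 at one K need not correspond to cluster 1 at the next; comparing runs across K requires matching groups by their members, not by label.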

ADD REPLY
