K-Means Clustering Algorithm
I haven't uploaded a new code snippet in quite a while, not because I haven't been coding, I've just been coding in a certain unnamed behemoth game engine that doesn't lend itself to writing anything in 'snippet' form.
However, I am working on a new project for a short period of time and yesterday I needed to solve a problem that, apparently, lots of people have solved before. I have a large set of data points (in this case representing objects at different scales) and I needed to reduce it to a smaller subset of the (N) number of most significant sizes.
This is a common problem, and many developers have seen it when trying to reduce a color-space. (For example reducing an image with say 16k colors to only 256 colors used to be a common thing people needed to do).
One of the more elegant solutions to this problem is by building an octree; at least one version of this algorithm is called 'Wu Quantization'. This was popularly presented in Graphics Gems.
What I needed yesterday was something very quick and dirty and, that turned out to be the K-means clustering algorithm. You can read about it here on Wikipedia and here is another link comparing it to a different algorithm.The algorithm itself is very straightforward.
Let's say you have 100 input data points, but you want to reduce them to say 10 clusters where the center of each cluster represents the mean of the surrounding points.
Step #1 : You seed the 10 clusters with the 10 sample points taken from the input data.
Step #2 : You set an temp array of counts to zero and temp points to zero for each cluster you are building.
Step #3 : You iterate through each of the input data points. For each point you figure out which of the current clusters it is closest to. You then add that point to the new set of centroids and increment a counter.
Step #4 : You accumulate a total distance error of each point from its nearest cluster.
Step #5 : You compute the average center position for each of the centroids and assign them as your new clusters.
Step #6 : You compute the difference between the distance error on this pass and the distance error on the previous pass. If the difference in the error is less than a tolerance passed in, you are done. If the error is still converging, then you go back to step #2 and continue the loop.
My implementation also added the ability to prune the results. Pruning is needed if all of the input points are at the same location, or there are fewer input points than clusters requested.
Another reason to prune the output is in the following example. Let's say you have 100 data points that you want to put into 10 clusters. However, all of those data points are right next to each other.
The standard K-means algorithm will still produce 10 clusters, since it is basically unit-less. However, this implementation allows you to specify a tolerance threshold that says if two clusters are practically next to each other, simply merge them.
The main reason I am posting this code snippet is because of the absolutely abhorrent code snippet I found on the internet.
Click on this link to see the C implementation version that I found on the Internet. Clear as mud right?
kmeans.c
Oh yeah, I would like to also add this. There is a link to a C++ implementation on the Wikipedia page. I downloaded that link and it uses not only the STL but BOOST as well!
Yikes. Mine has a vanilla C style function interface, accepts pointers to 1d, 2d, 3d, or 4d arrays of data points as either a const float pointer or const double pointer and uses no STL, or BOOST, or any other fluffy nonsense.
Now, here is the header file and C++ for the version that I wrote. I hope you will agree it is a lot easier to understand and use. As always, feedback welcome.
kmeans.h
kmeans.cpp

