John Ratcliff's Code Suppository

A place where I insert my code into the anus of the Internet.

 

Thursday, December 02, 2010

K-Means Clustering Algorithm

I haven't uploaded a new code snippet in quite a while, not because I haven't been coding, I've just been coding in a certain unnamed behemoth game engine that doesn't lend itself to writing anything in 'snippet' form.
However, I am working on a new project for a short period of time and yesterday I needed to solve a problem that, apparently, lots of people have solved before.   I have a large set of data points (in this case representing objects at different scales) and I needed to reduce it to a smaller subset of the (N) number of most significant sizes.

This is a common problem, and many developers have seen it when trying to reduce a color-space.  (For example reducing an image with say 16k colors to only 256 colors used to be a common thing people needed to do).

One of the more elegant solutions to this problem is by building an octree; at least one version of this algorithm is called 'Wu Quantization'.  This was popularly presented in Graphics Gems.
What I needed yesterday was something very quick and dirty and, that turned out to be the K-means clustering algorithm.  You can read about it here on Wikipedia and here is another link comparing it to a different algorithm.

The algorithm itself is very straightforward.

Let's say you have 100 input data points, but you want to reduce them to say 10 clusters where the center of each cluster represents the mean of the surrounding points.

Step #1 : You seed the 10 clusters with the 10 sample points taken from the input data.

Step #2 : You set an temp array of counts to zero and temp points to zero for each cluster you are building.

Step #3 : You iterate through each of the input data points.  For each point you figure out which of the current clusters it is closest to.  You then add that point to the new set of centroids and increment a counter.

Step #4 : You accumulate a total distance error of each point from its nearest cluster.

Step #5 : You compute the average center position for each of the centroids and assign them as your new clusters.

Step #6 : You compute the difference between the distance error on this pass and the distance error on the previous pass.  If the difference in the error is less than a tolerance passed in, you are done.  If the error is still converging, then you go back to step #2 and continue the loop.

My implementation also added the ability to prune the results.  Pruning is needed if all of the input points are at the same location, or there are fewer input points than clusters requested.

Another reason to prune the output is in the following example.  Let's say you have 100 data points that you want to put into 10 clusters.  However, all of those data points are right next to each other.

The standard K-means algorithm will still produce 10 clusters, since it is basically unit-less.  However, this implementation allows you to specify a tolerance threshold that says if two clusters are practically next to each other, simply merge them.

The main reason I am posting this code snippet is because of the absolutely abhorrent code snippet I found on the internet.

Click on this link to see the C implementation version that I found on the Internet.  Clear as mud right?

kmeans.c

Oh yeah, I would like to also add this.  There is a link to a C++ implementation on the Wikipedia page.  I downloaded that link and it uses not only the STL but BOOST as well!

Yikes.  Mine has a vanilla C style function interface, accepts pointers to 1d, 2d, 3d, or 4d arrays of data points as either a const float pointer or const double pointer and uses no STL, or BOOST, or any other fluffy nonsense.


Now, here is the header file and C++ for the version that I wrote.  I hope you will agree it is a lot easier to understand and use.  As always, feedback welcome.


kmeans.h
kmeans.cpp

6 Comments:

  • At 3:37 PM, Anonymous Anonymous said…

    You should check out OpenCV. It has kmeans plus much much more.

     
  • At 9:49 PM, Blogger Shawn Presser said…

    Amazing John, thank you. All of your snippets are.

    I do not have anything of substance to say, but I can't bear the thought of the only other comment being some snide remark about "derp derp OpenCV already does this".

    I was going to write a rant to that person. I was going to point out that to even *try* OpenCV, you have to follow their gargantuan install guide ( http://opencv.willowgarage.com/wiki/InstallGuide ). Just look at all of those prerequisites! Do people not understand that the code snippets on Codesuppository can be used *anywhere*, by *anyone*, in *any C++ project*? Maybe they don't realize that you can't "Just use Boost". Apparently they prefer to "work" by spending all of their time on:

    - researching massive frameworks which happen to solve their problem of the day

    - then installing that framework

    - then configuring that framework

    - then locating the particular part of that framework that claims to solve the problem

    - then research exactly how to attempt to use their interface to solve the problem

    - then checking whether it's even optimized enough to be useful in a game engine, and if it's not, you're screwed and just wasted the entire day. Oh wait, "game engine"? Nobody understands you need to use the code in a *GAME ENGINE*. I'm convinced that if games weren't so massively popular then game developers would be shunned by the C++ community as lepers / pariahs.

    Anyway, I'm really glad to see that you're still making snippets. I check here pretty much daily and was excited to see this update. I wanted to thank you for your small-block memory allocator, but you haven't been on AIM in the last couple months. I successfully integrated it into Heroes of Newerth, and it shaved ~5-10sec off of loading time which was great! (Run time performance wasn't noticeably affected, which seems like a good thing.) And since I went through the pain of trapping every allocation, I decided to use your fantastic memory tracking tool as well. It's really cool that it generates an HTML memory report. https://dl.dropbox.com/u/315/s2/memreports/11-02-10/1/memreport_replay_5v5.html

    So yes, if it's any consolation, at least one other person does in fact find Codesuppository incredibly valuable. I use your snippets all the time; in fact I just added your Regex snippet. They save me a lot of time, so I have no idea why most people's comments sound so negative on here. Most other people who post on this blog can be summed up as "Haters gonna hate". Don't let 'em get you down, they don't know any better.

    How have you been? I hope things are going well on your new project. Good luck to you, and post more snippets please!

     
  • At 10:20 PM, Blogger John said…

    Thanks Shawn. Yeah, if someone doesn't understand the value of having code in 'snippet' form I'm not going to be able to explain it to them.

    And, nothing really 'gets me down'. I use my own snippets all the time, I couldn't survive without them, if someone else finds them useful that's just bonus.

    I just recently did something cool with my micro-allocator in conjunction with my memory allocate; maybe I'll post that too.

    I also have received some really positive feedback on my texture packing code that was quite gratifying.

    Really, I'm just about the only guy other than Sean Patrick Barrett posting code in 'snippet' form.

    I dont think enough people realize how powerful it is. For example even though I use tons of matrix math libraries in my daily work, I still use 'floatmath' as a reference all the time.

    Thanks,

    John

     
  • At 8:21 PM, Anonymous Anonymous said…

    I certainly didn't mean to cause any offence. I use John's code snippets an awful lot as well and am an avid reader of this blog. Sorry if I have upset anyone. It is nice to see John has such good and protective friends/readers.

    From the way I interpreted John's description of the reasoning for the existence of the snippet it sounded like John felt a good library should exist that does all this.

    Since a computer vision library isn't the obvious place to look for clustering algorithms I just thought I would point that out. I don't think most game developers are familiar with OpenCV.

    Shawn: OpenCV has been a one-click install for me (on windows, I use it on OSX and embedded platforms as well!), if you want to do the full debug build you will need to use CMake, which is a pretty standard build tool. I certainly agree that researching massive frameworks is a big pain, and thats what I was trying to solve with this post. I know John always wants code that is easy to use, optimized and cross-platform, which I believe OpenCV is.

    Next time I'll write a bit of a longer post so that it doesn't get misinterpreted!

    Keep up the good work John!

     
  • At 12:36 AM, Blogger Shawn Presser said…

    No worries. Looking back on it, I apologize for the disrespectful tone of my post. It is surprising and delightful that you have followed up; most people don't take the time, because they don't care. So thank you for your post.

    I agree that small libraries are a good thing. Frameworks can be a good thing too --- for example the Qt framework. It's unfortunate that most frameworks are written in such a way where you can't really grab one specific component to accomplish one specific task, without also bringing in all the other components. But that's not to say they don't have any value... just that they can be very cumbersome sometimes. Boost, for example.

    You know, Unicode doesn't seem like a very effective communication tool. :) A lot of context and emotions tends to get lost or misinterpreted in e.g. emails or comments.

    Anonymous, I hope to be able to learn from you in the future. Good luck to you.

     
  • At 2:11 AM, Anonymous stephan said…

    thanks for this!

    i'm playing around with (open)kinect at the time and this is really useful for depth image segmentation.

     

Post a Comment

<< Home