John Ratcliff's Code Suppository

A place where I insert my code into the anus of the Internet.

 

Tuesday, September 29, 2009

Memory allocations are not 'bad' Debunking the myth



I recently had a conversation with an engineer at a game company where he expressed annoyance that his development team was using that damned STL (STLPort in particular). I asked why and he felt that way. His response was that he couldn't stand how much memory allocation that the STL was doing. He also expressed annoyance that the performance of STL map was so slow and his programmers kept using it.

This raises a couple of points. First of all, there is rarely a reason to use STL map. The STL map/set is a red-black tree. The only reason you would ever use it is if you needed to maintain a set of objects in perfectly sorted order. This use case doesn't come up a lot, though, and most people are using it just to do key/value lookups. However, if all you want is a key/value lookup you should instead be using the hash_map.

What I really want to address in this post is the myth that 'memory allocations' are 'bad'.

What does someone mean by 'bad'?

I think what the mean is as follows:

  • Memory allocations are slow
  • Memory allocations cause fragmentation
  • Memory allocations cause thread contention
  • Memory allocations have way too much overhead
  • Memory allocations are not cache-coherent
Developers often hold these assumptions, but are they really true?

What if none of these things were true? Then, do you really have a valid complaint about how many allocations the STL is doing, or your other container classes?

With the micro-allocator I presented in the previous post, none of these criticisms are really valid.

Myth #1 : Memory allocations are slow : With the micro-allocator I presented memory allocations, and frees, are extraordinarily fast. A memory allocation simply involves returning a free pointer from a linked list 99.9999% of the time.

One reason people sometimes think that memory allocators are slow is...because they are *when running in a debugger in a debug build*. It is important to measure real performance outside of the debugger and in a release build when considering the cost of your memory allocation system.

Myth #2 : Memory allocations cause fragmentation : The design of my micro-allocator is that you give it a budget which is equal to an estimate of the maximum amount of memory needed for micro-allocations by your application. This is allocated a single time and carved up in power of two multiples. The allocator works within this single fixed block of memory and, so long as you do not exceed the budget estimate, it will never range outside this block of memory. No fragmentation, in the sense of memory being returned from all over the address space, will occur.

Myth #3 : Memory allocations cause thread contention : Last year when I was working on my JobSwarm code snippet I spent some time making sure it worked with a lockless queue. Once I had this working I tried to create a benchmark that would show the performance benefit of a lockless queue versus a standard mutex lock. What I found, is that it was very difficult to create a scenario where any significant performance distinction could be measured. It turns out, that unless you have an application which is spending almost all of its time in contention, you can't even quantify a benefit. Add to this the fact that my micro-allocator is so fast, usually just taking a few instruction cycles, any contention from a thread lock is going to be very small unless the only thing your application does *is* memory allocation in multiple threads simultaneously.

All that said, ideally a micro-allocator should entirely avoid or minimize thread contention. There are commercial micro-allocators that do this. The most well known is The Hoard Memory Allocator. Hoard is specifically designed to be highly efficient in multi-threaded environments. Also, as we move to more and more cores and, therefore more and more simultaneous threads, minimizing thread contention is important.

For my current implementation of the micro-allocator I just threw a mutex lock around the allocation and free routines. This was a quick and lazy approach to the problem though, as I said before, I doubt it is even an issue in most scenarios. That said, I would like to optimize it to be lockless, or near lockless, in the future. (It appears that I can download an evaluation copy of Hoard so sometime in the next week or two I will try to run benchmarks between it and my micro-allocator and I will report the results.)

Myth #4 : Memory allocations have way too much overhead : Most memory allocators will stick a header block in front of every allocation. This header block will often include linked list pointers and possibly a sentinel in a debug build.

The micro-allocator I presented, however, does not introduce any additional overhead. If you allocate 8 bytes, it takes exactly 8 bytes of memory. If you allocate 16 bytes, it takes exactly 16 bytes, etc. Now, if you allocate 14 bytes, it will, in fact, return a block of memory that is actually 16 bytes in length (the nearest power of two) thus 'wasting' the 2 bytes in between.

Perhaps the micro-allocator could be considered as having 'too much overhead' in the sense that it is a power of two allocator and always rounds up to the nearest power of two boundary. However, since it works primarily in a fixed budget of memory for small allocations, this is probably acceptable.

Myth #5 : Memory allocations are not cache-coherent : My micro-allocator is a power of two allocator and, especially since it requires no header block, always returns cache aligned blocks of memory. Any allocations of small objects in a sequence will naturally end up sequentially in the same area of memory.

All allocations 16 bytes to 256 bytes in size will be perfectly aligned on a page boundary. 8 byte allocations will lie on an 8 byte boundary or, every other one, 16 bytes.

Conclusion

The point I am trying to make here is that the concerns people have about using the STL, or other containers which do per-object allocations, are largely irrelevant *if* you have a high-speed micro-allocator under the hood.

There is no need to make special containers and special small object allocators if you have confidence that your core allocator already has these behaviors built right in.

Over the course of the next few months I plan to be doing some optimizations on console platforms using the micro-allocator I wrote and will be reporting any performance numbers I come up with.

So far I have only tested it extensively on the PC and it really doesn't show any performance difference between standard malloc/free which, as I pointed out earlier, is surprisingly fast anyway.

The lesson to be learned is that you shouldn't complain about the 'cost' of lots of tiny memory allocations until you know for fact that your application is really feeling a legitimate cost. If you find that there is a significant cost by using the standard library malloc/free, don't resort to writing a bunch of custom allocators and template code until you have evaluated a high performance general purpose small object allocator like Hoard, Doug Lea's malloc, or my own micro allocator I have posted here.

Sunday, September 06, 2009

A free open source micro allocator in C++




My code snippet today is a free open source micro-allocator. As many of you know, using the STL (or containers in general) will cause many,many, tiny allocations to be performed. Even just in a standard game engine, you will constantly be doing lots of small allocations for objects.

Now, some people work around this by writing custom pool allocators. However, that is not really necessary. If you have an efficient general purpose micro-allocator that is 'order-N' you can bank on the fact that it is always available. Don't worry about how much a memory allocation costs if it is less than or equal to 256 bytes in size. Don't worry about it not being cache aligned or cache-coherent. That will happen automatically.

The way my micro allocator works is that you give it an initial estimate of memory usage for micro-allocations. In a large game project, it might be good to give it a megabyte or two. In fact, it is ideal if you give it the maximum amount of memory your application will ever use for micro-allocations. You can find this out by running statistics on your application. If you give it too conservative an estimate, then the performance of the micro-allocator will decrease significantly as it will begin allocating a lot of additional chunks and will need to search them for frees.

If you are sure your application tends to use say 10 megabytes for micro-allocations, then give it that much to work with initially.

The amount of memory reserve is equal to the default chunk size, times six. As there are six powers of two from 8 to 256.

It then creates six fixed pool allocators for all of the power of two multiples from 8 to 256 bytes. It will do only one original heap allocation for this initial buffer and divvy up the memory usage equally between the six buffer sizes. This initial heap is how it can quickly know whether or not an address is owned by the micro-allocator and in which fixed pool it belongs in.

If you exceed this initial size, that is fine, the pools will simply grow based on your initial chunk size passed in.

For the vast majority of the time an allocation simply returns the current free pointer. A free simply patches the free list pointer as well.

This system is similar to the LOKI memory allocator but it does not use the STL or any other external libraries. It is also thread safe.

Here is the source code. Feedback welcome.

MicroAllocator.h
MicroAllocator.cpp

You can also get the source from here on Google Code.

Here are some hard statistics about the micro allocator. First of all, I need to say that I am very impressed with the speed of the default allocator built into Microsoft Visual Studio's run-time. It is very fast indeed.

I created a stress test which adds ten million small objects to an STL list, iterates that list, and then deconstructs the list. The way the STL implements lists each object causes a memory allocation to occur.

In release build (outside of the debugger) here are the results. (It is important to note that memory allocations (either debug or release) are much slower if the Visual Studio debugger is present. This is using STLPort. The first number is using malloc/free the second number is using the micro allocator.

Construction: 579 milliseconds to 422 milliseconds.
Iteration: 44 milliseconds to 44 milliseconds
Deconstruction 431 milliseconds to 154 milliseconds

As you can see allocation time was about 30% faster. Iteration was the same. Deconstruction time was 300% faster!

Here is another piece of information showing the difference between the STL provided with Microsoft Visual Studio and STLPort. The version of STLPort which is being tested has exception handling and thread safety disabled.

Construction STD:: 676 milliseconds to STLPort 567 milliseconds
Iteration 62 milliseconds to STLPort 44 milliseconds
Deconstruction STD 451 milliseconds to STLPort 402 milliseconds

As you can see STLPort is just slightly faster than the STL which comes with Microsoft Visual Studio 2005. However, the difference could simply be accounted for by the fact that thread safety and exception handling were disabled on this build of STLPort.