Memory allocations are not 'bad' Debunking the myth

I recently had a conversation with an engineer at a game company where he expressed annoyance that his development team was using that damned STL (STLPort in particular). I asked why and he felt that way. His response was that he couldn't stand how much memory allocation that the STL was doing. He also expressed annoyance that the performance of STL map was so slow and his programmers kept using it.
This raises a couple of points. First of all, there is rarely a reason to use STL map. The STL map/set is a red-black tree. The only reason you would ever use it is if you needed to maintain a set of objects in perfectly sorted order. This use case doesn't come up a lot, though, and most people are using it just to do key/value lookups. However, if all you want is a key/value lookup you should instead be using the hash_map.
What I really want to address in this post is the myth that 'memory allocations' are 'bad'.
What does someone mean by 'bad'?
I think what the mean is as follows:
- Memory allocations are slow
- Memory allocations cause fragmentation
- Memory allocations cause thread contention
- Memory allocations have way too much overhead
- Memory allocations are not cache-coherent
What if none of these things were true? Then, do you really have a valid complaint about how many allocations the STL is doing, or your other container classes?
With the micro-allocator I presented in the previous post, none of these criticisms are really valid.
Myth #1 : Memory allocations are slow : With the micro-allocator I presented memory allocations, and frees, are extraordinarily fast. A memory allocation simply involves returning a free pointer from a linked list 99.9999% of the time.
One reason people sometimes think that memory allocators are slow is...because they are *when running in a debugger in a debug build*. It is important to measure real performance outside of the debugger and in a release build when considering the cost of your memory allocation system.
Myth #2 : Memory allocations cause fragmentation : The design of my micro-allocator is that you give it a budget which is equal to an estimate of the maximum amount of memory needed for micro-allocations by your application. This is allocated a single time and carved up in power of two multiples. The allocator works within this single fixed block of memory and, so long as you do not exceed the budget estimate, it will never range outside this block of memory. No fragmentation, in the sense of memory being returned from all over the address space, will occur.
Myth #3 : Memory allocations cause thread contention : Last year when I was working on my JobSwarm code snippet I spent some time making sure it worked with a lockless queue. Once I had this working I tried to create a benchmark that would show the performance benefit of a lockless queue versus a standard mutex lock. What I found, is that it was very difficult to create a scenario where any significant performance distinction could be measured. It turns out, that unless you have an application which is spending almost all of its time in contention, you can't even quantify a benefit. Add to this the fact that my micro-allocator is so fast, usually just taking a few instruction cycles, any contention from a thread lock is going to be very small unless the only thing your application does *is* memory allocation in multiple threads simultaneously.
All that said, ideally a micro-allocator should entirely avoid or minimize thread contention. There are commercial micro-allocators that do this. The most well known is The Hoard Memory Allocator. Hoard is specifically designed to be highly efficient in multi-threaded environments. Also, as we move to more and more cores and, therefore more and more simultaneous threads, minimizing thread contention is important.
For my current implementation of the micro-allocator I just threw a mutex lock around the allocation and free routines. This was a quick and lazy approach to the problem though, as I said before, I doubt it is even an issue in most scenarios. That said, I would like to optimize it to be lockless, or near lockless, in the future. (It appears that I can download an evaluation copy of Hoard so sometime in the next week or two I will try to run benchmarks between it and my micro-allocator and I will report the results.)
Myth #4 : Memory allocations have way too much overhead : Most memory allocators will stick a header block in front of every allocation. This header block will often include linked list pointers and possibly a sentinel in a debug build.
The micro-allocator I presented, however, does not introduce any additional overhead. If you allocate 8 bytes, it takes exactly 8 bytes of memory. If you allocate 16 bytes, it takes exactly 16 bytes, etc. Now, if you allocate 14 bytes, it will, in fact, return a block of memory that is actually 16 bytes in length (the nearest power of two) thus 'wasting' the 2 bytes in between.
Perhaps the micro-allocator could be considered as having 'too much overhead' in the sense that it is a power of two allocator and always rounds up to the nearest power of two boundary. However, since it works primarily in a fixed budget of memory for small allocations, this is probably acceptable.
Myth #5 : Memory allocations are not cache-coherent : My micro-allocator is a power of two allocator and, especially since it requires no header block, always returns cache aligned blocks of memory. Any allocations of small objects in a sequence will naturally end up sequentially in the same area of memory.
All allocations 16 bytes to 256 bytes in size will be perfectly aligned on a page boundary. 8 byte allocations will lie on an 8 byte boundary or, every other one, 16 bytes.
Conclusion
The point I am trying to make here is that the concerns people have about using the STL, or other containers which do per-object allocations, are largely irrelevant *if* you have a high-speed micro-allocator under the hood.
There is no need to make special containers and special small object allocators if you have confidence that your core allocator already has these behaviors built right in.
Over the course of the next few months I plan to be doing some optimizations on console platforms using the micro-allocator I wrote and will be reporting any performance numbers I come up with.
So far I have only tested it extensively on the PC and it really doesn't show any performance difference between standard malloc/free which, as I pointed out earlier, is surprisingly fast anyway.
The lesson to be learned is that you shouldn't complain about the 'cost' of lots of tiny memory allocations until you know for fact that your application is really feeling a legitimate cost. If you find that there is a significant cost by using the standard library malloc/free, don't resort to writing a bunch of custom allocators and template code until you have evaluated a high performance general purpose small object allocator like Hoard, Doug Lea's malloc, or my own micro allocator I have posted here.


