Memory allocations are not 'bad' Debunking the myth

I recently had a conversation with an engineer at a game company where he expressed annoyance that his development team was using that damned STL (STLPort in particular). I asked why and he felt that way. His response was that he couldn't stand how much memory allocation that the STL was doing. He also expressed annoyance that the performance of STL map was so slow and his programmers kept using it.
This raises a couple of points. First of all, there is rarely a reason to use STL map. The STL map/set is a red-black tree. The only reason you would ever use it is if you needed to maintain a set of objects in perfectly sorted order. This use case doesn't come up a lot, though, and most people are using it just to do key/value lookups. However, if all you want is a key/value lookup you should instead be using the hash_map.
What I really want to address in this post is the myth that 'memory allocations' are 'bad'.
What does someone mean by 'bad'?
I think what the mean is as follows:
- Memory allocations are slow
- Memory allocations cause fragmentation
- Memory allocations cause thread contention
- Memory allocations have way too much overhead
- Memory allocations are not cache-coherent
What if none of these things were true? Then, do you really have a valid complaint about how many allocations the STL is doing, or your other container classes?
With the micro-allocator I presented in the previous post, none of these criticisms are really valid.
Myth #1 : Memory allocations are slow : With the micro-allocator I presented memory allocations, and frees, are extraordinarily fast. A memory allocation simply involves returning a free pointer from a linked list 99.9999% of the time.
One reason people sometimes think that memory allocators are slow is...because they are *when running in a debugger in a debug build*. It is important to measure real performance outside of the debugger and in a release build when considering the cost of your memory allocation system.
Myth #2 : Memory allocations cause fragmentation : The design of my micro-allocator is that you give it a budget which is equal to an estimate of the maximum amount of memory needed for micro-allocations by your application. This is allocated a single time and carved up in power of two multiples. The allocator works within this single fixed block of memory and, so long as you do not exceed the budget estimate, it will never range outside this block of memory. No fragmentation, in the sense of memory being returned from all over the address space, will occur.
Myth #3 : Memory allocations cause thread contention : Last year when I was working on my JobSwarm code snippet I spent some time making sure it worked with a lockless queue. Once I had this working I tried to create a benchmark that would show the performance benefit of a lockless queue versus a standard mutex lock. What I found, is that it was very difficult to create a scenario where any significant performance distinction could be measured. It turns out, that unless you have an application which is spending almost all of its time in contention, you can't even quantify a benefit. Add to this the fact that my micro-allocator is so fast, usually just taking a few instruction cycles, any contention from a thread lock is going to be very small unless the only thing your application does *is* memory allocation in multiple threads simultaneously.
All that said, ideally a micro-allocator should entirely avoid or minimize thread contention. There are commercial micro-allocators that do this. The most well known is The Hoard Memory Allocator. Hoard is specifically designed to be highly efficient in multi-threaded environments. Also, as we move to more and more cores and, therefore more and more simultaneous threads, minimizing thread contention is important.
For my current implementation of the micro-allocator I just threw a mutex lock around the allocation and free routines. This was a quick and lazy approach to the problem though, as I said before, I doubt it is even an issue in most scenarios. That said, I would like to optimize it to be lockless, or near lockless, in the future. (It appears that I can download an evaluation copy of Hoard so sometime in the next week or two I will try to run benchmarks between it and my micro-allocator and I will report the results.)
Myth #4 : Memory allocations have way too much overhead : Most memory allocators will stick a header block in front of every allocation. This header block will often include linked list pointers and possibly a sentinel in a debug build.
The micro-allocator I presented, however, does not introduce any additional overhead. If you allocate 8 bytes, it takes exactly 8 bytes of memory. If you allocate 16 bytes, it takes exactly 16 bytes, etc. Now, if you allocate 14 bytes, it will, in fact, return a block of memory that is actually 16 bytes in length (the nearest power of two) thus 'wasting' the 2 bytes in between.
Perhaps the micro-allocator could be considered as having 'too much overhead' in the sense that it is a power of two allocator and always rounds up to the nearest power of two boundary. However, since it works primarily in a fixed budget of memory for small allocations, this is probably acceptable.
Myth #5 : Memory allocations are not cache-coherent : My micro-allocator is a power of two allocator and, especially since it requires no header block, always returns cache aligned blocks of memory. Any allocations of small objects in a sequence will naturally end up sequentially in the same area of memory.
All allocations 16 bytes to 256 bytes in size will be perfectly aligned on a page boundary. 8 byte allocations will lie on an 8 byte boundary or, every other one, 16 bytes.
Conclusion
The point I am trying to make here is that the concerns people have about using the STL, or other containers which do per-object allocations, are largely irrelevant *if* you have a high-speed micro-allocator under the hood.
There is no need to make special containers and special small object allocators if you have confidence that your core allocator already has these behaviors built right in.
Over the course of the next few months I plan to be doing some optimizations on console platforms using the micro-allocator I wrote and will be reporting any performance numbers I come up with.
So far I have only tested it extensively on the PC and it really doesn't show any performance difference between standard malloc/free which, as I pointed out earlier, is surprisingly fast anyway.
The lesson to be learned is that you shouldn't complain about the 'cost' of lots of tiny memory allocations until you know for fact that your application is really feeling a legitimate cost. If you find that there is a significant cost by using the standard library malloc/free, don't resort to writing a bunch of custom allocators and template code until you have evaluated a high performance general purpose small object allocator like Hoard, Doug Lea's malloc, or my own micro allocator I have posted here.


5 Comments:
At 11:55 AM,
cbloom said…
Amen, brother.
There are reasons to not like the STL or memory allocation, but they are much more complex than the simple complaints that most people have.
Furthermore, the fact is that "allocation" of some kind is *inherent* to most programming problems. By not using the STL you are not "avoiding allocation" you are simply using a more simplistic scheme such as fixed static allocation, which is may or may not be a good choice (and is independent of whether or not you use STL containers).
At 7:25 PM,
fury said…
I think it depends on the platform and experience. I've been in situations where Memory allocation has been the bottleneck on a PS2. Essentially we removed all allocations during the actual real-time part of the game and suddenly out performance issues were solved.
Does that mean STL is bad? No. Particularly on the PC. Also I've seen many C alerantives to STL which have actually been slower (although the programmer is under the false impression of it being better).
Ways to improve allocation.
1) Use allocators (these are great because now you have more control then C over allocations).
2) If allocation speed is your problem then switch the global allocator to NedMalloc and allocate 10x faster. If fragmentation is a problem then find a good low fragmentation global allocator.
At 9:32 AM,
sourcerernl said…
I think your definition of fragmentation is not the same as the one I am used to. It's not whether or not an allocation goes allover the memory range, it's two seperate allocations block a third from happening because the space is technically there but not consecutive. Of course a fixed size allocator wouldn't have this problem, but a free sized allocator can't really prevent this without making pointers relocatable which is inherently expensive. I can't really consider this myth debunked by your article. Secondly, the myth isn't 'memory allocations are bad', but 'dynamic memory allocations are bad'. Using a pool with preallocated memory there's no allocation going, just pointer assignment.
At 10:31 PM,
Manatee of Mystery said…
But the issue is how do make a lib allocate in the way you want, especially if it is in some separate DLL or has some other issue, with different objects? You can do this in STL but what a pain. Much better to just use something that does what you want. The improper use of map is a great example. If they rolled their own data structures this would never occur as they don't need a map. All the extra nonsense in STL is very detrimental. Huge blank spots of missing things and yet an amazing amount of silly things most people should avoid like cancer that embrace the worst features of C++ wholeheartedly.
At 12:06 AM,
Anonymous said…
If something like changing memory in another dll is a problem where memory hooking can be very useful. Libs like nedmalloc allow you to globally or selectively replace the allocators. This gives a lot more breath for optimization then simply writing your own containers. In many cases you can't beat the speed of something like netmalloc (are you going to make all your allocations use 8 threads at the sametime to find the free section?)
Not that you can get different sorts of memory allocation benefits from paying attention to the sorts of containers you use.
From the point of view of fragmentation I've see a lot of people who like to preallocate memory and use the free-list pattern and when its all said and done the whole program runs slower.
The free-list approach is not a bad technique however as often is the case its sacrificing memory for cpu performance. However if you apply it across the entire program you get a downgrade because of fragmentation (programs work best with tight memory configurations) and increased memory usage.
Its best to identify the <.1% of code that is a performance bottleneck and optimize that (and make no mistake is is less then .1% in any decent program). Spend the rest of the time making your program maintainable so that you can quickly and easy make optimizations when you need to.
Post a Comment
<< Home