[摘抄]Memory Allocation/Deallocation Bottleneck?（内存分配/释放瓶颈）

本篇文章摘抄于——http://stackoverflow.com/questions/470683/memory-allocation-deallocation-bottleneck，主要讲了C/C++堆内存分配和释放导致性能问题的原因，并给出了基本的解决方案。

C/C++堆内存分配和释放，即通过malloc/free或new/delete操作内存，主要会引起两个问题：一是内存碎片；二是堆内存操作需多线程同步，加锁（OS级别）。

============================================================================

Q:How much of a bottleneck is memory allocation/deallocation in typical real-world programs? Answers from any type of program where performance typically matters are welcome. Are decent implementations of malloc/free/garbage collection fast enough that it's only a bottleneck in a few corner cases, or would most performance-critical software benefit significantly from trying to keep the amount of memory allocations down or having a faster malloc/free/garbage collection implementation?

Note: I'm not talking about real-time stuff here. By performace-critical, I mean stuff where throughput matters, but latency doesn't necessarily.

Edit: Although I mention malloc, this question is not intended to be C/C++ specific.

A1:It's significant, especially as fragmentation grows and the allocator has to hunt harder across larger heaps for the contiguous regions you request. Most performance-sensitive applications typically write their own fixed-size block allocators (eg, they ask the OS for memory 16MB at a time and then parcel it out in fixed blocks of 4kb, 16kb, etc) to avoid this issue.

In games I've seen calls to malloc()/free() consume as much as 15% of the CPU (in poorly written products), or with carefully written and optimized block allocators, as little as 5%. Given that a game has to have a consistent throughput of sixty hertz, having it stall for 500ms while a garbage collector runs occasionally isn't practical.

A2:Nearly every high performance application now has to use threads to exploit parallel computation. This is where the real memory allocation speed killer comes in when writing C/C++ applications.

In a C or C++ application, malloc/new must take a lock on the global heap for every operation. Even without contention locks are far from free and should be avoided as much as possible.

Java and C# are better at this because threading was designed in from the start and the memory allocators work from per-thread pools. This can be done in C/C++ as well, but it isn't automatic.