Better Allocator API

Introduction

The memory allocator interface that is available in system-programming languages like Ada, C and C++ allows the programmer to obtain new memory block of the given size and to deallocate a previously allocated block.

The API of the allocator is known to programmers as a pair of operations:

Allocate/Deallocate procedures from the System.Storage_Pools standard package in Ada,
malloc/free functions from the standard C library,
operator new/operator delete operators from the C++ standard library.

In each of these languages, the pair of operations is functionally complete in that it allows to provide dynamic storage for both individual objects and for more complex data structures. In the case of C and C++ there are additional related operations, but they do not extend the basic functionality beyond the fundamental needs of operating in terms of single memory blocks.

This "unit-oriented" nature of the allocator API has the advantage of being conceptually simple and easy to work with, but has some consequences that lead to less than optimal results in more elaborated scenarios, especially in the multithreading environment.

Buzzword: multicore systems

The "multicore" buzzword is a bit stretched here, but the problems are really related to multithreading.

The first problem is a result of the excessive and defensive synchronization within the allocator that is due to the fact that the allocator has no higher-level notion of its unit-oriented requests.

Every allocation request that is received by the allocator has to be treated in isolation, because there is no way to express the relation between different requests. This "impedance mismatch" is mostly visible when some more complex data structure is created by the program (a list, tree or anything else) - even though the program "knows" that it is going to allocate a number of memory blocks to build some structure, there is no way to pass that knowledge to the allocator and all individual blocks are allocated with separate requests. As a result, all concurrency-related efforts and protections (of whatever kind) that are build into the allocator have to be applied multiple times for what is logically a single high-level operation.

Another problem related to multithreading is more subtle and is due to so-called false sharing (explaining it in a comprehensive way is out of scope of this article, so readers who are not familiar with this phenomenon can refer to Wikipedia or other resource).

The usage patterns that lead to false sharing can appear naturally in the multithreaded environment with dynamic memory when two threads use memory blocks that were allocated one after another and happen to share (or are even completely enclosed in) the range of memory addresses that lie in the same unit of caching. This is easily achieved if the allocator allocates blocks consistently from the same side of its free pool. What happens as a consequence of such allocation is that two threads, which might operate on logically distinct objects, in fact conflict with each other and mutually reduce their performance potential.

A possible brute-force solution to the false sharing of dynamically allocated blocks, applied at the allocator level, would be to allocate blocks that are already aligned to the size of cache line on the target system, but the required padding would be very wasteful especially with very small blocks that are often requested for linked data structures like lists or trees. What about pre-allocating whole sequences of blocks for future use within a single data structure? Interestingly, this problem is again related to the fact that allocation requests have to be isolated and could be easily solved with the allocator API that is more conscious of application high-level goals.

Possible solution: request grouping

Both excessive synchronization effort and false sharing can be addressed with request grouping, where the application can express the intent to allocate not a single memory block, but a whole sequence of them, not necessarily of the same size, but preferably closely together.

A possible API extension with regard to the existing operations would involve just a single new operation that accepts an array of requests (requested block sizes) and returns an array of addresses of allocated blocks. A complementary deallocation operation can be added as well.

The added value of such API extension can be described in the following terms:

Group allocation operation can benefit from the reduction of internal synchronization effort, since all blocks can be allocated as a single atomic action. This benefit can be particularly important when a big data structure is being built and the application has some prior knowledge of the expected number of data elements (for example nodes in the list). Another benefit from grouped allocation is that containers can pre-allocate their nodes closely together, which can help avoiding the false sharing and generally improve cache performance by virtue of better locality of reference.
Group deallocation can benefit from the reduction of internal synchronization similarly to allocation, although it has no impact on other issues.

Another problem that can be solved with such an allocator API, although independent on the multithreading issues, is that of atomic failure reporting. This problem is visible when the application builds complex data structure and hits the out of memory condition somewhere during that process (at worst, close to the end of this process). What usually happens with the traditional single-block approach is that the application rolls back the whole structure building process by tearing down everything that was done so far, which is clearly a wasted effort. With the extended allocator API it is possible for the application - provided of course that the application can predict up front the number and sizes of allocated blocks - to be notified at the very beginning that it will not be possible to allocate all of the requested blocks and therefore to avoid the wasted investment in building incomplete data structure.

Interestingly, the extended allocator API can be defensively and trivially implemented in terms of existing operations, by simple looping - even though it would not provide the performance benefits described above, such an implementation is functionally equivalent and can be a safe small step towards final solution, where the single-block operations are just special cases of the more general group variants.