Hi!

While looking at what is missing to make librseq a generally usable project supporting per-cpu data structures in user-space, I noticed that what we lack is a per-cpu memory allocator conceptually similar to what the Linux kernel provides internally [1]. The per-CPU memory allocator is analogous to TLS memory: where TLS provides Thread-Local Storage, the per-CPU memory allocator provides CPU-Local Storage. My goal is to improve locality and remove the need to waste precious cache lines on padding when indexing per-cpu data as an array of items.

So we decided to go ahead and implement a per-cpu allocator for user-space in the librseq project [2,3], with the following characteristics:

* Allocations are performed from memory pools (mempools). Allocations are
  power-of-2, fixed-size, configured at pool creation.

* Memory pools can be added to a pool set to allow allocation of
  variable-size records.

* Allocating an "item" from a memory pool allocates memory for all CPUs.

* The "stride" used to index per-cpu data is user-configurable. Indexing
  per-cpu data from an allocated pointer is as simple as:

      (uintptr_t) ptr + (cpu * stride)

  where the multiplication is actually a shift, because the stride is a
  power-of-2 constant.

* Pools consist of a linked list of "ranges" (a stride worth of item
  allocations), making a pool extensible when it runs out of space, up to
  a user-configurable limit.

* Freeing a pointer only requires the pointer to free as input (and the
  pool's stride constant). Finding the range and pool associated with the
  pointer is done by applying a mask to the pointer. The memory mappings
  of the ranges are aligned so that this mask finds the range base, which
  allows accessing the range structure placed in a header page immediately
  before it.
One interesting problem we faced is how to prevent wasting memory on useless page allocations on a system with many configured CPUs where only a few are actually used by the application, due to a combination of CPU affinity, cpusets, and CPU hotplug. Minimizing page allocation while still offering the ability to allocate zeroed (or pre-initialized) items is the crux of this issue. We came up with two approaches based on copy-on-write (COW) to tackle this, which we call the "pool populate policy":

* RSEQ_MEMPOOL_POPULATE_COW_INIT (default): rely on copy-on-write (COW)
  of per-cpu pages to populate them from the initial values pages on
  first write. The COW_INIT approach maps an extra "initial values"
  stride with each pool range as MAP_SHARED from a memfd. All per-cpu
  strides map these initial values as MAP_PRIVATE, so the first write
  access from an active CPU triggers a COW page allocation. The downside
  of this scheme is that its use of MAP_SHARED is incompatible with
  using the pool from child processes after fork, and its use of COW is
  incompatible with shared-memory use-cases.

* RSEQ_MEMPOOL_POPULATE_COW_ZERO: rely on copy-on-write (COW) of per-cpu
  pages to populate them from the zero page on first write. As long as
  the user only allocates items with malloc, zmalloc, or malloc_init
  with zeroed content, this does not trigger COW of all per-cpu pages,
  leaving the zero page in place until an active CPU writes to its
  per-cpu item. The COW_ZERO approach maps the per-cpu strides as
  private anonymous memory, and therefore only triggers COW page
  allocation when a CPU writes over those zero pages. As a downside,
  this scheme triggers COW page allocation for all possible CPUs when
  zmalloc_init() is used to populate non-zeroed initial values for an
  item. Its upsides are that it can be used across fork and could
  eventually be used over shared memory.
Other noteworthy features: this mempool allocator can also be used as a global allocator. It has an optional "robust" attribute which enables checks for memory corruption and double-free. Users with more custom use-cases can register an "init" callback to be called after each new range/CPU is allocated.

Feedback is welcome!

Thanks,

Mathieu

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/percpu.h
[2] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/include/rseq/mempool.h
[3] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/src/rseq-mempool.c

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com