Hi,

This is an RFC to add an opt-in percpu array-based caching layer to SLUB.
The name "sheaf" was invented by Matthew so we don't call it "magazine"
like the original Bonwick paper does. The per-NUMA-node cache of sheaves
is thus called a "barn".

This may seem similar to the arrays in SLAB, but the main differences are:

- opt-in, not used for every cache

- does not distinguish NUMA locality, thus no "alien" arrays that would
  need periodic flushing

- improves kfree_rcu() handling

- API for obtaining a preallocated sheaf that can be used for guaranteed
  and efficient allocations in a restricted context, when the upper bound
  is known but rarely reached

The motivation comes mainly from the ongoing work related to VMA
scalability and the related maple tree operations. This is why the maple
tree node and vma caches are sheaf-enabled in this RFC. Performance
benefits were measured by Suren in preliminary non-public versions.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, instead of a local double
  cmpxchg, with Patch 5 it's just preempt_disable() and no atomic
  operations. The same goes for freeing, which is normally a local
  double cmpxchg only for short-term allocations (i.e. the same slab is
  still active on the same cpu when the object is freed) and a more
  costly locked double cmpxchg otherwise. The downside is the lack of
  NUMA locality guarantees for the allocated objects. I hope this scheme
  will also allow (non-guaranteed) slab allocations in contexts where
  that is impossible today and is instead achieved by building caches on
  top of slab, e.g. the BPF allocator.

- kfree_rcu() batching. kfree_rcu() will put objects into a separate
  percpu sheaf and only submit the whole sheaf to call_rcu() when it is
  full. After the grace period, the sheaf can be reused for allocations,
  which is more efficient than handling individual slab objects (even
  with the batching done by the kfree_rcu() implementation itself).
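To illustrate the fast-path idea described above, here is a minimal
userspace sketch of an array cache with pop/push and a slow-path
fallback. This is only a toy model: in the actual patches the array is
percpu, the fast path runs under preempt_disable() (or the cheaper flag
scheme of Patch 5), and all names below are hypothetical, not the
series' API.

```c
/* Toy userspace model of a "sheaf": a small array cache of objects.
 * Hypothetical names; not the API added by this series. In the kernel
 * the sheaf is percpu, so pop/push need no atomics, only
 * preempt_disable(). */
#include <assert.h>
#include <stdlib.h>

#define SHEAF_CAPACITY 32

struct sheaf {
	unsigned int size;		/* number of cached objects */
	void *objects[SHEAF_CAPACITY];
};

/* Allocation fast path: pop a cached object; fall back to malloc(). */
static void *sheaf_alloc(struct sheaf *s, size_t obj_size)
{
	if (s->size > 0)
		return s->objects[--s->size];	/* no cmpxchg needed */
	return malloc(obj_size);		/* slow path */
}

/* Free fast path: push into the sheaf; fall back to free() when full. */
static void sheaf_free(struct sheaf *s, void *obj)
{
	if (s->size < SHEAF_CAPACITY)
		s->objects[s->size++] = obj;
	else
		free(obj);			/* slow path */
}
```

A short-lived allocation thus becomes an array push followed by an array
pop, with no locked operations on the fast path.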
In case only some cpus are allowed to handle rcu callbacks, the sheaf
can still be made available to other cpus on the same node via the
shared barn. Both the maple_node and vma caches can benefit from this.

- Preallocation support. A prefilled sheaf can be borrowed for a
  short-term operation that is not allowed to block and may need to
  allocate some objects. If an upper bound (worst case) for the number
  of allocations is known, but typically far fewer allocations are
  actually needed, borrowing and returning a sheaf is much more
  efficient than a bulk allocation for the worst case followed by a bulk
  free of the many unused objects. Maple tree write operations should
  benefit from this.

Patch 1 implements the basic sheaf functionality, using
local_lock_irqsave() for percpu sheaf locking. Patch 2 adds the
kfree_rcu() support. Patches 3 and 4 enable sheaves for maple tree nodes
and vma's. Patch 5 replaces the local_lock_irqsave() locking with a
cheaper scheme inspired by online conversations with Mateusz Guzik and
Jann Horn.

In the past I have tried to copy the scheme from the page allocator's
pcplists, which also avoids disabling irqs by using a trylock for
operations that might be attempted from an irq handler context. But the
spin locks used for pcplists are more costly than a simple flag with
only compiler barriers. On the other hand, the lock then cannot be taken
from a different cpu (except for hotplug handling, where the actual
local cpu cannot race with us), but we don't need such remote locking
for sheaves.

Patch 6 implements borrowing of a prefilled sheaf, with maple tree being
the anticipated user once converted to use it by someone more
knowledgeable than myself.

(RFC) LIMITATIONS:

- with slub_debug enabled, objects in sheaves are considered allocated,
  so allocation/free stacktraces may become imprecise and checking of
  e.g. redzone violations may be delayed

- kfree_rcu() via sheaf is only hooked to tree rcu, not tiny rcu.
  Also, in case we fail to allocate a sheaf and fall back to the
  existing implementation, it may use kfree_bulk(), where destructors
  are not hooked. It's however possible that we won't need the
  destructor support at all for now, if vma_lock is moved to the vma
  itself [1] and if it's possible to free anon_name and the numa
  balancing tracking immediately and not after a grace period.

- in case a prefilled sheaf is requested with more objects than the
  cache's sheaf_capacity, it will fail. This should be possible to
  handle by allocating a bigger sheaf and then freeing it when returned,
  to avoid mixing up different sizes. Inefficient, but acceptable if
  very rare.

[1] https://lore.kernel.org/all/20241111205506.3404479-1-surenb@xxxxxxxxxx/

Vlastimil

git branch: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v1r5

---
Vlastimil Babka (6):
      mm/slub: add opt-in caching layer of percpu sheaves
      mm/slub: add sheaf support for batching kfree_rcu() operations
      maple_tree: use percpu sheaves for maple_node_cache
      mm, vma: use sheaves for vm_area_struct cache
      mm, slub: cheaper locking for percpu sheaves
      mm, slub: sheaf prefilling for guaranteed allocations

 include/linux/slab.h |   60 +++
 kernel/fork.c        |   27 +-
 kernel/rcu/tree.c    |    8 +-
 lib/maple_tree.c     |   11 +-
 mm/slab.h            |   27 +
 mm/slab_common.c     |    8 +-
 mm/slub.c            | 1427 ++++++++++++++++++++++++++++++++++++++++++++++++--
 7 files changed, 1503 insertions(+), 65 deletions(-)
---
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623
change-id: 20231128-slub-percpu-caches-9441892011d7

Best regards,
--
Vlastimil Babka <vbabka@xxxxxxx>