On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>
> On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> >   after Patch 5 it's preempt_disable() and no atomic operations. Same for
> >   freeing, which is normally a local double cmpxchg only for a short
> >   term allocations (so the same slab is still active on the same cpu when
> >   freeing the object) and a more costly locked double cmpxchg otherwise.
> >   The downside is the lack of NUMA locality guarantees for the allocated
> >   objects.
>
> Is that really cheaper than a local non locked double cmpxchg?

Don't know about this particular part but testing sheaves with maple
node cache and stress testing mmap/munmap syscalls shows performance
benefits as long as there is some delay to let kfree_rcu() do its job.
I'm still gathering results and will most likely post them tomorrow.

>
> Especially if you now have to use pushf/popf...
>
> > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> >   separate percpu sheaf and only submit the whole sheaf to call_rcu()
> >   when full. After the grace period, the sheaf can be used for
> >   allocations, which is more efficient than freeing and reallocating
> >   individual slab objects (even with the batching done by kfree_rcu()
> >   implementation itself). In case only some cpus are allowed to handle rcu
> >   callbacks, the sheaf can still be made available to other cpus on the
> >   same node via the shared barn. The maple_node cache uses kfree_rcu() and
> >   thus can benefit from this.
>
> Have you looked at fs/bcachefs/rcu_pending.c?
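
For anyone following along, the kfree_rcu() batching and recycling idea
quoted above can be pictured with a toy userspace model. This is only a
sketch under loud assumptions: every name here (toy_sheaf, toy_cache,
toy_kfree_rcu(), toy_grace_period_elapsed(), toy_alloc(),
TOY_SHEAF_CAPACITY) is hypothetical and not taken from the patch series,
there is no per-cpu state or locking, and the grace period is simulated
by an explicit function call rather than call_rcu().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical capacity; kept tiny so the behaviour is easy to follow. */
#define TOY_SHEAF_CAPACITY 4

/* A sheaf in this toy model is just a fixed-size array of object pointers. */
struct toy_sheaf {
	size_t size;
	void *objects[TOY_SHEAF_CAPACITY];
};

struct toy_cache {
	size_t object_size;
	struct toy_sheaf rcu_free;	/* collects objects from toy_kfree_rcu() */
	struct toy_sheaf pending;	/* full sheaf waiting for the grace period */
	struct toy_sheaf ready;		/* recycled objects usable for allocation */
	size_t slow_allocs;		/* allocations that missed the sheaf */
	size_t submissions;		/* whole-sheaf submissions so far */
};

/* Stand-in for the RCU callback: recycle the pending sheaf's objects. */
static void toy_grace_period_elapsed(struct toy_cache *c)
{
	while (c->pending.size > 0) {
		void *obj = c->pending.objects[--c->pending.size];

		if (c->ready.size < TOY_SHEAF_CAPACITY)
			c->ready.objects[c->ready.size++] = obj;
		else
			free(obj);	/* no room: give the object back */
	}
}

/* Batch the object; submit the sheaf as a whole only once it is full. */
static void toy_kfree_rcu(struct toy_cache *c, void *obj)
{
	assert(c->rcu_free.size < TOY_SHEAF_CAPACITY);
	c->rcu_free.objects[c->rcu_free.size++] = obj;
	if (c->rcu_free.size < TOY_SHEAF_CAPACITY)
		return;

	/* Toy simplification: only one sheaf may be in flight at a time. */
	if (c->pending.size > 0)
		toy_grace_period_elapsed(c);

	c->pending = c->rcu_free;
	c->rcu_free.size = 0;
	c->submissions++;
}

/* Allocate from the recycled sheaf if possible, else take the slow path. */
static void *toy_alloc(struct toy_cache *c)
{
	if (c->ready.size > 0)
		return c->ready.objects[--c->ready.size];

	c->slow_allocs++;
	return malloc(c->object_size);
}
```

In this model, freeing TOY_SHEAF_CAPACITY objects produces one
submission instead of one per object, and allocations made after the
simulated grace period are served from the recycled sheaf without
touching the slow path. Allocations made before the grace period
elapses still miss, which is the toy analogue of needing some delay for
kfree_rcu() to do its job.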