On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
> <kent.overstreet@xxxxxxxxx> wrote:
> >
> > On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> > >   after Patch 5 it's preempt_disable() and no atomic operations. Same for
> > >   freeing, which is normally a local double cmpxchg only for a short
> > >   term allocations (so the same slab is still active on the same cpu when
> > >   freeing the object) and a more costly locked double cmpxchg otherwise.
> > >   The downside is the lack of NUMA locality guarantees for the allocated
> > >   objects.
> >
> > Is that really cheaper than a local non locked double cmpxchg?
>
> Don't know about this particular part but testing sheaves with maple
> node cache and stress testing mmap/munmap syscalls shows performance
> benefits as long as there is some delay to let kfree_rcu() do its job.
> I'm still gathering results and will most likely post them tomorrow.

Here are the promised test results:

First I ran an Android app cycle test comparing the baseline against
sheaves used for maple tree nodes (as this patchset implements). I
registered about 3% improvement in app launch times, indicating
improvement in mmap syscall performance.

Next I ran an mmap stress test which maps 5 1-page readable file-backed
areas, faults them in and finally unmaps them, timing mmap syscalls.
Repeats that 200000 cycles and reports the total time. Average of 10
such runs is used as the final result.

3 configurations were tested:

1. Sheaves used for maple tree nodes only (this patchset).

2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt
   conversion [1]. This patchset avoids allocating additional vm_lock
   structure on each mmap syscall and uses TYPESAFE_BY_RCU for
   vm_area_struct cache.

3. Sheaves used for maple tree nodes and for vm_area_struct cache with
   vm_lock to vm_refcnt conversion [1]. For the vm_area_struct cache I
   had to replace TYPESAFE_BY_RCU with sheaves, as we can't use both
   for the same cache.

The values represent the total time it took to perform mmap syscalls,
less is better.

(1)             baseline    control
Little core     7.58327     6.614939 (-12.77%)
Medium core     2.125315    1.428702 (-32.78%)
Big core        0.514673    0.422948 (-17.82%)

(2)             baseline    control
Little core     7.58327     5.141478 (-32.20%)
Medium core     2.125315    0.427692 (-79.88%)
Big core        0.514673    0.046642 (-90.94%)

(3)             baseline    control
Little core     7.58327     4.779624 (-36.97%)
Medium core     2.125315    0.450368 (-78.81%)
Big core        0.514673    0.037776 (-92.66%)

Results in (3) vs (2) indicate that using sheaves for vm_area_struct
yields slightly better averages and I noticed that this was mostly due
to sheaves results missing occasional spikes that worsened
TYPESAFE_BY_RCU averages (the results seemed more stable with sheaves).

[1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@xxxxxxxxxx/

>
> >
> > Especially if you now have to use pushf/popf...
> >
> > > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> > >   separate percpu sheaf and only submit the whole sheaf to call_rcu()
> > >   when full. After the grace period, the sheaf can be used for
> > >   allocations, which is more efficient than freeing and reallocating
> > >   individual slab objects (even with the batching done by kfree_rcu()
> > >   implementation itself). In case only some cpus are allowed to handle rcu
> > >   callbacks, the sheaf can still be made available to other cpus on the
> > >   same node via the shared barn. The maple_node cache uses kfree_rcu() and
> > >   thus can benefit from this.
> >
> > Have you looked at fs/bcachefs/rcu_pending.c?
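
For readers who want to try something similar, the mmap stress test
described above could be sketched in userspace roughly as follows. This
is not the test program used for the numbers in this mail. The function
name mmap_stress(), the use of MAP_PRIVATE, mapping offset 0 of a single
file for all five areas, and the clock_gettime() based timing are all
assumptions made for illustration. The file passed in must be non-empty,
otherwise touching the mapping would raise SIGBUS.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NR_AREAS 5  /* areas mapped per cycle, as in the description */

/* Monotonic clock reading in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper: for each cycle, map NR_AREAS one-page read-only
 * file-backed areas, fault each one in, then unmap them all. Only the
 * time spent inside mmap() is accumulated.
 *
 * Returns the total mmap() time in seconds, or -1.0 on error.
 */
static double mmap_stress(const char *path, long cycles)
{
    long page = sysconf(_SC_PAGESIZE);
    void *area[NR_AREAS];
    long long total_ns = 0;
    int failed = 0;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;

    for (long c = 0; c < cycles && !failed; c++) {
        int mapped = 0;

        for (int i = 0; i < NR_AREAS; i++) {
            long long t0 = now_ns();

            area[i] = mmap(NULL, page, PROT_READ, MAP_PRIVATE, fd, 0);
            total_ns += now_ns() - t0;
            if (area[i] == MAP_FAILED) {
                failed = 1;
                break;
            }
            mapped++;
        }

        /* Touch one byte per area so the page is actually faulted in. */
        for (int i = 0; i < mapped; i++)
            (void)*(volatile char *)area[i];

        for (int i = 0; i < mapped; i++)
            munmap(area[i], page);
    }

    close(fd);
    return failed ? -1.0 : (double)total_ns / 1e9;
}
```

Calling mmap_stress(path, 200000) ten times and averaging the returned
values would mirror the procedure described above. Timing in userspace
includes syscall entry/exit overhead, so absolute numbers from such a
sketch are not comparable to the tables in this mail.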