On Sun, Feb 23, 2025 at 5:36 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
> > <kent.overstreet@xxxxxxxxx> wrote:
> > >
> > > On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > > > - Cheaper fast paths. For allocations, instead of a local double cmpxchg,
> > > >   after Patch 5 it's preempt_disable() and no atomic operations. Same for
> > > >   freeing, which is normally a local double cmpxchg only for short-term
> > > >   allocations (so the same slab is still active on the same cpu when
> > > >   freeing the object) and a more costly locked double cmpxchg otherwise.
> > > >   The downside is the lack of NUMA locality guarantees for the allocated
> > > >   objects.
> > >
> > > Is that really cheaper than a local non-locked double cmpxchg?
> >
> > I don't know about this particular part, but testing sheaves with the maple
> > node cache and stress testing mmap/munmap syscalls shows performance
> > benefits as long as there is some delay to let kfree_rcu() do its job.
> > I'm still gathering results and will most likely post them tomorrow.
>
> Here are the promised test results:
>
> First I ran an Android app cycle test comparing the baseline against sheaves
> used for maple tree nodes (as this patchset implements). I registered about
> a 3% improvement in app launch times, indicating an improvement in mmap
> syscall performance.
> Next I ran an mmap stress test which maps 5 1-page readable file-backed
> areas, faults them in, and finally unmaps them, timing the mmap syscalls.

I forgot to mention that I also added a 500us delay after each cycle
described above to give kfree_rcu() a chance to run. A simplified sketch of
the test loop is at the end of this email.

> It repeats that for 200000 cycles and reports the total time. The average of
> 10 such runs is used as the final result.
> 3 configurations were tested:
>
> 1. Sheaves used for maple tree nodes only (this patchset).
>
> 2. Sheaves used for maple tree nodes with the vm_lock to vm_refcnt
>    conversion [1]. That patchset avoids allocating an additional vm_lock
>    structure on each mmap syscall and uses TYPESAFE_BY_RCU for the
>    vm_area_struct cache.
>
> 3. Sheaves used for maple tree nodes and for the vm_area_struct cache, with
>    the vm_lock to vm_refcnt conversion [1]. For the vm_area_struct cache I
>    had to replace TYPESAFE_BY_RCU with sheaves, as we can't use both for
>    the same cache.
>
> The values represent the total time it took to perform the mmap syscalls;
> less is better.
>
> (1)           baseline    control
> Little core   7.58327     6.614939 (-12.77%)
> Medium core   2.125315    1.428702 (-32.78%)
> Big core      0.514673    0.422948 (-17.82%)
>
> (2)           baseline    control
> Little core   7.58327     5.141478 (-32.20%)
> Medium core   2.125315    0.427692 (-79.88%)
> Big core      0.514673    0.046642 (-90.94%)
>
> (3)           baseline    control
> Little core   7.58327     4.779624 (-36.97%)
> Medium core   2.125315    0.450368 (-78.81%)
> Big core      0.514673    0.037776 (-92.66%)
>
> Results in (3) vs (2) indicate that using sheaves for vm_area_struct yields
> slightly better averages. I noticed this was mostly because the sheaves
> results lacked the occasional spikes that worsened the TYPESAFE_BY_RCU
> averages (the results seemed more stable with sheaves).
>
> [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@xxxxxxxxxx/
>
> > >
> > > Especially if you now have to use pushf/popf...
> > >
> > > > - kfree_rcu() batching and recycling. kfree_rcu() will put objects into
> > > >   a separate percpu sheaf and only submit the whole sheaf to call_rcu()
> > > >   when full. After the grace period, the sheaf can be used for
> > > >   allocations, which is more efficient than freeing and reallocating
> > > >   individual slab objects (even with the batching done by the kfree_rcu()
> > > >   implementation itself). In case only some cpus are allowed to handle
> > > >   rcu callbacks, the sheaf can still be made available to other cpus on
> > > >   the same node via the shared barn. The maple_node cache uses
> > > >   kfree_rcu() and thus can benefit from this.
> > >
> > > Have you looked at fs/bcachefs/rcu_pending.c?
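
Since the methodology matters for interpreting the numbers above, here is a
simplified sketch of one cycle of the stress test I described (not the actual
test source; the backing file setup and the timing plumbing are assumptions
added for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NR_AREAS	5		/* 5 one-page readable file-backed areas */
#define NR_CYCLES	200000

static double ts_delta(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	double mmap_time = 0;
	int fd;

	/* One page-sized backing file for the read-only mappings. */
	fd = open("/tmp/mmap_stress_file", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, page))
		return 1;

	for (int cycle = 0; cycle < NR_CYCLES; cycle++) {
		void *areas[NR_AREAS];
		struct timespec t1, t2;

		for (int i = 0; i < NR_AREAS; i++) {
			/* Only the mmap() calls themselves are timed. */
			clock_gettime(CLOCK_MONOTONIC, &t1);
			areas[i] = mmap(NULL, page, PROT_READ, MAP_PRIVATE,
					fd, 0);
			clock_gettime(CLOCK_MONOTONIC, &t2);
			if (areas[i] == MAP_FAILED)
				return 1;
			mmap_time += ts_delta(&t1, &t2);

			/* Fault the page in with a read access. */
			(void)*(volatile char *)areas[i];
		}

		for (int i = 0; i < NR_AREAS; i++)
			munmap(areas[i], page);

		/* Delay after each cycle to give kfree_rcu() a chance to run. */
		usleep(500);
	}

	printf("total mmap time: %f sec\n", mmap_time);
	close(fd);
	return 0;
}

The usleep(500) at the end of each cycle is the delay mentioned above; without
some delay letting kfree_rcu() do its job, the benefit from sheaves shrinks.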