Re: [PATCH RFC v2 00/10] SLUB percpu sheaves

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Sun, 23 Feb 2025 17:36:27 -0800

On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
> <kent.overstreet@xxxxxxxxx> wrote:
> >
> > On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> > >   after Patch 5 it's preempt_disable() and no atomic operations. Same for
> > >   freeing, which is normally a local double cmpxchg only for a short
> > >   term allocations (so the same slab is still active on the same cpu when
> > >   freeing the object) and a more costly locked double cmpxchg otherwise.
> > >   The downside is the lack of NUMA locality guarantees for the allocated
> > >   objects.
> >
> > Is that really cheaper than a local non locked double cmpxchg?
>
> Don't know about this particular part but testing sheaves with maple
> node cache and stress testing mmap/munmap syscalls shows performance
> benefits as long as there is some delay to let kfree_rcu() do its job.
> I'm still gathering results and will most likely post them tomorrow.

Here are the promised test results:

First I ran an Android app cycle test comparing the baseline against sheaves
used for maple tree nodes (as this patchset implements). App launch times
improved by about 3%, indicating an improvement in mmap syscall performance.

Next I ran an mmap stress test which maps five 1-page readable file-backed
areas, faults them in and finally unmaps them, timing the mmap syscalls. It
repeats that for 200000 cycles and reports the total time. The average of 10
such runs is used as the final result.
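
For anyone who wants to try something similar, here is a rough userspace
sketch of a loop of this shape. It is not the actual test program: the
function name time_mmap_cycles(), the MAX_AREAS limit, the use of
CLOCK_MONOTONIC and mapping offset 0 of the same file for every area are all
assumptions made for illustration. The file must be at least one page long.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define MAX_AREAS 16	/* hypothetical upper bound for this sketch */

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Map 'areas' one-page read-only file-backed areas, fault them in and
 * unmap them, 'cycles' times. Only the mmap() calls are timed.
 * Returns the total time spent in mmap() in nanoseconds, or -1 on error.
 */
int64_t time_mmap_cycles(const char *path, long cycles, int areas)
{
	long page = sysconf(_SC_PAGESIZE);
	void *map[MAX_AREAS];
	int64_t total = 0;
	int fd;

	if (areas < 1 || areas > MAX_AREAS || page <= 0)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	for (long c = 0; c < cycles; c++) {
		int mapped = 0;

		for (int i = 0; i < areas; i++) {
			int64_t t0 = now_ns();

			/* Assumption: every area maps offset 0 of the file. */
			map[i] = mmap(NULL, (size_t)page, PROT_READ,
				      MAP_PRIVATE, fd, 0);
			total += now_ns() - t0;
			if (map[i] == MAP_FAILED)
				break;
			mapped++;
		}

		/* Fault each mapped page in with a single read. */
		for (int i = 0; i < mapped; i++)
			(void)*(volatile const char *)map[i];

		for (int i = 0; i < mapped; i++)
			munmap(map[i], (size_t)page);

		if (mapped != areas) {
			close(fd);
			return -1;
		}
	}

	close(fd);
	return total;
}
```

Calling time_mmap_cycles(path, 200000, 5) would correspond to one run as
described above; averaging over 10 runs is left to the caller.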
3 configurations were tested:

1. Sheaves used for maple tree nodes only (this patchset).

2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
The patchset in [1] avoids allocating an additional vm_lock structure on each
mmap syscall and uses TYPESAFE_BY_RCU for the vm_area_struct cache.

3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.

The values represent the total time it took to perform the mmap syscalls;
lower is better.

(1)            baseline    control
Little core    7.58327     6.614939 (-12.77%)
Medium core    2.125315    1.428702 (-32.78%)
Big core       0.514673    0.422948 (-17.82%)

(2)            baseline    control
Little core    7.58327     5.141478 (-32.20%)
Medium core    2.125315    0.427692 (-79.88%)
Big core       0.514673    0.046642 (-90.94%)

(3)            baseline    control
Little core    7.58327     4.779624 (-36.97%)
Medium core    2.125315    0.450368 (-78.81%)
Big core       0.514673    0.037776 (-92.66%)
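
The percentages in parentheses are the relative change of the second column
versus the baseline. A trivial helper to reproduce them (pct_change() is a
name made up for this illustration, not something from the test tooling):

```c
/* Relative change of a measured time versus the baseline, in percent. */
double pct_change(double baseline, double measured)
{
	return (measured - baseline) / baseline * 100.0;
}
```

For example, pct_change(7.58327, 6.614939) comes out at about -12.77, which
matches the Little core row in (1).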

Comparing (3) with (2), using sheaves for vm_area_struct yields slightly
better averages. I noticed that this was mostly because the sheaves results
lacked the occasional spikes that worsened the TYPESAFE_BY_RCU averages (the
results seemed more stable with sheaves).

[1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@xxxxxxxxxx/

>
> >
> > Especially if you now have to use pushf/popf...
> >
> > > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> > >   separate percpu sheaf and only submit the whole sheaf to call_rcu()
> > >   when full. After the grace period, the sheaf can be used for
> > >   allocations, which is more efficient than freeing and reallocating
> > >   individual slab objects (even with the batching done by kfree_rcu()
> > >   implementation itself). In case only some cpus are allowed to handle rcu
> > >   callbacks, the sheaf can still be made available to other cpus on the
> > >   same node via the shared barn. The maple_node cache uses kfree_rcu() and
> > >   thus can benefit from this.
> >
> > Have you looked at fs/bcachefs/rcu_pending.c?