Re: [PATCH RFC v2 00/10] SLUB percpu sheaves

Vlastimil Babka <vbabka@xxxxxxx> · Mon, 17 Mar 2025 12:08:55 +0100

On 3/14/25 18:10, Suren Baghdasaryan wrote:
> On Tue, Mar 4, 2025 at 11:08 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>>
>> * Vlastimil Babka <vbabka@xxxxxxx> [250304 05:55]:
>> > On 2/25/25 21:26, Suren Baghdasaryan wrote:
>> > > On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>> > >>
>> > >> >
>> > >> > > The values represent the total time it took to perform mmap syscalls, less is
>> > >> > > better.
>> > >> > >
>> > >> > > (1)            baseline    control
>> > >> > > Little core    7.58327     6.614939 (-12.77%)
>> > >> > > Medium core    2.125315    1.428702 (-32.78%)
>> > >> > > Big core       0.514673    0.422948 (-17.82%)
>> > >> > >
>> > >> > > (2)            baseline    control
>> > >> > > Little core    7.58327     5.141478 (-32.20%)
>> > >> > > Medium core    2.125315    0.427692 (-79.88%)
>> > >> > > Big core       0.514673    0.046642 (-90.94%)
>> > >> > >
>> > >> > > (3)            baseline    control
>> > >> > > Little core    7.58327     4.779624 (-36.97%)
>> > >> > > Medium core    2.125315    0.450368 (-78.81%)
>> > >> > > Big core       0.514673    0.037776 (-92.66%)
>> > >
>> > > (4)            baseline    control
>> > > Little core    7.58327     4.642977 (-38.77%)
>> > > Medium core    2.125315    0.373692 (-82.42%)
>> > > Big core       0.514673    0.043613 (-91.53%)
>> > >
>> > > I think the difference between (3) and (4) is noise.
>> > > Thanks,
>> > > Suren.
>> >
>> > Hi, as we discussed yesterday, it would be useful to set the baseline to
>> > include everything before sheaves as that's already on the way to 6.15, so
>> > we can see more clearly what sheaves do relative to that. So at this point
>> > it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
>> > thus like in scenario (4)), and benchmark the following:
>> >
>> > - baseline - vma locking conversion with TYPESAFE_BY_RCU
>> > - baseline+maple tree node reduction from mm-unstable (Liam might point out
>> > which patches?)
>>
>> Sid's patches [1] are already in mm-unstable.
>>
>>
>> > - the above + this series + sheaves enabled for vm_area_struct cache
>> > - the above + full maple node sheaves conversion [1]
>> > - the above + the top-most patches from [1] that are optimizations with a
>> > tradeoff (not clear win-win) so it would be good to know if they are useful
>> >
>> > [1] currently the 4 commits here:
>> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
>> > from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
>> > but as Liam noted, they won't cherry pick without conflict once maple tree
>> > node reduction is backported, but he's working on a rebase
>>
>> Rebased maple tree sheaves, patches are here [2].
> 
> Hi Folks,
> Sorry for the delay. I got the numbers last week but they looked a bit
> weird, so I reran the test increasing the number of iterations to make
> sure noise is not a factor. That took most of this week. Below are the
> results. Please note that I had to backport the patchsets to 6.12
> because that's the closest stable Android kernel I can use. I measure
> cumulative time to execute mmap syscalls, so the smaller the number
> the better mmap performance is:

Is that a particular benchmark doing those syscalls, or do you time them
within actual workloads?
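
To make the "particular benchmark" option concrete, here is a minimal
userspace sketch of what "cumulative time to execute mmap syscalls" could
mean. It is purely illustrative: the thread does not say how the numbers
were collected, so the function name, the iteration count and the mapping
size are all assumptions, and interpreter overhead makes the absolute
values meaningless for comparing kernels.

```python
import mmap
import time


def cumulative_mmap_time_ns(iterations: int, length: int = mmap.PAGESIZE) -> int:
    """Sum the wall-clock time spent creating anonymous mappings.

    Hypothetical helper, not the benchmark used in this thread.
    Only the mapping call is timed; the unmap happens outside the
    timed region.
    """
    total_ns = 0
    for _ in range(iterations):
        start = time.perf_counter_ns()
        mapping = mmap.mmap(-1, length)  # anonymous mapping, no backing file
        total_ns += time.perf_counter_ns() - start
        mapping.close()  # unmap, deliberately not timed
    return total_ns


if __name__ == "__main__":
    # Iteration count is an arbitrary choice for the sketch.
    print(f"cumulative mmap time: {cumulative_mmap_time_ns(10_000)} ns")
```

Timing the same calls inside a real workload would instead need tracing
on the syscall path, which this sketch does not attempt.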

> baseline: 6.12 + vm_lock conversion and TYPESAFE_BY_RCU
> config1: baseline + Sid's patches [1]
> config2: sheaves RFC
> config3: config1 + vm_area_struct with sheaves
> config4: config2 + maple_tree Sheaf conversion [2]
> config5: config3 + 2 last optimization patches from [3]
> 
>                config1     config2     config3     config4     config5
> Little core    -0.10%      -10.10%     -12.89%     -10.02%     -13.64%
> Mid core       -21.05%     -37.31%     -44.97%     -15.81%     -22.15%
> Big core       -17.17%     -34.41%     -45.68%     -11.39%     -15.29%

Thanks a lot, Suren.
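
For anyone cross-checking the tables, the percentages in this thread read
as a signed change relative to the baseline, (measured - baseline) /
baseline, so a negative value means less time spent in mmap. A small
sketch that reproduces the scenario (4) column quoted earlier in the
thread; the function name is made up for the example, and the timings are
copied from that table:

```python
def relative_change_percent(baseline: float, measured: float) -> float:
    """Signed change of `measured` relative to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (measured - baseline) / baseline * 100.0


# (baseline, control) timings copied from scenario (4) quoted in this thread.
scenario4 = {
    "Little core": (7.58327, 4.642977),
    "Medium core": (2.125315, 0.373692),
    "Big core": (0.514673, 0.043613),
}

for core, (baseline, control) in scenario4.items():
    change = relative_change_percent(baseline, control)
    print(f"{core:<12} {change:+.2f}%")
```

Note that the config1..config5 table only gives percentages, so the
absolute timings behind it cannot be recomputed this way.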

> [1] https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@xxxxxxxxxx/
> [2] https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304
> [3] https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
> 
> From the numbers, it looks like config4 regresses the performance and
> that's what looked weird to me last week and I wanted to confirm this.
> But from sheaves POV, it looks like they provide the benefits I saw
> before. Sid's patches which I did not test separately before also look
> beneficial.

Indeed, good job, Sid. It's weird that config4 isn't doing well. The problem
could be either on the sheaves side (the sheaf preallocation isn't effective)
or on the maple tree side (doing some excessive work). It could be caused by
the wrong condition in kmem_cache_return_sheaf() that Harry pointed out, so v3
might improve things if that was it. Otherwise we'll probably need to fill the
gaps in the sheaf-related stats and see what the differences between config3
and config4 are.

> Thanks,
> Suren.
> 
>>
>>
>> >
>> >
>> ...
>>
>> Thanks,
>> Liam
>>
>> [1]. https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@xxxxxxxxxx/
>> [2]. https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304