On Sun, Aug 11, 2024 at 9:29 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Mon, Aug 12, 2024 at 12:50 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > Ok, disabling adjacent cacheline prefetching seems to do the trick (or
> > at least cuts down the regression drastically):
> >
> > Hmean     faults/cpu-1    470577.6434 (   0.00%)   470745.2649 *   0.04%*
> > Hmean     faults/cpu-4    445862.9701 (   0.00%)   445572.2252 *  -0.07%*
> > Hmean     faults/cpu-7    422516.4002 (   0.00%)   422677.5591 *   0.04%*
> > Hmean     faults/cpu-12   344483.7047 (   0.00%)   330476.7911 *  -4.07%*
> > Hmean     faults/cpu-21   192836.0188 (   0.00%)   195266.8071 *   1.26%*
> > Hmean     faults/cpu-30   140745.9472 (   0.00%)   140655.0459 *  -0.06%*
> > Hmean     faults/cpu-48   110507.4310 (   0.00%)   103802.1839 *  -6.07%*
> > Hmean     faults/cpu-56    93507.7919 (   0.00%)    95105.1875 *   1.71%*
> > Hmean     faults/sec-1    470232.3887 (   0.00%)   470404.6525 *   0.04%*
> > Hmean     faults/sec-4   1757368.9266 (   0.00%)  1752852.8697 *  -0.26%*
> > Hmean     faults/sec-7   2909554.8150 (   0.00%)  2915885.8739 *   0.22%*
> > Hmean     faults/sec-12  4033840.8719 (   0.00%)  3845165.3277 *  -4.68%*
> > Hmean     faults/sec-21  3845857.7079 (   0.00%)  3890316.8799 *   1.16%*
> > Hmean     faults/sec-30  3838607.4530 (   0.00%)  3838861.8142 *   0.01%*
> > Hmean     faults/sec-48  4882118.9701 (   0.00%)  4608985.0530 *  -5.59%*
> > Hmean     faults/sec-56  4933535.7567 (   0.00%)  5004208.3329 *   1.43%*
> >
> > Now, how do we disable prefetching extra cachelines for vm_area_structs only?
>
> I'm unaware of any mechanism of the sort.
>
> The good news is that Broadwell is an old yeller and if memory serves
> right the impact is not anywhere near this bad on newer
> microarchitectures, making "merely" 64 alignment (used all over in the
> kernel for amd64) a practical choice (not just for vma).

That's indeed good news if other archs are not that sensitive to this.

>
> Also note that in your setup you are losing out on performance in
> other multithreaded cases, unrelated to anything vma.
>
> That aside as I mentioned earlier the dedicated vma lock cache results
> in false sharing between separate vmas, except this particular
> benchmark does not test for it (which in your setup should be visible
> even if the cache grows the SLAB_HWCACHE_ALIGN flag).

When implementing VMA locks I did experiment with SLAB_HWCACHE_ALIGN
for the vm_lock cache using different benchmarks and didn't see
improvements above noise level. Do you know of some specific benchmark
that would possibly show an improvement?
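(For concreteness, the change is essentially just adding the flag where
the vm_lock cache is created -- a rough sketch assuming the KMEM_CACHE()
setup in kernel/fork.c, not necessarily the exact upstream lines:)

#ifdef CONFIG_PER_VMA_LOCK
	/* Align each vm_lock to a cache line so locks of adjacent vmas
	 * cannot false-share a line. */
	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC | SLAB_ACCOUNT |
					       SLAB_HWCACHE_ALIGN);
#endif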
>
> I think the thing to do here is to bench on other cpus and ignore the
> Broadwell + adjacent cache line prefetcher result if they come back
> fine -- the code should not be held hostage by an old yeller.

That sounds like a good idea. Mel Gorman first reported this regression
when I was developing VMA locks and I believe he has a farm of different
machines to run mmtests on. CC'ing Mel.
Mel, would you be able to run PFT tests with the patch at
https://lore.kernel.org/all/20240808185949.1094891-1-mjguzik@xxxxxxxxx/
vs baseline on your farm? The goal is to see if any architecture other
than Broadwell shows a performance regression.

>
> To that end I think it would be best to ask the LKP folks at Intel.
> They are very approachable so there should be no problem arranging it
> provided they have some spare capacity. I believe grabbing the From
> person and the cc list from this thread will do it:
> https://lore.kernel.org/oe-lkp/ZriCbCPF6I0JnbKi@xsang-OptiPlex-9020/ .
> By default they would run their own suite, which presumably has some
> overlap with this particular benchmark in terms of generated workload
> (but I don't think they run *this* particular benchmark itself,
> perhaps it would make sense to ask them to add it?). It's your call
> here.

Thanks for the suggestion. Let's see if Mel can use his farm first and
then we will ask the Intel folks.

>
> If there are still problems and the lock needs to remain separate, the
> bare minimum damage-controlling measure would be to hwalign the vma
> lock cache -- it won't affect the pts benchmark, but it should help
> others.

Sure, but I'll need to measure the improvement, and for that I need a
benchmark or a workload. Any suggestions?

>
> Should the decision be to bring the lock back into the struct, I'll
> note my patch is merely slapped together to a state where it can be
> benchmarked and I have no interest in beating it into a committable
> shape. You stated you already had an equivalent (modulo keeping
> something in a space previously occupied by the pointer to the vma
> lock), so as far as I'm concerned you can submit that with your
> authorship.

Thanks! If we end up doing that, I'll keep you as Suggested-by and will
add a link to this thread.
Thanks,
Suren.

> --
> Mateusz Guzik <mjguzik gmail.com>