On Mon, Aug 12, 2024 at 12:50 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> Ok, disabling adjacent cacheline prefetching seems to do the trick (or
> at least cuts down the regression drastically):
>
> Hmean     faults/cpu-1    470577.6434 (   0.00%)   470745.2649 *   0.04%*
> Hmean     faults/cpu-4    445862.9701 (   0.00%)   445572.2252 *  -0.07%*
> Hmean     faults/cpu-7    422516.4002 (   0.00%)   422677.5591 *   0.04%*
> Hmean     faults/cpu-12   344483.7047 (   0.00%)   330476.7911 *  -4.07%*
> Hmean     faults/cpu-21   192836.0188 (   0.00%)   195266.8071 *   1.26%*
> Hmean     faults/cpu-30   140745.9472 (   0.00%)   140655.0459 *  -0.06%*
> Hmean     faults/cpu-48   110507.4310 (   0.00%)   103802.1839 *  -6.07%*
> Hmean     faults/cpu-56    93507.7919 (   0.00%)    95105.1875 *   1.71%*
> Hmean     faults/sec-1    470232.3887 (   0.00%)   470404.6525 *   0.04%*
> Hmean     faults/sec-4   1757368.9266 (   0.00%)  1752852.8697 *  -0.26%*
> Hmean     faults/sec-7   2909554.8150 (   0.00%)  2915885.8739 *   0.22%*
> Hmean     faults/sec-12  4033840.8719 (   0.00%)  3845165.3277 *  -4.68%*
> Hmean     faults/sec-21  3845857.7079 (   0.00%)  3890316.8799 *   1.16%*
> Hmean     faults/sec-30  3838607.4530 (   0.00%)  3838861.8142 *   0.01%*
> Hmean     faults/sec-48  4882118.9701 (   0.00%)  4608985.0530 *  -5.59%*
> Hmean     faults/sec-56  4933535.7567 (   0.00%)  5004208.3329 *   1.43%*
>
> Now, how do we disable prefetching extra cachelines for vm_area_structs only?

I'm unaware of any mechanism of that sort.

The good news is that Broadwell is an old yeller and, if memory serves
right, the impact is nowhere near this bad on newer microarchitectures,
making "merely" 64-byte alignment (used all over the kernel on amd64) a
practical choice (and not just for the vma).

Also note that in your setup you are losing out on performance in other
multithreaded cases, unrelated to anything vma.

That aside, as I mentioned earlier, the dedicated vma lock cache results
in false sharing between separate vmas; it is just that this particular
benchmark does not test for it. In your setup that sharing should be
visible even if the cache grows the SLAB_HWCACHE_ALIGN flag.

I think the thing to do here is to bench on other CPUs and ignore the
Broadwell + adjacent cache line prefetcher result if they come back fine
-- the code should not be held hostage by an old yeller.

To that end I think it would be best to ask the LKP folks at Intel. They
are very approachable, so there should be no problem arranging it
provided they have some spare capacity. I believe grabbing the From
person and the cc list from this thread will do it:
https://lore.kernel.org/oe-lkp/ZriCbCPF6I0JnbKi@xsang-OptiPlex-9020/ .
By default they would run their own suite, which presumably has some
overlap with this particular benchmark in terms of generated workload
(but I don't think they run *this* particular benchmark itself; perhaps
it would make sense to ask them to add it?). It's your call here.

If there are still problems and the lock needs to remain separate, the
bare minimum damage-controlling measure would be to hwalign the vma lock
cache (a rough sketch of what I mean is at the end of this mail) -- it
won't affect the pts benchmark, but it should help others.

Should the decision be to bring the lock back into the struct, I'll note
my patch is merely slapped together to a state where it can be
benchmarked and I have no interest in beating it into a committable
shape. You stated you already have an equivalent (modulo keeping
something in the space previously occupied by the pointer to the vma
lock), so as far as I'm concerned you can submit that with your
authorship.

-- 
Mateusz Guzik <mjguzik gmail.com>
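
For illustration only, here is the hwalign sketch referenced above. It
assumes the dedicated lock cache is still created via KMEM_CACHE() in
kernel/fork.c as in the per-VMA lock series; treat the exact identifiers
as assumptions, not a committable patch:

	/*
	 * kernel/fork.c (sketch): give the dedicated vma lock cache
	 * hardware cacheline alignment so that locks belonging to
	 * unrelated vmas are not packed into the same cacheline.
	 *
	 * SLAB_HWCACHE_ALIGN rounds the object up to the cpu cacheline
	 * size, trading some memory for the absence of false sharing.
	 */
	#ifdef CONFIG_PER_VMA_LOCK
		vma_lock_cachep = KMEM_CACHE(vma_lock,
					     SLAB_PANIC|SLAB_ACCOUNT|
					     SLAB_HWCACHE_ALIGN);
	#endif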