On Sun, Aug 11, 2024 at 9:29 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Mon, Aug 12, 2024 at 12:50 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > Ok, disabling adjacent cacheline prefetching seems to do the trick (or
> > at least cuts down the regression drastically):
> >
> > Hmean     faults/cpu-1    470577.6434 (   0.00%)   470745.2649 *   0.04%*
> > Hmean     faults/cpu-4    445862.9701 (   0.00%)   445572.2252 *  -0.07%*
> > Hmean     faults/cpu-7    422516.4002 (   0.00%)   422677.5591 *   0.04%*
> > Hmean     faults/cpu-12   344483.7047 (   0.00%)   330476.7911 *  -4.07%*
> > Hmean     faults/cpu-21   192836.0188 (   0.00%)   195266.8071 *   1.26%*
> > Hmean     faults/cpu-30   140745.9472 (   0.00%)   140655.0459 *  -0.06%*
> > Hmean     faults/cpu-48   110507.4310 (   0.00%)   103802.1839 *  -6.07%*
> > Hmean     faults/cpu-56    93507.7919 (   0.00%)    95105.1875 *   1.71%*
> > Hmean     faults/sec-1    470232.3887 (   0.00%)   470404.6525 *   0.04%*
> > Hmean     faults/sec-4   1757368.9266 (   0.00%)  1752852.8697 *  -0.26%*
> > Hmean     faults/sec-7   2909554.8150 (   0.00%)  2915885.8739 *   0.22%*
> > Hmean     faults/sec-12  4033840.8719 (   0.00%)  3845165.3277 *  -4.68%*
> > Hmean     faults/sec-21  3845857.7079 (   0.00%)  3890316.8799 *   1.16%*
> > Hmean     faults/sec-30  3838607.4530 (   0.00%)  3838861.8142 *   0.01%*
> > Hmean     faults/sec-48  4882118.9701 (   0.00%)  4608985.0530 *  -5.59%*
> > Hmean     faults/sec-56  4933535.7567 (   0.00%)  5004208.3329 *   1.43%*
> >
> > Now, how do we disable prefetching extra cachelines for vm_area_structs only?
>
> I'm unaware of any mechanism of the sort.
>
> The good news is that Broadwell is an old yeller and if memory serves
> right the impact is not anywhere near this bad on newer
> microarchitectures, making "merely" 64 alignment (used all over in the
> kernel for amd64) a practical choice (not just for vma).

That's indeed good news if other archs are not that sensitive to this.

>
> Also note that in your setup you are losing out on performance in
> other multithreaded cases, unrelated to anything vma.
>
> That aside as I mentioned earlier the dedicated vma lock cache results
> in false sharing between separate vmas, except this particular
> benchmark does not test for it (which in your setup should be visible
> even if the cache grows the SLAB_HWCACHE_ALIGN flag).

When implementing VMA locks I did experiment with SLAB_HWCACHE_ALIGN
for the vm_lock cache using different benchmarks and didn't see
improvements above noise level. Do you know of some specific benchmark
that would possibly show an improvement?
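(For concreteness, the change is essentially just adding the flag where
the vm_lock cache is created -- a rough sketch assuming the KMEM_CACHE()
setup in kernel/fork.c, not necessarily the exact upstream lines:)

#ifdef CONFIG_PER_VMA_LOCK
	/* Align each vm_lock to a cache line so locks of adjacent vmas
	 * cannot false-share a line. */
	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC | SLAB_ACCOUNT |
					       SLAB_HWCACHE_ALIGN);
#endif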
>
> I think the thing to do here is to bench on other cpus and ignore the
> Broadwell + adjacent cache line prefetcher result if they come back
> fine -- the code should not be held hostage by an old yeller.

That sounds like a good idea. Mel Gorman first reported this regression
when I was developing VMA locks and I believe he has a farm of different
machines to run mmtests on. CC'ing Mel.
Mel, would you be able to run PFT tests with the patch at
https://lore.kernel.org/all/20240808185949.1094891-1-mjguzik@xxxxxxxxx/
vs baseline on your farm? The goal is to see if any architecture other
than Broadwell shows a performance regression.

>
> To that end I think it would be best to ask the LKP folks at Intel.
> They are very approachable so there should be no problem arranging it
> provided they have some spare capacity. I believe grabbing the From
> person and the cc list from this thread will do it:
> https://lore.kernel.org/oe-lkp/ZriCbCPF6I0JnbKi@xsang-OptiPlex-9020/ .
> By default they would run their own suite, which presumably has some
> overlap with this particular benchmark in terms of generated workload
> (but I don't think they run *this* particular benchmark itself,
> perhaps it would make sense to ask them to add it?). It's your call
> here.

Thanks for the suggestion. Let's see if Mel can use his farm first and
then we will ask the Intel folks.

>
> If there are still problems and the lock needs to remain separate, the
> bare minimum damage-controlling measure would be to hwalign the vma
> lock cache -- it won't affect the pts benchmark, but it should help
> others.

Sure, but I'll need to measure the improvement, and for that I need a
benchmark or a workload. Any suggestions?

>
> Should the decision be to bring the lock back into the struct, I'll
> note my patch is merely slapped together to a state where it can be
> benchmarked and I have no interest in beating it into a committable
> shape. You stated you already had an equivalent (modulo keeping
> something in a space previously occupied by the pointer to the vma
> lock), so as far as I'm concerned you can submit that with your
> authorship.

Thanks! If we end up doing that, I'll keep you as Suggested-by and will
add a link to this thread.
Thanks,
Suren.

> --
> Mateusz Guzik <mjguzik gmail.com>