On Mon, Dec 5, 2022 at 1:02 AM kernel test robot <yujie.liu@xxxxxxxxx> wrote:
>
> FYI, we noticed a -53.3% regression of will-it-scale.per_thread_ops due to commit:
>
> 5df397dec7c4 ("mm: delay page_remove_rmap() until after the TLB has been flushed")

Sadly, I think this may be at least partially expected.

The code fundamentally moves one "loop over pages" and splits it up
(with the TLB flush in between). That can't be great for locality, but
it's kind of fundamental to the fix - although some of it might be due
to the batch limit logic.

I wouldn't have expected it to actually show up in any real loads, but:

> in testcase: will-it-scale
> test: page_fault3

I assume that this test is doing a lot of mmap/munmap on dirty shared
memory regions (both because of the regression, and because of the
name of that test ;)

So this is hopefully an extreme case.

Now, it's likely that this particular case also triggers that

        /* No more batching if we have delayed rmaps pending */

which means that the loops in between the TLB flushes will be smaller,
since we don't batch up as many pages as we used to before we force a
TLB (and rmap) flush and free them.

If it's due to that batching issue, it may be fixable - I'll think
about this some more, but

> Details are as below:

The fix ends up meaning that we run that rmap removal code _after_ the
TLB flush, and it looks like this (probably combined with the batching
limit) then causes some nasty iTLB load issues:

>    2291312 ± 2%   +1452.8%   35580378 ± 4%   perf-stat.i.iTLB-loads

although it also does look like it's at least partly due to some irq
timing issue (and/or bad NUMA/CPU migration luck):

>     388169          +267.4%    1426305 ± 6%   vmstat.system.in
>     161.37           +84.9%     298.43 ± 6%   perf-stat.ps.cpu-migrations
>     172442 ± 4%      +26.9%     218745 ± 8%   perf-stat.ps.node-load-misses

so it might be that some of the regression comes down to "bad luck" -
it happened to run really nicely on that particular machine, and then
the timing changes caused some random "phase change" to the load.

The profile doesn't actually seem to show all that much more IPI
overhead, so maybe these incidental issues are what then causes that
big regression.

It would be lovely to hear if you see this on other machines and/or
loads. Because if it's a one-off, it's probably best ignored. If it
shows up elsewhere, I think that batching logic might need looking at.

              Linus
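
PS: to make the "split the loop" part above a bit more concrete, the
old munmap path did, very roughly,

        for each pte in the range:
                clear the pte
                if it was a dirty shared page:
                        set_page_dirty()
                page_remove_rmap(page)
                remember the page in the mmu_gather batch
        ... flush the TLB ...
        free the batched pages

while with the fix it's more like

        for each pte in the range:
                clear the pte
                if it was a dirty shared page:
                        set_page_dirty()
                        mark the page for delayed rmap
                else:
                        page_remove_rmap(page)
                remember the page in the mmu_gather batch
        ... flush the TLB ...
        for each batched page marked for delayed rmap:
                page_remove_rmap(page)
        free the batched pages

That's pseudocode from memory, not the literal zap_pte_range() code,
but it's the shape of the change, and it's why there's now an extra
pass over the batched pages after the flush.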
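
PPS: the comment I quoted above is the new check at the top of
tlb_next_batch() in mm/mmu_gather.c, which (again going from memory,
so don't trust the exact form) is essentially

        /* No more batching if we have delayed rmaps pending */
        if (tlb->delayed_rmap)
                return false;

i.e. once a gather has any delayed rmap work queued, we refuse to
chain another batch page, so we hit the flush-and-free point after at
most one batch worth of pages instead of a long chain of them. That's
the part I suspect could be relaxed if this regression shows up
elsewhere too.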