On Mon, Dec 5, 2022 at 6:03 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> >
> > I assume that this test is doing a lot of mmap/munmap on dirty shared
> > memory regions (both because of the regression, and because of the
> > name of that test ;)
>
> I have checked the source code of will-it-scale/page_fault3. Yes, it
> exactly does that.

Heh. I took a look at that test-case, and yeah, it's just doing a 128MB
shared mapping, dirtying it one page at a time, and unmapping it in a
loop.

It doesn't even look like a very good benchmark for that, because the
_first_ time around the loop is very different: it has to actually
create the file extents. So the benchmark starts out testing something
different from what the steady state is.

But yeah, that's pretty much the worst possible case for this all, and
yes, I suspect it's more about the TLB batching than anything else.

And I think I see the issue. We actually have a reasonably big batch
size most of the time, but this benchmark triggers that dirty shared
page logic on every page, and that in turn means that we stop batching
immediately - even when we only have the initial tiny on-stack batch.

So instead of batching MAX_GATHER_BATCH pages at a time (roughly 500
pages per go), we end up batching just eight pages (MMU_GATHER_BUNDLE)
at a time. I didn't think of that degenerate case.

Let me think about this a while, but I think I'll have a patch for you
to test once I've dealt with a couple more pull requests.

                Linus
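
For reference, here is a minimal user-space sketch of the access pattern
described above. It is based only on the description in this mail, not on
the actual will-it-scale/page_fault3 source; the file name, iteration
count, and error handling are made up for illustration.

/* Sketch: 128MB shared file mapping, dirtied one page at a time, then
 * unmapped, in a loop - the pattern described in the mail above. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAPSIZE (128UL * 1024 * 1024)   /* 128MB shared mapping */

int main(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        int fd = open("testfile", O_RDWR | O_CREAT, 0600); /* hypothetical name */

        if (fd < 0 || ftruncate(fd, MAPSIZE) < 0) {
                perror("setup");
                return 1;
        }

        for (int iter = 0; iter < 100; iter++) {        /* arbitrary count */
                char *map = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
                if (map == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                /* Dirty every page: each write faults in and dirties one
                 * page of the shared mapping. */
                for (unsigned long off = 0; off < MAPSIZE; off += pagesize)
                        map[off] = 1;

                /* The munmap() is where the mmu_gather / TLB-batching cost
                 * shows up: every page is a dirty shared page, which is the
                 * case that defeats the normal batching. */
                munmap(map, MAPSIZE);
        }

        close(fd);
        unlink("testfile");
        return 0;
}

To put rough numbers on the degenerate case: assuming 4KB pages, a 128MB
mapping is 32768 pages, so at roughly 500 pages per batch the unmap takes
on the order of 66 flush rounds, while at 8 pages per batch it takes 4096.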