On Sun, Mar 14, 2021 at 04:22:03PM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
>
> I'd like to hear a *LOT* more about how this is going to be used.
>
> The one part of this which is entirely missing is the interaction with
> the TLB and mid-level paging structure caches. The CPU is pretty
> aggressive about setting non-leaf accessed bits when TLB entries are
> created. This *looks* to be depending on that behavior, but it would be
> nice to spell it out explicitly.

Good point. Let me start with a couple of observations we've made:

1) Some applications create very sparse address spaces, for various
reasons. A notable example is those using the Scudo memory allocator:
they usually have double-digit numbers of PTE entries for each PMD
entry (and thousands of VMAs for just a few hundred MBs of memory
usage, sigh...).

2) Scans of an address space (from the reclaim path) are much less
frequent than context switches of it. Even under our heaviest memory
pressure (30%+ overcommitted; guess how much we've profited from
it :) ), the two still differ by orders of magnitude.

Specifically, on our smallest system (2GB, with PCID), we observed no
difference between flushing and not flushing the TLB in terms of page
selection. We actually observed more TLB misses under heavier memory
pressure, and our theory is that this is due to the increased memory
footprint that causes the pressure in the first place.
There are two use cases for the accessed bit on non-leaf PMD entries:
hot tracking and cold tracking. I'll focus on cold tracking, which is
what this series is about.

Since non-leaf entries are more likely to be cached, in theory the
false negative rate is higher compared with leaf entries, as the CPU
won't set the accessed bit again until the next TLB miss. (Here a
false negative means the accessed bit isn't set on an entry that has
been used since we cleared it. And IIRC, there are also false
positives, i.e., the accessed bit is set on entries used only by
speculative execution.) But this is not a problem because of the
second observation above.

Now let's consider the worst-case scenario: what happens when we hit a
false negative on a non-leaf PMD entry? We think the pages mapped by
the PTE entries of this PMD entry are inactive and try to reclaim
them, until we see the accessed bit set on one of the PTE entries.
This costs us one futile attempt for, at most, all 512 PTE entries. A
glance at lru_gen_scan_around() in the 11th patch explains exactly
why. If you are guessing that function embodies the same idea as
"fault around", you are right.

There are also two places that could benefit from this patch (and the
next) immediately, independently of this series. One is
clear_refs_test_walk() in fs/proc/task_mmu.c. The other is
madvise_pageout_page_range() and madvise_cold_page_range() in
mm/madvise.c. Both are page table walkers that clear the accessed bit.

I think I've covered a lot of ground but I'm sure there is a lot more.
So please feel free to add, and I'll include everything we discuss
here in the next version.