On Fri, May 31, 2024 at 1:31 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote: > > On Fri, May 31, 2024 at 1:24 AM Oliver Upton <oliver.upton@xxxxxxxxx> wrote: > > > > On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote: > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@xxxxxxxxxx> wrote: > > > > > > > > Secondary MMUs are currently consulted for access/age information at > > > > eviction time, but before then, we don't get accurate age information. > > > > That is, pages that are mostly accessed through a secondary MMU (like > > > > guest memory, used by KVM) will always just proceed down to the oldest > > > > generation, and then at eviction time, if KVM reports the page to be > > > > young, the page will be activated/promoted back to the youngest > > > > generation. > > > > > > Correct, and as I explained offline, this is the only reasonable > > > behavior if we can't locklessly walk secondary MMUs. > > > > > > Just for the record, the (crude) analogy I used was: > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > > > but you are only allowed to pick up 10 of them (and put them in your > > > pocket). A smart move would be to survey the room *first and then* > > > pick up the largest ones. But if you are carrying a 500 lbs backpack, > > > you would just want to pick up whichever that's in front of you rather > > > than walk the entire room. > > > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > > > done lockless. Otherwise, it should just fall back to the existing > > > approach, which existed in previous versions but is removed in this > > > version. > > > > Grabbing the MMU lock for write to scan sucks, no argument there. But > > can you please be specific about the impact of read lock v. RCU in the > > case of arm64? I had asked about this before and you never replied. > > > > My concern remains that adding support for software table walkers > > outside of the MMU lock entirely requires more work than just deferring > > the deallocation to an RCU callback. Walkers that previously assumed > > 'exclusive' access while holding the MMU lock for write must now cope > > with volatile PTEs. > > > > Yes, this problem already exists when hardware sets the AF, but the > > lock-free walker implementation needs to be generic so it can be applied > > for other PTE bits. > > Direct reclaim is multi-threaded and each reclaimer can take the mmu > lock for read (testing the A-bit) or write (unmapping before paging > out) on arm64. The fundamental problem of using the readers-writer > lock in this case is priority inversion: the readers have lower > priority than the writers, so ideally, we don't want the readers to > block the writers at all. > > Using my previous (crude) analogy: puting the bill right in front of > you (the writers) profits immediately whereas searching for the > largest bill (the readers) can be futile. > > As I said earlier, I prefer we drop the arm64 support for now, but I > will not object to taking the mmu lock for read when clearing the > A-bit, as long as we fully understand the problem here and document it > clearly. FWIW, Google Cloud has been doing proactive reclaim and kstaled-based aging (a Google-internal page aging daemon, for those outside of Google) for many years on x86 VMs with the A-bit harvesting under the write-lock. So I'm skeptical that making ARM64 lockless is necessary to allow Secondary MMUs to participate in MGLRU aging with acceptable performance for Cloud usecases. I don't even think it's necessary on x86 but it's a simple enough change that we might as well just do it. I suspect under pathological conditions (host under intense memory pressure and high rate of reclaim occurring) making A-bit harvesting lockless will perform better. But under such conditions VM performance is likely going to suffer regardless. In a Cloud environment we deal with that through other mechanisms to reduce the rate of reclaim and make the host healthy. For these reasons, I think there's value in giving users the option to enable Secondary MMUs participation MGLRU aging even when A-bit test/clearing is not done locklessly. I believe this was James' intent with the Kconfig. Perhaps a default-off writable module parameter would be better to avoid distros accidentally turning it on? If and when there is a usecase for optimizing VM performance under pathological reclaim conditions on ARM, we can make it lockless then.