On Wed, May 29, 2024 at 7:08 PM James Houghton <jthoughton@xxxxxxxxxx> wrote: > > On Wed, May 29, 2024 at 3:58 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > On Wed, May 29, 2024, Yu Zhao wrote: > > > On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > > > > > On Wed, May 29, 2024, Yu Zhao wrote: > > > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@xxxxxxxxxx> wrote: > > > > > > > > > > > > Secondary MMUs are currently consulted for access/age information at > > > > > > eviction time, but before then, we don't get accurate age information. > > > > > > That is, pages that are mostly accessed through a secondary MMU (like > > > > > > guest memory, used by KVM) will always just proceed down to the oldest > > > > > > generation, and then at eviction time, if KVM reports the page to be > > > > > > young, the page will be activated/promoted back to the youngest > > > > > > generation. > > > > > > > > > > Correct, and as I explained offline, this is the only reasonable > > > > > behavior if we can't locklessly walk secondary MMUs. > > > > > > > > > > Just for the record, the (crude) analogy I used was: > > > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > > > > > but you are only allowed to pick up 10 of them (and put them in your > > > > > pocket). A smart move would be to survey the room *first and then* > > > > > pick up the largest ones. But if you are carrying a 500 lbs backpack, > > > > > you would just want to pick up whichever that's in front of you rather > > > > > than walk the entire room. > > > > > > > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > > > > > done lockless. Otherwise, it should just fall back to the existing > > > > > approach, which existed in previous versions but is removed in this > > > > > version. > > > > > > > > IIUC, by "existing approach" you mean completely ignore secondary MMUs that > > > > don't implement a lockless walk? > > > > > > No, the existing approach only checks secondary MMUs for LRU folios, > > > i.e., those at the end of the LRU list. It might not find the best > > > candidates (the coldest ones) on the entire list, but it doesn't pay > > > as much for the locking. MGLRU can *optionally* scan MMUs (secondary > > > included) to find the best candidates, but it can only be a win if the > > > scanning incurs a relatively low overhead, e.g., done locklessly for > > > the secondary MMU. IOW, this is a balance between the cost of > > > reclaiming not-so-cold (warm) folios and that of finding the coldest > > > folios. > > > > Gotcha. > > > > I tend to agree with Yu, driving the behavior via a Kconfig may generate simpler > > _code_, but I think it increases the overall system complexity. E.g. distros > > will likely enable the Kconfig, and in my experience people using KVM with a > > distro kernel usually aren't kernel experts, i.e. likely won't know that there's > > even a decision to be made, let alone be able to make an informed decision. > > > > Having an mmu_notifier hook that is conditionally implemented doesn't seem overly > > complex, e.g. even if there's a runtime aspect at play, it'd be easy enough for > > KVM to nullify its mmu_notifier hook during initialization. The hardest part is > > likely going to be figuring out the threshold for how much overhead is too much. > > Hi Yu, Sean, > > Perhaps I "simplified" this bit of the series a little bit too much. > Being able to opportunistically do aging with KVM (even without > setting the Kconfig) is valuable. > > IIUC, we have the following possibilities: > - v4: aging with KVM is done if the new Kconfig is set. > - v3: aging with KVM is always done. This is not true -- in v3, MGLRU only scans secondary MMUs if it can be done locklessly on x86. It uses a bitmap to imply this requirement. > - v2: aging with KVM is done when the architecture reports that it can > probably be done locklessly, set at KVM MMU init time. Not really -- it's only done if it can be done locklessly on both x86 and arm64. > - Another possibility?: aging with KVM is only done exactly when it > can be done locklessly (i.e., mmu_notifier_test/clear_young() called > such that it will not grab any locks). This is exactly the case for v2. > I like the v4 approach because: > 1. We can choose whether or not to do aging with KVM no matter what > architecture we're using (without requiring userspace be aware to > disable the feature at runtime with sysfs to avoid regressing > performance if they don't care about proactive reclaim). > 2. If we check the new feature bit (0x8) in sysfs, we can know for > sure if aging is meant to be working or not. The selftest changes I > made won't work properly unless there is a way to be sure that aging > is working with KVM. I'm not convinced, but it doesn't mean your point of view is invalid. If you fully understand the implications of your design choice and document them, I will not object. All optimizations in v2 were measured step by step. Even that bitmap, which might be considered overengineered, brought a readily measuarable 4% improvement in memcached throughput on Altra Max swapping to Optane: Using the bitmap (64 KVM PTEs for each call) ============================================================================================================================ Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec ---------------------------------------------------------------------------------------------------------------------------- Sets 0.00 --- --- --- --- --- --- 0.00 Gets 1012801.92 431436.92 14965.11 0.06246 0.04700 0.16700 4.31900 39635.83 Waits 0.00 --- --- --- --- --- --- --- Totals 1012801.92 431436.92 14965.11 0.06246 0.04700 0.16700 4.31900 39635.83 Not using the bitmap (1 KVM PTEs for each call) ============================================================================================================================ Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec ---------------------------------------------------------------------------------------------------------------------------- Sets 0.00 --- --- --- --- --- --- 0.00 Gets 968210.02 412443.85 14303.89 0.06517 0.04700 0.15900 7.42300 37890.74 Waits 0.00 --- --- --- --- --- --- --- Totals 968210.02 412443.85 14303.89 0.06517 0.04700 0.15900 7.42300 37890.74 FlameGraphs with bitmap (1.svg) and without bitmap (2.svg) attached. What I don't think is acceptable is simplifying those optimizations out without documenting your justifications (I would even call it a design change, rather than simplification, from v3 to v4). > For look-around at eviction time: > - v4: done if the main mm PTE was young and no MMU notifiers are subscribed. > - v2/v3: done if the main mm PTE was young or (the SPTE was young and > the MMU notifier was lockless/fast). The host and secondary MMUs are two *independent* cases, IMO: 1. lookaround the host MMU if the PTE mapping the folio under reclaim is young. 2. lookaround the secondary MMU if it can be done locklessly. So the v2/v3 behavior sounds a lot more reasonable to me. Also a nit -- don't use 'else' in the following case (should_look_around()): if (foo) return bar; else do_something(); > I made this logic change as part of removing batching. > > I'd really appreciate guidance on what the correct thing to do is. > > In my mind, what would work great is: by default, do aging exactly > when KVM can do it locklessly, and then have a Kconfig to always have > MGLRU to do aging with KVM if a user really cares about proactive > reclaim (when the feature bit is set). The selftest can check the > Kconfig + feature bit to know for sure if aging will be done. I still don't see how that Kconfig helps. Or why the new static branch isn't enough? > I'm not sure what the exact right thing to do for look-around is. > > Thanks for the quick feedback.
Attachment:
2.svg
Description: image/svg
Attachment:
1.svg
Description: image/svg