On Tue, Jul 9, 2024 at 10:49 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Jul 08, 2024, James Houghton wrote:
> > On Fri, Jun 28, 2024 at 7:38 PM James Houghton <jthoughton@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > I still don't think we should get rid of the WAS_FAST stuff.
>
> I do :-)
>
> > The assumption that the L1 VM will almost never share pages between L2
> > VMs is questionable. The real question becomes: do we care to have
> > accurate age information for this case? I think so.
>
> I think you're conflating two different things. WAS_FAST isn't about
> accuracy, it's about supporting lookaround in conditionally fast
> secondary MMUs.
>
> Accuracy only comes into play when we're talking about the last-minute
> check, which, IIUC, has nothing to do with WAS_FAST because any
> potential lookaround has already been performed.

Sorry, I thought you meant: have the MMU notifier only ever be lockless
(when tdp_mmu_enabled), and just return a potentially wrong result in
the unlikely case that L1 is sharing pages between L2s.

I think it's totally fine to just drop WAS_FAST. So then we can either
do look-around (1) always, or (2) only when there is a secondary MMU
with has_fast_aging. (2) is pretty simple, I'll just do that. We can
add some shadow MMU lockless support later to make the look-around not
as useless for the nested TDP case.

> > It's not completely trivial to get the lockless walking of the shadow
> > MMU rmaps correct either (please see the patch I attached here[1]).
>
> Heh, it's not correct. Invoking synchronize_rcu() in
> kvm_mmu_commit_zap_page() is illegal, as mmu_lock (rwlock) is held and
> synchronize_rcu() might_sleep().
>
> For kvm_test_age_rmap_fast(), KVM can blindly read READ_ONCE(*sptep).
> KVM might read garbage, but that would be an _extremely_ rare scenario,
> and reporting a zapped page as being young is acceptable in that 1 in a
> billion situation.
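[For readers following along: the "blindly read READ_ONCE(*sptep)" idea
can be sketched in plain userspace C. This is not KVM code -- the SPTE is
modeled as a bare uint64_t, spte_read_once() stands in for the kernel's
READ_ONCE() macro, and SPTE_ACCESSED_MASK is a made-up placeholder for
shadow_accessed_mask (the real bit position depends on EPT vs. legacy
shadow paging).]

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for KVM's shadow_accessed_mask; position is illustrative. */
#define SPTE_ACCESSED_MASK (1ULL << 5)

/* Model of READ_ONCE(): a single volatile load, so the compiler can
 * neither tear the access nor re-read the SPTE behind our back. */
static inline uint64_t spte_read_once(const uint64_t *sptep)
{
	return *(const volatile uint64_t *)sptep;
}

/*
 * Lockless test-age: load the SPTE exactly once and report whether the
 * accessed bit is set.  If the SPTE is concurrently zapped we may read
 * a stale or garbage value, and in that rare case we can report a dead
 * page as young -- an acceptable false positive for aging.
 */
static bool spte_test_age(const uint64_t *sptep)
{
	return spte_read_once(sptep) & SPTE_ACCESSED_MASK;
}
```

[The key property is that the load is never retried and never depends on
the SPTE staying valid, which is why no lock is needed on this path.]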
> For kvm_age_rmap_fast(), i.e. where KVM needs to write, I'm pretty sure
> KVM can handle that by rechecking the rmap and using CMPXCHG to write
> the SPTE. If the rmap is unchanged, then the old SPTE value is
> guaranteed to be valid, in the sense that its value most definitely
> came from a KVM shadow page table. Ah, drat, that won't work, because
> very theoretically, the page table could be freed, reallocated, and
> rewritten with the exact same value by something other than KVM. Hrm.
>
> Looking more closely, I think we can go straight to supporting rmap
> walks outside of mmu_lock. There will still be a "lock", but it will
> be a *very* rudimentary lock, akin to the TDP MMU's REMOVED_SPTE
> approach. Bit 0 of rmap_head->val is used to indicate "many", while
> bits 63:3/31:2 on 64-bit/32-bit KVM hold the pointer (to a SPTE or a
> list). That means bit 1 is available for shenanigans.
>
> If we use bit 1 to lock the rmap, then the fast mmu_notifier can safely
> walk the entire rmap chain. And with a reader/writer scheme, the rmap
> walks that are performed under mmu_lock don't need to lock the rmap,
> which means flows like kvm_mmu_zap_collapsible_spte() don't need to be
> modified to avoid recursive self-deadlock. Lastly, the locking can be
> conditioned on the rmap being valid, i.e. having at least one SPTE.
> That way the common case of a gfn not having any rmaps is a glorified
> nop.
>
> Adding the locking isn't actually all that difficult, with the *huge*
> caveat that the below patch is compile-tested only. The vast majority
> of the churn is to make it so existing code ignores the new
> KVM_RMAP_LOCKED bit.

This is very interesting, thanks for laying out how this could be done.
I don't want to hold this series up on getting the details of the
shadow MMU lockless walk exactly right. :)

> I don't know that we should pursue such an approach in this series
> unless we have to. E.g.
> if we can avoid WAS_FAST or don't have to carry too much intermediate
> complexity, then it'd probably be better to land the TDP MMU support
> first and then add nested TDP support later.

Agreed!

> At the very least, it does make me more confident that a fast walk of
> the rmaps is very doable (at least for nested TDP), i.e. makes me even
> more steadfast against adding WAS_FAST.
>
> > And the WAS_FAST functionality isn't even that complex to begin with.
>
> I agree the raw code isn't terribly complex, but it's not trivial
> either. And the concept and *behavior* is complex, which is just as
> much of a maintenance burden as the code itself. E.g. it requires
> knowing that KVM has multiple MMUs buried behind a single mmu_notifier,
> and that a "hit" on the fast MMU will trigger lookaround on the fast
> MMU, but not the slow MMU. Understanding and describing the
> implications of that behavior isn't easy. E.g. if GFN=X is young in
> the TDP MMU, but X+1..X+N are young only in the shadow MMU, is doing
> lookaround and making decisions based purely on the TDP MMU state the
> "right" behavior?
>
> I also really don't like bleeding KVM details into the mmu_notifier
> APIs. The need for WAS_FAST is 100% a KVM limitation. AFAIK, no other
> secondary MMU has multiple MMU implementations active behind a single
> notifier, and other than lack of support, nothing fundamentally
> prevents a fast query in the shadow MMU.

Makes sense. So in v6, I will make the following changes:

1. Drop the WAS_FAST complexity.
2. Add a function like mm_has_fast_aging_notifiers(), use that to
   determine if we should be doing look-around.
3. Maybe change the notifier calls slightly[1], still need to check
   performance.

Does that sound good to you? Thanks!

[1]: https://lore.kernel.org/linux-mm/CAOUHufb2f_EwHY5LQ59k7Nh7aS1-ZbOKtkoysb8BtxRNRFMypQ@xxxxxxxxxxxxxx/
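[As a footnote for anyone sketching the bit-1 rmap lock at home, here is
a minimal userspace model of the scheme Sean describes above. Only the
KVM_RMAP_LOCKED name comes from the discussion; kvm_rmap_lock() and
kvm_rmap_unlock() are hypothetical helpers invented for illustration,
and a real patch would additionally cover the walkers that run under
mmu_lock, which per the reader/writer scheme do not take this bit-lock.]

```c
#include <stdbool.h>
#include <stdint.h>

#define KVM_RMAP_MANY   0x1UL  /* bit 0: val points to a list of SPTEs */
#define KVM_RMAP_LOCKED 0x2UL  /* bit 1: a lockless walker owns the chain */

struct kvm_rmap_head {
	unsigned long val;
};

/*
 * Lock a non-empty rmap by atomically setting bit 1, spinning if
 * another lockless walker already holds it (a very rudimentary lock,
 * akin to the TDP MMU's REMOVED_SPTE approach).  An empty rmap is
 * never locked, so the common case of a gfn with no rmaps stays a
 * glorified nop.  Returns a snapshot of the unlocked value (pointer
 * bits plus the "many" bit), or 0 if the rmap is empty.
 */
static unsigned long kvm_rmap_lock(struct kvm_rmap_head *head)
{
	unsigned long old_val;

	for (;;) {
		old_val = __atomic_load_n(&head->val, __ATOMIC_RELAXED);
		if (!old_val)
			return 0;	/* empty rmap: nothing to walk */
		if (old_val & KVM_RMAP_LOCKED)
			continue;	/* spin until the holder is done */
		if (__atomic_compare_exchange_n(&head->val, &old_val,
						old_val | KVM_RMAP_LOCKED,
						false, __ATOMIC_ACQUIRE,
						__ATOMIC_RELAXED))
			return old_val;
	}
}

static void kvm_rmap_unlock(struct kvm_rmap_head *head, unsigned long val)
{
	/* Publish the (possibly updated) value, clearing the lock bit. */
	__atomic_store_n(&head->val, val & ~KVM_RMAP_LOCKED,
			 __ATOMIC_RELEASE);
}
```

[The asymmetry is the point: only the fast, lockless notifier path pays
for the CMPXCHG, while walks already serialized by mmu_lock are
untouched, which is how recursive self-deadlock is avoided.]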