On Mon, Nov 22, 2021, Ben Gardon wrote: > On Mon, Nov 22, 2021 at 3:08 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > On Mon, Nov 22, 2021, Ben Gardon wrote: > > > On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > > > > > Take TDP MMU roots off the list of roots when they're invalidated instead > > > > of walking later on to find the roots that were just invalidated. In > > > > addition to making the flow more straightforward, this allows warning > > > > if something attempts to elevate the refcount of an invalid root, which > > > > should be unreachable (no longer on the list so can't be reached by MMU > > > > notifier, and vCPUs must reload a new root before installing new SPTE). > > > > > > > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx> > > > > > > There are a bunch of awesome little cleanups and unrelated fixes > > > included in this commit that could be factored out. > > > > > > I'm skeptical of immediately moving the invalidated roots into another > > > list as that seems like it has a lot of potential for introducing > > > weird races. > > > > I disagree, the entire premise of fast invalidate is that there can't be races, > > i.e. mmu_lock must be held for write. IMO, it's actually the opposite, as the only > > reason leaving roots on the per-VM list doesn't have weird races is because slots_lock > > is held. If slots_lock weren't required to do a fast zap, which is feasible for the > > TDP MMU since it doesn't rely on the memslots generation, then it would be possible > > for multiple calls to kvm_tdp_mmu_zap_invalidated_roots() to run in parallel. And in > > that case, leaving roots on the per-VM list would lead to a single instance of a > > "fast zap" zapping roots it didn't invalidate. That's wouldn't be problematic per se, > > but I don't like not having a clear "owner" of the invalidated root. > > That's a good point, the potential interleaving of zap_alls would be gross. > > My mental model for the invariant here was "roots that are still in > use are on the roots list," but I can see how "the roots list contains > all valid, in-use roots" could be a more useful invariant. Sadly, my idea of taking invalid roots off the list ain't gonna happen. The immediate issue is that the TDP MMU doesn't zap invalid roots in mmu_notifier callbacks. This leads to use-after-free and other issues if the mmu_notifier runs to completion while an invalid root zapper yields as KVM fails to honor the requirement that there must be _no_ references to the page after the mmu_notifier returns. This is most noticeable with set_nx_huge_pages() + kvm_mmu_notifier_release(), but the bug exists between kvm_mmu_notifier_invalidate_range_start() and memslot updates as well. The pages aren't accessible by the guest, and KVM won't read or write the data itself, but KVM will trigger e.g. kvm_set_pfn_dirty() when zapping SPTEs, _after_ the mmu_notifier returns.