On Wed, May 24, 2023 at 09:46:13AM -0700, Sean Christopherson wrote:
> On Wed, May 24, 2023, Peter Xu wrote:
> > On Mon, May 22, 2023 at 01:46:41PM -0700, Sean Christopherson wrote:
> > > As for the flags vs. bools debate (see link above), I think the best approach is
> > > a mix of the two.  Specifically, reuse the FOLL_* flags as-is for inputs, and use
> > > booleans for outputs.  I don't _think_ there are any input bools/flags that don't
> > > map 1:1 with existing FOLL_* flags.
> > >
> > > As a very, *very* rough sketch, provide APIs that look a bit like this.
> >
> > Unifying ref vs nonref cases does look a bit cleaner to me too.
> >
> > >
> > >         kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > >         {
> > >                 kvm_pfn_t pfn;
> > >
> > >                 if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll.mmu_seq))
> >
> > IMHO we may not want to rely on mmu_seq==0 either for unlucky very initial
> > mmu_seq being zero, or avoid overflows?
>
> I was thinking we could initialize mmu_seq to '1' and make it a u64 to avoid
> overflow.

Yeah, that's fine too.

>
> > I'd say we can stick with FOLL_GET in this case to identify ref vs nonref
> > and always assume mmu_seq a pure random number.
>
> The intent of checking mmu_seq is to flag cases where the caller doesn't specify
> FOLL_GET and isn't protected by mmu_invalidate_seq, i.e. isn't tapped into the
> mmu_notifiers.  I.e. this is a sanity check, not functionally necessary.
>
> > >                         return KVM_PFN_ERR_FAULT;
> > >
> > >                 pfn = ???;
> > >
> > >                 if (foll->page && !(foll->flags & FOLL_GET))
> > >                         put_page(foll->page);
> > >
> > >                 return pfn;
> > >         }
> > >
> > >         kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
> > >         {
> > >                 struct kvm_follow_pfn foll = {
> > >                         .flags = FOLL_GET | FOLL_WRITE,
> > >                 };
> > >
> > >                 <more stuff here?>
> > >
> > >                 foll.slot = ???;
> > >                 if (!foll.slot || foll.slot->flags & KVM_MEMSLOT_INVALID)
> > >                         return KVM_HVA_ERR_BAD;
> > >
> > >                 if (memslot_is_readonly(foll.slot))
> > >                         return KVM_HVA_ERR_RO_BAD;
> > >
> > >                 return __kvm_follow_pfn(&foll);
> > >         }
> > >
> > > and a few partially converted users
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 67e2ac799aa7..5eaf0395ed87 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -550,12 +550,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> > >
> > >         if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> > >                 flush = true;
> > > -               kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > > +               if (is_refcounted_page_pte(old_spte))
> >
> > One question is how to impl is_refcounted_page_pte() here to identify
> > non-refcountable pages.
>
> KVM would use a software available bit in its PTEs to explicitly track which SPTEs
> point at refcounted pages.  E.g. I think bit 59 is available for EPT and 64-bit
> paging.  PAE paging doesn't have high available bits, which is why I called out
> that this would have to be 64-bit only.
>
> > IIUC those pages are mostly identical to a normal page (so !PG_reserved)
> > but it has page_ref_count(page)==0 always, am I right?  I got that roughly
> > from reading f8be156be1 only though, so I could miss a lot of things..
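To make the software-available-bit idea above a bit more concrete, a minimal
sketch of is_refcounted_page_pte() could look like the below.  The bit position
(bit 59, per the suggestion above) and the SPTE_MMU_PAGE_REFCOUNTED name are
placeholders for illustration only, not something the series has settled on:

        /* Hypothetical software-available SPTE bit; 64-bit paging / EPT only. */
        #define SPTE_MMU_PAGE_REFCOUNTED        BIT_ULL(59)

        static inline bool is_refcounted_page_pte(u64 spte)
        {
                return !!(spte & SPTE_MMU_PAGE_REFCOUNTED);
        }

i.e. the fault path would set the bit when the resolved pfn turns out to be
backed by a refcounted page, and the zap/A-D paths would check it before
calling kvm_set_pfn_accessed()/kvm_set_pfn_dirty().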
> >
> > When thinking about that, I'm also wondering whether we can trivially allow
> > kvm to support such mapping (without overhaul of the kvm pfn API) by
> > something like this:
> >
> > ===8<===
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 51e4882d0873..467acbac1a96 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -192,7 +192,13 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
> >
> >         page = pfn_to_page(pfn);
> >         if (!PageReserved(page))
> > -               return page;
> > +               /*
> > +                * When page_ref_count(page)==0 it might be a special page
> > +                * that does not support refcounting.  Treat them the same
> > +                * as normal reserved (e.g. MMIO) pages by returning NULL,
> > +                * so they're exempt from refcounting.
> > +                */
> > +               return page_ref_count(page) == 0 ? NULL : page;
>
> Heh, because I got burned by this recently, using page_ref_count() is wrong.  This
> needs to be page_count() so that tail pages of refcounted compound pages are
> properly identified.

:-D

Actually when I was replying I explicitly didn't use page_count(), to make sure
we're reading the tail page, but I just noticed that's exactly how we identify
the special page with a PageCompound()==true tail page.

Yeah, if we'd like that it needs to be page_count()==0.

>
> >
> >         /* The ZERO_PAGE(s) is marked PG_reserved, but is refcounted. */
> >         if (is_zero_pfn(pfn))
> > ===8<===
> >
> > So that we treat those special pages the same as normal PFNMAP ones by
> > skipping all refcountings on inc/dec.  This is based on the fact that kvm
> > should always hold at least 1 ref on a normal page so a normal page should
> > never hit ref==0 here, but again I could miss something somewhere..
>
> This would "work" from a functionality perspective, and might be acceptable as an
> out-of-tree patch to unblock the ChromeOS use case, but I don't want to rely on
> this heuristic on the backend in KVM because it will suppress any and all
> use-after-free bugs in KVM's MMU (see patch 4 of this series).  I really want to
> go in the opposite direction and harden KVM against MMU bugs, e.g. I'm planning
> on posting the below (which is how I learned about page_count() vs. page_ref_count()).
>
> Today, KVM gets partial protection from check_new_page_bad(), which detects *some*
> cases where KVM marks a page dirty after the page is freed.  But it's racy, and
> the detection occurs well after the fact since it fires only when the page is
> re-allocated.
>
> If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
> because KVM would drop its assertions and also skip dirtying pages, i.e. would
> effectively suppress the latent detection by check_new_page_bad().

It's probably that I simply have no idea what the attributes of those special
pages are, so I don't understand well enough why we need to handle them
differently from e.g. PFNMAP pages, nor what the benefits are.  What I can tell
is that they're pages that don't have the PageCompound bit set on either the
head or the tails, yet they're still higher-order large pages.

Is there an example of how these pages are used and allocated?  Why would we
need those pages, and do they need to be marked dirty/accessed at all?
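For the page_count() vs. page_ref_count() distinction above: page_ref_count()
reads the _refcount of the exact struct page it is given, which is always 0 for
a tail page of a compound page, while page_count() resolves compound_head()
first and so reports the head page's (possibly elevated) refcount.  As a
contrived illustration, with a made-up helper name:

        /* Illustration only: why such a check needs page_count(). */
        static bool page_looks_refcounted(struct page *page)
        {
                /*
                 * page_count() reads the refcount of compound_head(page), so a
                 * tail page of a pinned hugepage correctly reports "refcounted".
                 * page_ref_count(page) would read the tail page itself and
                 * always return 0, misclassifying it.
                 */
                return page_count(page) != 0;
        }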
>
> Author: Sean Christopherson <seanjc@xxxxxxxxxx>
> Date:   Wed May 17 13:26:54 2023 -0700
>
>     KVM: Assert that a page's refcount is elevated when marking accessed/dirty
>
>     Assert that a page's refcount is elevated, i.e. that _something_ holds a
>     reference to the page, when KVM marks a page as accessed and/or dirty.
>     KVM typically doesn't hold a reference to pages that are mapped into the
>     guest, e.g. to allow page migration, compaction, swap, etc., and instead
>     relies on mmu_notifiers to react to changes in the primary MMU.
>
>     Incorrect handling of mmu_notifier events (or similar mechanisms) can
>     result in KVM keeping a mapping beyond the lifetime of the backing page,
>     i.e. can (and often does) result in use-after-free.  Yelling if KVM marks
>     a freed page as accessed/dirty doesn't prevent badness as KVM usually
>     only does A/D updates when unmapping memory from the guest, i.e. the
>     assertion fires well after an underlying bug has occurred, but yelling
>     does help detect, triage, and debug use-after-free bugs.
>
>     Note, the assertion must use page_count(), NOT page_ref_count()!  For
>     hugepages, the returned struct page may be a tail page and thus not have
>     its own refcount.
>
>     Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1abb331ea68..64f18697096c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2882,6 +2882,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
>
>  static bool kvm_is_ad_tracked_page(struct page *page)
>  {
> +       /*
> +        * Assert that KVM isn't attempting to mark a freed page as Accessed or
> +        * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug.  KVM
> +        * (typically) doesn't pin pages that are mapped in KVM's MMU, and
> +        * instead relies on mmu_notifiers to know when a mapping needs to be
> +        * zapped/invalidated.  Unmapping from KVM's MMU must happen _before_
> +        * KVM returns from its mmu_notifier, i.e. the page should have an
> +        * elevated refcount at this point even though KVM doesn't hold a
> +        * reference of its own.
> +        */
> +       if (WARN_ON_ONCE(!page_count(page)))
> +               return false;
> +
>         /*
>          * Per page-flags.h, pages tagged PG_reserved "should in general not be
>          * touched (e.g. set dirty) except by its owner".
>

This looks like a good thing to have, indeed.  But again it doesn't seem like
anything specific to the pages we're discussing here, say, !Compound &&
refcount==0 ones.

-- 
Peter Xu