On Wed, Sep 6, 2023 at 9:45 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Sep 05, 2023, David Stevens wrote:
> > On Wed, Jul 12, 2023 at 7:00 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Jul 11, 2023, Zhi Wang wrote:
> > > > On Thu, 6 Jul 2023 15:49:39 +0900
> > > > David Stevens <stevensd@xxxxxxxxxxxx> wrote:
> > > >
> > > > > On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang <zhi.wang.linux@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, 4 Jul 2023 16:50:48 +0900
> > > > > > David Stevens <stevensd@xxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > If yes, do we have to use FOLL_GET to resolve the GFN associated with a
> > > > > > tail page? It seems gup can tolerate gup_flags without FOLL_GET, but that
> > > > > > is more of a temporary solution. I don't think it is a good idea to play
> > > > > > tricks with a temporary solution; it feels more like we are abusing that
> > > > > > tolerance.
> > > > >
> > > > > I'm not sure I understand what you're getting at. This series never
> > > > > calls gup without FOLL_GET.
> > > > >
> > > > > This series aims to provide kvm_follow_pfn as a unified API on top of
> > > > > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > > > > notifier, it makes sense to support returning a pfn without taking a
> > > > > reference. And we indeed need to do that for certain types of memory.
> > > >
> > > > I don't have a problem with taking a pfn without taking a ref. I am
> > > > questioning whether using !FOLL_GET in struct kvm_follow_pfn to indicate
> > > > taking a pfn without a ref is a good idea, when there is another flag
> > > > that actually conveys it.
> > > >
> > > > I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> > > > translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> > > > flags. However, FOLL_XXX is for GUP. Using FOLL_XXX to reflect the
> > > > requirements of GUP in the code path that is going to call GUP is
> > > > reasonable.
> > > >
> > > > But using FOLL_XXX for purposes that are not related to the GUP call
> > > > really feels off.
> > >
> > > I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
> > > that handles non-refcounted pages, i.e. this
> > >
> > >         if (get_page_unless_zero(page)) {
> > >                 foll->is_refcounted_page = true;
> > >                 if (!(foll->flags & FOLL_GET))
> > >                         put_page(page);
> > >         } else if (foll->flags & FOLL_GET) {
> > >                 r = -EFAULT;
> > >         }
> > >
> > > should be
> > >
> > >         if (get_page_unless_zero(page)) {
> > >                 foll->is_refcounted_page = true;
> > >                 if (!(foll->flags & FOLL_GET))
> > >                         put_page(page);
> > >                 else if (!foll->guarded_by_mmu_notifier)
> > >                         r = -EFAULT;
> > >
> > > because it's not the desire to grab a reference that makes getting non-refcounted
> > > pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.
> > >
> > > Though that highlights that checking guarded_by_mmu_notifier should be done for
> > > *all* non-refcounted pfns, not just non-refcounted struct page memory.
> >
> > I think things are getting confused here because there are multiple
> > things which "safe" refers to. There are three different definitions
> > that I think are relevant here:
> >
> > 1) "safe" in the sense that KVM doesn't corrupt page reference counts
> > 2) "safe" in the sense that KVM doesn't access pfns after they have been freed
> > 3) "safe" in the sense that KVM doesn't use stale hva -> pfn translations
> >
> > For property 1, FOLL_GET is important. If the caller passes FOLL_GET,
> > then they expect to be able to pass the returned pfn to
> > kvm_release_pfn. This means that when FOLL_GET is set, if
> > kvm_pfn_to_refcounted_page returns a page, then hva_to_pfn_remapped
> > must take a reference count to avoid eventually corrupting the page
> > ref count. I guess replacing the FOLL_GET check with
> > !guarded_by_mmu_notifier is logically equivalent, because
> > __kvm_follow_pfn requires that at least one of guarded_by_mmu_notifier
> > and FOLL_GET is set. But since we're concerned about a property of the
> > refcount, I think that checking FOLL_GET is clearer.
> >
> > For property 2, FOLL_GET is also important. If guarded_by_mmu_notifier
> > is set, then we're all good here. If guarded_by_mmu_notifier is not
> > set, then the check in __kvm_follow_pfn guarantees that FOLL_GET is
> > set. For struct page memory, we're safe because KVM will hold a
> > reference as long as it's still using the page. For non-struct-page
> > memory, we're not safe - this is where the breaking change of
> > allow_unsafe_mappings would go. Note that for non-refcounted struct
> > pages, we can't use the allow_unsafe_mappings escape hatch. Since
> > FOLL_GET was requested, if we returned such a page, then the caller
> > would eventually corrupt the page refcount via kvm_release_pfn.
>
> Yes we can. The caller simply needs to be made aware of is_refcounted_page. I
> didn't include that in the snippet below because I didn't want to write the entire
> patch. The whole point of adding is_refcounted_page is so that callers can
> identify exactly what type of page was at the end of the trail that was followed.

Are you asking me to completely migrate every caller of any gfn_to_pfn
variant to __kvm_follow_pfn, so that they can respect
is_refcounted_page? That's the only way to make it safe for
allow_unsafe_mappings to apply to non-refcounted pages. That is
decidedly not simple. Or is kvm_vcpu_map the specific call site you
care about? At best, I can try to migrate x86, and then just add some
sort of compatibility shim for other architectures that rejects
non-refcounted pages.
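
For concreteness, my understanding is that each migrated call site would
need something like the following (rough sketch, completely untested;
kvm_vcpu_map shown as the example, and the map->is_refcounted_page field
is hypothetical):

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_GET | FOLL_WRITE,
	};

	map->pfn = __kvm_follow_pfn(&foll);
	if (is_error_noslot_pfn(map->pfn))
		return -EINVAL;

	/* Remember whether a reference was actually taken. */
	map->is_refcounted_page = foll.is_refcounted_page;

and the unmap path would then only drop a reference when one was taken:

	if (map->is_refcounted_page)
		kvm_release_pfn_dirty(map->pfn);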

> > Property 3 would be nice, but we've already concluded that guarding
> > all translations with mmu notifiers is infeasible. So maintaining
> > property 2 is the best we can hope for.
>
> No, #3 is just a variant of #2. Unless you're talking about not making guarantees
> about guest accesses being ordered with respect to VMA/memslot updates, but I
> don't think that's the case.

I'm talking about the fact that kvm_vcpu_map is busted with respect to
VMA updates. It won't corrupt host memory because the mapping keeps a
reference to the page, but it will continue to use stale translations.
From [1], it sounds like you've granted that fixing that is not
feasible, so I just wanted to make sure that this isn't the "unsafe"
referred to by allow_unsafe_mappings.

[1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@xxxxxxxxxx/
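
For contrast, the fault paths that set guarded_by_mmu_notifier do get
property 3, because they retry when the translation is invalidated.
Roughly (a simplified sketch modeled on x86's page fault path):

	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* No reference needed; usage is guarded by the notifier. */
	pfn = __kvm_follow_pfn(&foll);

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_hva(kvm, mmu_seq, foll.hva)) {
		write_unlock(&kvm->mmu_lock);
		goto retry;	/* hva -> pfn went stale, refault */
	}
	/* ... install pfn into the page tables ... */
	write_unlock(&kvm->mmu_lock);

kvm_vcpu_map has no equivalent retry, which is the staleness I'm
referring to.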

> > > As for the other usage of FOLL_GET in this series (using it to conditionally do
> > > put_page()), IMO that's very much related to the GUP call. Invoking put_page()
> > > is a hack to workaround the fact that GUP doesn't provide a way to get the pfn
> > > without grabbing a reference to the page. In an ideal world, KVM would NOT pass
> > > FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
> > > wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.
> > >
> > > I do think it's worth providing a helper to consolidate and document that hacky
> > > code, e.g. add a kvm_follow_refcounted_pfn() helper.
> > >
> > > All in all, I think the below (completely untested) is what we want?
> > >
> > > David (and others), I am planning on doing a full review of this series "soon",
> > > but it will likely be a few weeks until that happens. I jumped in on this
> > > specific thread because this caught my eye and I really don't want to throw out
> > > *all* of the FOLL_GET usage.
> > >
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 5b5afd70f239..90d424990e0a 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > >         return rc == -EHWPOISON;
> > >  }
> > >
> > > +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> > > +                                          struct page *page)
> > > +{
> > > +       kvm_pfn_t pfn = page_to_pfn(page);
> > > +
> > > +       foll->is_refcounted_page = true;
> > > +
> > > +       /*
> > > +        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
> > > +        * doesn't want to grab a reference, but gup() doesn't support getting
> > > +        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
> > > +        * changes, drop this and simply don't pass FOLL_GET to gup().
> > > +        */
> > > +       if (!(foll->flags & FOLL_GET))
> > > +               put_page(page);
> > > +
> > > +       return pfn;
> > > +}
> > > +
> > >  /*
> > >   * The fast path to get the writable pfn which will be stored in @pfn,
> > >   * true indicates success, otherwise false is returned. It's also the
> > > @@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >                 return false;
> > >
> > >         if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > > -               *pfn = page_to_pfn(page[0]);
> > >                 foll->writable = foll->allow_write_mapping;
> > > -               foll->is_refcounted_page = true;
> > > -               if (!(foll->flags & FOLL_GET))
> > > -                       put_page(page[0]);
> > > +
> > > +               *pfn = kvm_follow_refcounted_pfn(foll, page[0]);
> > >                 return true;
> > >         }
> > >
> > > @@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >                 return npages;
> > >
> > >         foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > -       foll->is_refcounted_page = true;
> > >
> > >         /* map read fault as writable if possible */
> > >         if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > @@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >                         page = wpage;
> > >                 }
> > >         }
> > > -       *pfn = page_to_pfn(page);
> > > -       if (!(foll->flags & FOLL_GET))
> > > -               put_page(page);
> > > +
> > > +       *pfn = kvm_follow_refcounted_pfn(foll, page);
> > >         return npages;
> > >  }
> > >
> > > @@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
> > >         if (!page)
> > >                 goto out;
> > >
> > > -       if (get_page_unless_zero(page)) {
> > > -               foll->is_refcounted_page = true;
> > > -               if (!(foll->flags & FOLL_GET))
> > > -                       put_page(page);
> > > -       } else if (foll->flags & FOLL_GET) {
> > > -               r = -EFAULT;
> > > -       }
> > > -
> > > +       if (get_page_unless_zero(page))
> > > +               WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
> > >  out:
> > >         pte_unmap_unlock(ptep, ptl);
> > > -       *p_pfn = pfn;
> > > +
> > > +       if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
> > > +           !allow_unsafe_mappings)
> > > +               r = -EFAULT;
> > > +       else
> > > +               *p_pfn = pfn;
> > >
> > >         return r;
> > >  }
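
The corruption I'm referring to below would look roughly like this
(sketch; the release path is paraphrased from kvm_release_pfn_clean):

	/*
	 * FOLL_GET && !guarded_by_mmu_notifier, non-refcounted page,
	 * let through by allow_unsafe_mappings: no reference is taken.
	 */
	pfn = __kvm_follow_pfn(&foll);
	...
	/* A legacy caller that doesn't check foll.is_refcounted_page: */
	page = kvm_pfn_to_refcounted_page(pfn);
	if (page)
		put_page(page);	/* drops a reference that was never taken */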

> > As I pointed out above, this suggestion is broken because a FOLL_GET
> > && !guarded_by_mmu_notifier request (e.g. kvm_vcpu_map) for a
> > non-refcounted page will result in the refcount eventually being
> > corrupted.
>
> I don't think so, unless I'm misunderstanding the concern. It just wasn't a
> complete patch, and wasn't intended to be.

I guess I misunderstood the scale of the changes you're suggesting.

-David