Re: [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Ackerley,

On Thu, 6 Feb 2025 at 03:14, Ackerley Tng <ackerleytng@xxxxxxxxxx> wrote:
>
> Fuad Tabba <tabba@xxxxxxxxxx> writes:
>
> > On Mon, 20 Jan 2025 at 10:30, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
> >>
> >> On Fri, Jan 17, 2025 at 04:29:51PM +0000, Fuad Tabba wrote:
> >> > +/*
> >> > + * Marks the range [start, end) as not mappable by the host. If the host doesn't
> >> > + * have any references to a particular folio, then that folio is marked as
> >> > + * mappable by the guest.
> >> > + *
> >> > + * However, if the host still has references to the folio, then the folio is
> >> > + * marked and not mappable by anyone. Marking it is not mappable allows it to
> >> > + * drain all references from the host, and to ensure that the hypervisor does
> >> > + * not transition the folio to private, since the host still might access it.
> >> > + *
> >> > + * Usually called when guest unshares memory with the host.
> >> > + */
> >> > +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> >> > +{
> >> > +     struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> > +     void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> > +     void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
> >> > +     pgoff_t i;
> >> > +     int r = 0;
> >> > +
> >> > +     filemap_invalidate_lock(inode->i_mapping);
> >> > +     for (i = start; i < end; i++) {
> >> > +             struct folio *folio;
> >> > +             int refcount = 0;
> >> > +
> >> > +             folio = filemap_lock_folio(inode->i_mapping, i);
> >> > +             if (!IS_ERR(folio)) {
> >> > +                     refcount = folio_ref_count(folio);
> >> > +             } else {
> >> > +                     r = PTR_ERR(folio);
> >> > +                     if (WARN_ON_ONCE(r != -ENOENT))
> >> > +                             break;
> >> > +
> >> > +                     folio = NULL;
> >> > +             }
> >> > +
> >> > +             /* +1 references are expected because of filemap_lock_folio(). */
> >> > +             if (folio && refcount > folio_nr_pages(folio) + 1) {
> >>
> >> Looks racy.
> >>
> >> What prevent anybody from obtaining a reference just after check?
> >>
> >> Lock on folio doesn't stop random filemap_get_entry() from elevating the
> >> refcount.
> >>
> >> folio_ref_freeze() might be required.
> >
> > I thought the folio lock would be sufficient, but you're right,
> > nothing prevents getting a reference after the check. I'll use a
> > folio_ref_freeze() when I respin.
> >
> > Thanks,
> > /fuad
> >
>
> Is it correct to say that the only non-racy check for refcounts is a
> check for refcount == 0?
>
> What do you think of this instead: If there exists a folio, don't check
> the refcount, just set mappability to NONE and register the callback
> (the folio should already have been unmapped, which leaves
> folio->page_type available for use), and then drop the filemap's
> refcounts. When the filemap's refcounts are dropped, in most cases (no
> transient refcounts) the callback will be hit and the callback can set
> mappability to GUEST.
>
> If there are transient refcounts, the folio will just be waiting
> for the refcounts to drop to 0, and that's when the callback will be hit
> and the mappability can be transitioned to GUEST.
>
> If there isn't a folio, then guest_memfd was requested to set
> mappability ahead of any folio allocation, and in that case
> transitioning to GUEST immediately is correct.

This seems to me to add additional complexity to the common case that
isn't needed for correctness, and would make things more difficult to
reason about. If we know that there aren't any mappings at the host
(mapcount == 0), and we know that the refcount has at one point
reached 0 after we have taken the folio lock, even if the refcount
gets (transiently) elevated, we know that no one at the host is
accessing the folio itself.

Keep in mind that the common case (in a well behaved system) is that
neither the mapcount nor the refcount are elevated, and both for
performance, and for understanding, I think that's what we should be
targeting. Unless of course I'm wrong, and there's a correctness issue
here.

Cheers,
/fuad
> >> > +                     /*
> >> > +                      * Outstanding references, the folio cannot be faulted
> >> > +                      * in by anyone until they're dropped.
> >> > +                      */
> >> > +                     r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
> >> > +             } else {
> >> > +                     /*
> >> > +                      * No outstanding references. Transition the folio to
> >> > +                      * guest mappable immediately.
> >> > +                      */
> >> > +                     r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
> >> > +             }
> >> > +
> >> > +             if (folio) {
> >> > +                     folio_unlock(folio);
> >> > +                     folio_put(folio);
> >> > +             }
> >> > +
> >> > +             if (WARN_ON_ONCE(r))
> >> > +                     break;
> >> > +     }
> >> > +     filemap_invalidate_unlock(inode->i_mapping);
> >> > +
> >> > +     return r;
> >> > +}
> >>
> >> --
> >>   Kiryl Shutsemau / Kirill A. Shutemov




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux