Re: [PATCH RFC 4/4] mm: guest_memfd: Add ability for mmap'ing pages

Elliot Berman <quic_eberman@xxxxxxxxxxx> · Thu, 8 Aug 2024 14:42:57 -0700

On Thu, Aug 08, 2024 at 06:51:14PM +0000, Ackerley Tng wrote:
> Elliot Berman <quic_eberman@xxxxxxxxxxx> writes:
> 
> > Confidential/protected guest virtual machines want to share some memory
> > back with the host Linux. For example, virtqueues allow host and
> > protected guest to exchange data. In MMU-only isolation of protected
> > guest virtual machines, the transition between "shared" and "private"
> > can be done in-place without a trusted hypervisor copying pages.
> >
> > Add support for this feature and allow Linux to mmap host-accessible
> > pages. When the owner provides an ->accessible() callback in the
> > struct guest_memfd_operations, guest_memfd allows folios to be mapped
> > when the ->accessible() callback returns 0.
> >
> > To safely make inaccessible:
> >
> > ```
> > folio = guest_memfd_grab_folio(inode, index, flags);
> > r = guest_memfd_make_inaccessible(inode, folio);
> > if (r)
> >         goto err;
> >
> > hypervisor_does_guest_mapping(folio);
> >
> > folio_unlock(folio);
> > ```
> >
> > hypervisor_does_s2_mapping(folio) should make it so
> > ops->accessible(...) on those folios fails.
> >
> > The folio lock ensures atomicity.
> 
> I am also working on determining faultability not based on the
> private-ness of the page but based on permission given by the
> guest. I'd like to learn from what you've discovered here.
> 
> Could you please elaborate on this? What races is the folio_lock
> intended to prevent, what operations are we ensuring atomicity of?

The contention I've been paying most attention to are racing userspace
and vcpu faults where guest needs the page to be private. There could
also be multiple vcpus demanding same page.

We had some chatter about doing the private->shared conversion via
separate ioctl (mem attributes). I think the same race can happen with
userspace whether it's vcpu fault or ioctl making the folio "finally
guest-private".

Also, in non-CoCo KVM private guest_memfd, KVM or userspace could also
convert private->shared and need to make sure that all the tracking for
the current state is consistent.

> Is this why you did a guest_memfd_grab_folio() before checking
> ->accessible(), and then doing folio_unlock() if the page is
> inaccessible?
> 

Right, I want to guard against userspace being able to fault in a page
concurrently with that same page doing a shared->private conversion. The
folio_lock seems like the best fine-grained lock to grab.

If the shared->private converter wins the folio_lock first, then the
userspace fault waits and will see ->accessible() == false as desired. 

If userspace fault wins the folio_lock first, it relinquishes the lock
only after installing the folio in page tables[*]. When the
shared->private converter finally gets the lock,
guest_memfd_make_inaccessible() will be able to unmap the folio from any
userspace page tables (and direct map, if applicable).

[*]: I'm not mm expert, but that was what I could find when I went
digging.

Thanks,
Elliot

> >
> > Signed-off-by: Elliot Berman <quic_eberman@xxxxxxxxxxx>
> > ---
> >  include/linux/guest_memfd.h |  7 ++++
> >  mm/guest_memfd.c            | 81 ++++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 87 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h
> > index f9e4a27aed67..edcb4ba60cb0 100644
> > --- a/include/linux/guest_memfd.h
> > +++ b/include/linux/guest_memfd.h
> > @@ -16,12 +16,18 @@
> >   * @invalidate_end: called after invalidate_begin returns success. Optional.
> >   * @prepare: called before a folio is mapped into the guest address space.
> >   *           Optional.
> > + * @accessible: called after prepare returns success and before it's mapped
> > + *              into the guest address space. Returns 0 if the folio can be
> > + *              accessed.
> > + *              Optional. If not present, assumes folios are never accessible.
> >   * @release: Called when releasing the guest_memfd file. Required.
> >   */
> >  struct guest_memfd_operations {
> >  	int (*invalidate_begin)(struct inode *inode, pgoff_t offset, unsigned long nr);
> >  	void (*invalidate_end)(struct inode *inode, pgoff_t offset, unsigned long nr);
> >  	int (*prepare)(struct inode *inode, pgoff_t offset, struct folio *folio);
> > +	int (*accessible)(struct inode *inode, struct folio *folio,
> > +			  pgoff_t offset, unsigned long nr);
> >  	int (*release)(struct inode *inode);
> >  };
> >  
> > @@ -48,5 +54,6 @@ struct file *guest_memfd_alloc(const char *name,
> >  			       const struct guest_memfd_operations *ops,
> >  			       loff_t size, unsigned long flags);
> >  bool is_guest_memfd(struct file *file, const struct guest_memfd_operations *ops);
> > +int guest_memfd_make_inaccessible(struct file *file, struct folio *folio);
> >  
> >  #endif
> > diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c
> > index e9d8cab72b28..6b5609932ca5 100644
> > --- a/mm/guest_memfd.c
> > +++ b/mm/guest_memfd.c
> > @@ -9,6 +9,8 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/set_memory.h>
> >  
> > +#include "internal.h"
> > +
> >  static inline int guest_memfd_folio_private(struct folio *folio)
> >  {
> >  	unsigned long nr_pages = folio_nr_pages(folio);
> > @@ -89,7 +91,7 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
> >  			goto out_err;
> >  	}
> >  
> > -	if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
> > +	if (!ops->accessible && (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)) {
> >  		r = guest_memfd_folio_private(folio);
> >  		if (r)
> >  			goto out_err;
> > @@ -107,6 +109,82 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
> >  }
> >  EXPORT_SYMBOL_GPL(guest_memfd_grab_folio);
> >  
> > +int guest_memfd_make_inaccessible(struct file *file, struct folio *folio)
> > +{
> > +	unsigned long gmem_flags = (unsigned long)file->private_data;
> > +	unsigned long i;
> > +	int r;
> > +
> > +	unmap_mapping_folio(folio);
> > +
> > +	/**
> > +	 * We can't use the refcount. It might be elevated due to
> > +	 * guest/vcpu trying to access same folio as another vcpu
> > +	 * or because userspace is trying to access folio for same reason
> > +	 *
> > +	 * folio_lock serializes the transitions between (in)accessible
> > +	 */
> > +	if (folio_maybe_dma_pinned(folio))
> > +		return -EBUSY;
> > +
> > +	if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
> > +		r = guest_memfd_folio_private(folio);
> > +		if (r)
> > +			return r;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static vm_fault_t gmem_fault(struct vm_fault *vmf)
> > +{
> > +	struct file *file = vmf->vma->vm_file;
> > +	struct inode *inode = file_inode(file);
> > +	const struct guest_memfd_operations *ops = inode->i_private;
> > +	struct folio *folio;
> > +	pgoff_t off;
> > +	int r;
> > +
> > +	folio = guest_memfd_grab_folio(file, vmf->pgoff, GUEST_MEMFD_GRAB_UPTODATE);
> 
> Could grabbing the folio with GUEST_MEMFD_GRAB_UPTODATE cause unintended
> zeroing of the page if the page turns out to be inaccessible?
> 

I assume that if page is inaccessible, it would already have been marked
up to date and we wouldn't try to zero the page.

I'm thinking that if hypervisor zeroes the page when making the page
private, it would not give the GUEST_MEMFD_GRAB_UPTODATE flag when
grabbing the folio. I believe the hypervisor should know when grabbing
the folio if it's about to donate to the guest.

Thanks,
Elliot