Re: [PATCH RFC 3/4] mm: guest_memfd: Add option to remove guest private memory from direct map

Elliot Berman <quic_eberman@xxxxxxxxxxx> · Thu, 8 Aug 2024 15:16:09 -0700

On Thu, Aug 08, 2024 at 02:05:55PM +0100, Patrick Roy wrote:
> On Wed, 2024-08-07 at 20:06 +0100, Elliot Berman wrote:
> >>>>>>  struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
> >>>>>>  {
> >>>>>> +       unsigned long gmem_flags = (unsigned long)file->private_data;
> >>>>>>         struct inode *inode = file_inode(file);
> >>>>>>         struct guest_memfd_operations *ops = inode->i_private;
> >>>>>>         struct folio *folio;
> >>>>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
> >>>>>>                         goto out_err;
> >>>>>>         }
> >>>>>>
> >>>>>> +       if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
> >>>>>> +               r = guest_memfd_folio_private(folio);
> >>>>>> +               if (r)
> >>>>>> +                       goto out_err;
> >>>>>> +       }
> >>>>>> +
> >>>>>
> >>>>> How does a caller of guest_memfd_grab_folio know whether a folio needs
> >>>>> to be removed from the direct map? E.g. how can a caller know ahead of
> >>>>> time whether guest_memfd_grab_folio will return a freshly allocated
> >>>>> folio (which thus needs to be removed from the direct map), vs a folio
> >>>>> that already exists and has been removed from the direct map (probably
> >>>>> fine to remove from direct map again), vs a folio that already exists
> >>>>> and is currently re-inserted into the direct map for whatever reason
> >>>>> (must not remove these from the direct map, as other parts of
> >>>>> KVM/userspace probably don't expect the direct map entries to disappear
> >>>>> from underneath them). I couldn't figure this one out for my series,
> >>>>> which is why I went with hooking into the PG_uptodate logic to always
> >>>>> remove direct map entries on freshly allocated folios.
> >>>>>
> >>>>
> >>>> gmem_flags come from the owner. If the caller (in non-CoCo case) wants
> >>
> >> Ah, oops, I got it mixed up with the new `flags` parameter.
> >>
> >>>> to restore the direct map right away, it'd have to be a direct
> >>>> operation. As an optimization, we could add option that asks for page in
> >>>> "shared" state. If allocating new page, we can return it right away
> >>>> without removing from direct map. If grabbing existing folio, it would
> >>>> try to do the private->shared conversion.
> >>
> >> My concern is more with the implicit shared->private conversion that
> >> happens on every call to guest_memfd_grab_folio (and thus
> >> kvm_gmem_get_pfn) when grabbing existing folios. If something else
> >> marked the folio as shared, then we cannot punch it out of the direct
> >> map again until that something is done using the folio (when working on
> >> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were
> >> temporarily marked shared, as I was seeing panics because of this). And
> >> if the folio is currently private, there's nothing to do. So either way,
> >> guest_memfd_grab_folio shouldn't touch the direct map entry for existing
> >> folios.
> >>
> >
> > What I did could be documented/commented better.
> 
> No worries, thanks for taking the time to walk me through understanding
> it!
> 
> > If ops->accessible() is *not* provided, all guest_memfd allocations will
> > immediately remove from direct map and treat them immediately like guest
> > private (goal is to match what KVM does today on tip).
> 
> Ah, so if ops->accessible() is not provided, then there will never be
> any shared memory inside gmem (like today, where gmem doesn't support
> shared memory altogether), and thus there's no problems with just
> unconditionally doing set_direct_map_invalid_noflush in
> guest_memfd_grab_folio, because all existing folios already have their
> direct map entry removed. Got it!
> 
> > If ops->accessible() is provided, then guest_memfd allocations start
> > as "shared" and KVM/Gunyah need to do the shared->private conversion
> > when they want to do the private conversion on the folio. "Shared" is
> > the default because that is effectively a no-op.
> > For the non-CoCo case you're interested in, we'd have the
> > ops->accessible() provided and we wouldn't pull out the direct map from
> > gpc.
> 
> So in pKVM/Gunyah's case, guest memory starts as shared, and at some
> point the guest will issue a hypercall (or similar) to flip it to
> private, at which point it'll get removed from the direct map?
> 
> That isn't really what we want for our case. We consider the folios as
> private straight away, as we do not let the guest control their state at
> all. Everything is always "accessible" to both KVM and userspace in the
> sense that they can just flip gfns to shared as they please without the
> guest having any say in it.
> 
> I think we should untangle the behavior of guest_memfd_grab_folio from
> the presence of ops->accessible. E.g.  instead of direct map removal
> being dependent on ops->accessible we should have some
> GRAB_FOLIO_RETURN_SHARED flag for gmem_flags, which is set for y'all,
> and not set for us (I don't think we should have a "call
> set_direct_map_invalid_noflush unconditionally in
> guest_memfd_grab_folio" mode at all, because if sharing gmem is
> supported, then that is broken, and if sharing gmem is not supported
> then only removing direct map entries for freshly allocated folios gets
> us the same result of "all folios never in the direct map" while
> avoiding some no-op direct map operations).
> 
> Because we would still use ->accessible, albeit for us that would be
> more for bookkeeping along the lines of "which gfns does userspace
> currently require to be in the direct map?". I haven't completely
> thought it through, but what I could see working for us would be a pair
> of ioctls for marking ranges accessible/inaccessible, with
> "accessibility" stored in some xarray (somewhat like Fuad's patches, I
> guess? [1]).
> 
> In a world where we have a "sharing refcount", the "make accessible"
> ioctl reinserts into the direct map (if needed), lifts the "sharings
> refcount" for each folio in the given gfn range, and marks the range as
> accessible.  And the "make inaccessible" ioctl would first check that
> userspace has unmapped all those gfns again, and if yes, mark them as
> inaccessible, drop the "sharings refcount" by 1 for each, and removes
> from the direct map again if it held the last reference (if userspace
> still has some gfns mapped, the ioctl would just fail).
> 

I am warming up to the sharing refcount idea. How does the sharing
refcount look for kvm gpc?

> I guess for pKVM/Gunyah, there wouldn't be userspace ioctls, but instead
> the above would happen in handlers for share/unshare hypercalls. But the
> overall flow would be similar. The only difference is the default state
> of guest memory (shared for you, private for us). You want a
> guest_memfd_grab_folio that essentially returns folios with "sharing
> refcount == 1" (and thus present in the direct map), while we want the
> opposite.
> 
> So I think something like the following should work for both of us
> (modulo some error handling):
> 
> static struct folio *__kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare, bool *fresh)
> {
>     // as today's kvm_gmem_get_folio, except
>     ...
>     if (!folio_test_uptodate(folio)) {
>         ...
>         if (fresh)
>             *fresh = true
>     }
>     ...
> }
> 
> struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare)
> {
>     bool fresh;
>     unsigned long gmem_flags = /* ... */
>     struct folio *folio = __kvm_gmem_get_folio(file, index, prepare, &fresh);
>     if (gmem_flag & GRAB_FOLIO_RETURN_SHARED != 0) {
>         // if "sharing refcount == 0", inserts back into direct map and lifts refcount, otherwise just lifts refcount
>         guest_memfd_folio_clear_private(folio);
>     } else {
>         if (fresh)
>             guest_memfd_folio_private(folio);
>     }
>     return folio;
> }
> 
> Now, thinking ahead, there's probably optimizations here where we defer
> the direct map manipulations to gmem_fault, at which point having a
> guest_memfd_grab_folio that doesn't remove direct map entries for fresh
> folios would be useful in our non-CoCo usecase too. But that should also
> be easily achievable by maybe having a flag to kvm_gmem_get_folio that
> forces the behavior of GRAB_FOLIO_RETURN_SHARED, indendently of whether
> GRAB_FOLIO_RETURN_SHARED is set in gmem_flags.
> 
> How does that sound to you?
> 

Yeah, I think this is a good idea.

I'm also thinking to make a few tweaks to the ops structure:

struct guest_memfd_operations {
        int (*invalidate_begin)(struct inode *inode, pgoff_t offset, unsigned long nr);
        void (*invalidate_end)(struct inode *inode, pgoff_t offset, unsigned long nr);
        int (*prepare_accessible)(struct inode *inode, struct folio *folio);
        int (*prepare_private)(struct inode *inode, struct folio *folio);
        int (*release)(struct inode *inode);
};

When grabbing a folio, we'd always call either prepare_accessible() or
prepare_private() based on GRAB_FOLIO_RETURN_SHARED. In the
prepare_private() case, guest_memfd can also ensure the folio is
unmapped and not pinned. If userspace tries to grab the folio in
pKVM/Gunyah case, prepare_accessible() will fail and grab_folio returns
error. There's a lot of details I'm glossing over, but I hope it gives
some brief idea of the direction I was thinking.

In some cases, prepare_accessible() and the invalidate_*() functions
might effectively be the same thing, except that invalidate_*() could
operate on a range larger-than-a-folio. That would be useful becase we
might offer optimization to reclaim a batch of pages versus e.g.
flushing caches every page.

Thanks,
Elliot