On Wed, Aug 07, 2024 at 11:57:35AM +0100, Patrick Roy wrote:
> On Wed, 2024-08-07 at 07:48 +0100, Patrick Roy wrote:
> > On Tue, 2024-08-06 at 21:13 +0100, Elliot Berman wrote:
> >> On Tue, Aug 06, 2024 at 04:39:24PM +0100, Patrick Roy wrote:
> >>> On the other hand, as Paolo pointed out in my patches [1], just using a
> >>> page flag to track direct map presence for gmem is not enough. We
> >>> actually need to keep a refcount in folio->private to keep track of how
> >>> many different actors request a folio's direct map presence (in the
> >>> specific case in my patch series, it was different gfn_to_pfn_caches for
> >>> the kvm-clock structures of different vcpus, which the guest can place
> >>> into the same gfn). While this might not be a concern for the
> >>> pKVM/Gunyah case, where the guest dictates memory state, it's required
> >>> for the non-CoCo case where KVM/userspace can set arbitrary guest gfns
> >>> to shared if it needs/wants to access them for whatever reason. So for
> >>> this we'd need to have PG_private=1 mean "direct map entry restored"
> >>> (since if PG_private=0, there is no folio->private).
> >>>
> >>> [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@xxxxxxxxxxxx/T/#m0608c4b6a069b3953d7ee97f48577d32688a3315
> >>>
> >>
> >> I wonder if we can use the folio refcount itself, assuming we can rely
> >> on refcount == 1 meaning we can do the shared->private conversion.
> >>
> >> In gpc_map_gmem, we convert private->shared. There's no problem here in
> >> the non-CoCo case.
> >>
> >> In gpc_unmap, we *try* to convert back from shared->private. If
> >> refcount > 2, then the conversion would fail. The last gpc_unmap would
> >> be able to successfully convert back to private.
> >>
> >> Do you see any concerns with this approach?
> >
> > The gfn_to_pfn_cache does not keep an elevated refcount on the cached
> > page, and instead responds to MMU notifiers to detect whether the cached
> > translation has been invalidated, iirc. So the folio refcount will
> > not reflect the number of gpcs holding that folio.
>

Ah, fair enough. This is kinda like a GUP pin which would prevent us
from making the page private, but without the pin part.
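To make the counting scheme described above a bit more concrete, a
minimal sketch could look like the following. This is not code from
either series: the helper names are made up, only order-0 folios are
handled, TLB flushing is omitted, and the caller is assumed to hold the
folio lock.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/set_memory.h>

/*
 * Sketch only: PG_private set means "direct map entries currently
 * restored", and folio->private counts how many users asked for that.
 */
static int gmem_folio_share(struct folio *folio)
{
	unsigned long refs;
	int r;

	if (!folio_test_private(folio)) {
		r = set_direct_map_default_noflush(folio_page(folio, 0));
		if (r)
			return r;
		/* sets PG_private and stores the initial count of 1 */
		folio_attach_private(folio, (void *)1UL);
		return 0;
	}

	refs = (unsigned long)folio_get_private(folio);
	folio_change_private(folio, (void *)(refs + 1));
	return 0;
}

static int gmem_folio_unshare(struct folio *folio)
{
	unsigned long refs = (unsigned long)folio_get_private(folio);

	if (refs > 1) {
		folio_change_private(folio, (void *)(refs - 1));
		return 0;
	}

	/* last user gone: clear PG_private and zap the direct map again */
	folio_detach_private(folio);
	return set_direct_map_invalid_noflush(folio_page(folio, 0));
}

The gpc map/unmap paths could then bump and drop this count instead of
relying on the folio refcount.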
[...]

> >>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
> >>>> {
> >>>> +	unsigned long gmem_flags = (unsigned long)file->private_data;
> >>>>  	struct inode *inode = file_inode(file);
> >>>>  	struct guest_memfd_operations *ops = inode->i_private;
> >>>>  	struct folio *folio;
> >>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
> >>>>  			goto out_err;
> >>>>  	}
> >>>>
> >>>> +	if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
> >>>> +		r = guest_memfd_folio_private(folio);
> >>>> +		if (r)
> >>>> +			goto out_err;
> >>>> +	}
> >>>> +
> >>>
> >>> How does a caller of guest_memfd_grab_folio know whether a folio needs
> >>> to be removed from the direct map? E.g. how can a caller know ahead of
> >>> time whether guest_memfd_grab_folio will return a freshly allocated
> >>> folio (which thus needs to be removed from the direct map), vs a folio
> >>> that already exists and has been removed from the direct map (probably
> >>> fine to remove from direct map again), vs a folio that already exists
> >>> and is currently re-inserted into the direct map for whatever reason
> >>> (must not remove these from the direct map, as other parts of
> >>> KVM/userspace probably don't expect the direct map entries to disappear
> >>> from underneath them). I couldn't figure this one out for my series,
> >>> which is why I went with hooking into the PG_uptodate logic to always
> >>> remove direct map entries on freshly allocated folios.
> >>>
> >>
> >> gmem_flags come from the owner. If the caller (in non-CoCo case) wants
>
> Ah, oops, I got it mixed up with the new `flags` parameter.
>
> >> to restore the direct map right away, it'd have to be a direct
> >> operation. As an optimization, we could add an option that asks for the
> >> page in "shared" state. If allocating a new page, we can return it right
> >> away without removing it from the direct map. If grabbing an existing
> >> folio, it would try to do the private->shared conversion.
>
> My concern is more with the implicit shared->private conversion that
> happens on every call to guest_memfd_grab_folio (and thus
> kvm_gmem_get_pfn) when grabbing existing folios. If something else
> marked the folio as shared, then we cannot punch it out of the direct
> map again until that something is done using the folio (when working on
> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were
> temporarily marked shared, as I was seeing panics because of this). And
> if the folio is currently private, there's nothing to do. So either way,
> guest_memfd_grab_folio shouldn't touch the direct map entry for existing
> folios.
>

What I did could be documented/commented better. If ops->accessible()
is *not* provided, all guest_memfd allocations are immediately removed
from the direct map and treated as guest-private (the goal is to match
what KVM does today on tip). If ops->accessible() is provided, then
guest_memfd allocations start as "shared", and KVM/Gunyah need to do
the shared->private conversion when they want the folio to become
private. "Shared" is the default because that is effectively a no-op.

For the non-CoCo case you're interested in, we'd have ops->accessible()
provided, and we wouldn't pull out the direct map from the gpc.

Thanks,
Elliot
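To illustrate the allocation-time policy described above, a rough,
untested sketch follows. gmem_init_new_folio() is a made-up name, while
guest_memfd_operations, ->accessible() and guest_memfd_folio_private()
are the names already used in this thread.

/*
 * Sketch: without ->accessible(), a freshly allocated folio is
 * immediately made guest-private by removing its direct map entries;
 * with ->accessible(), it starts out shared, which is a no-op here.
 */
static int gmem_init_new_folio(const struct guest_memfd_operations *ops,
			       struct folio *folio)
{
	/* owner implements ->accessible(): start shared, convert later */
	if (ops->accessible)
		return 0;

	/* no ->accessible(): match today's KVM behaviour, start private */
	return guest_memfd_folio_private(folio);
}

The explicit shared->private conversion for the ->accessible() case
would then happen outside guest_memfd_grab_folio(), so grabbing an
existing folio never touches its direct map entries.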