On Thu, 2024-08-08 at 23:16 +0100, Elliot Berman wrote:
> On Thu, Aug 08, 2024 at 02:05:55PM +0100, Patrick Roy wrote:
>> On Wed, 2024-08-07 at 20:06 +0100, Elliot Berman wrote:
>>>>>>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
>>>>>>>> {
>>>>>>>> +        unsigned long gmem_flags = (unsigned long)file->private_data;
>>>>>>>>          struct inode *inode = file_inode(file);
>>>>>>>>          struct guest_memfd_operations *ops = inode->i_private;
>>>>>>>>          struct folio *folio;
>>>>>>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
>>>>>>>>                  goto out_err;
>>>>>>>>          }
>>>>>>>>
>>>>>>>> +        if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
>>>>>>>> +                r = guest_memfd_folio_private(folio);
>>>>>>>> +                if (r)
>>>>>>>> +                        goto out_err;
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>
>>>>>>> How does a caller of guest_memfd_grab_folio know whether a folio needs
>>>>>>> to be removed from the direct map? E.g. how can a caller know ahead of
>>>>>>> time whether guest_memfd_grab_folio will return a freshly allocated
>>>>>>> folio (which thus needs to be removed from the direct map), vs. a folio
>>>>>>> that already exists and has been removed from the direct map (probably
>>>>>>> fine to remove from the direct map again), vs. a folio that already
>>>>>>> exists and is currently re-inserted into the direct map for whatever
>>>>>>> reason (these must not be removed from the direct map, as other parts
>>>>>>> of KVM/userspace probably don't expect the direct map entries to
>>>>>>> disappear from underneath them). I couldn't figure this one out for my
>>>>>>> series, which is why I went with hooking into the PG_uptodate logic to
>>>>>>> always remove direct map entries on freshly allocated folios.
>>>>>>>
>>>>>>
>>>>>> gmem_flags come from the owner. If the caller (in the non-CoCo case) wants
>>>>
>>>> Ah, oops, I got it mixed up with the new `flags` parameter.
>>>>
>>>>>> to restore the direct map right away, it'd have to be a direct
>>>>>> operation. As an optimization, we could add an option that asks for the
>>>>>> page in "shared" state. If allocating a new page, we can return it
>>>>>> right away without removing it from the direct map. If grabbing an
>>>>>> existing folio, it would try to do the private->shared conversion.
>>>>
>>>> My concern is more with the implicit shared->private conversion that
>>>> happens on every call to guest_memfd_grab_folio (and thus
>>>> kvm_gmem_get_pfn) when grabbing existing folios. If something else
>>>> marked the folio as shared, then we cannot punch it out of the direct
>>>> map again until that something is done using the folio (when working on
>>>> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were
>>>> temporarily marked shared; I was seeing panics because of this). And
>>>> if the folio is currently private, there's nothing to do. So either way,
>>>> guest_memfd_grab_folio shouldn't touch the direct map entry for existing
>>>> folios.
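
(To make the "hooking into the PG_uptodate logic" part above concrete: in
my RFC the grab path does something like the below - only a sketch,
assuming order-0 folios and borrowing the gmem_flags check from your
diff, not the exact code:)

        if (!folio_test_uptodate(folio)) {
                /* Freshly allocated folio: zero it before anything else. */
                clear_highpage(folio_page(folio, 0));

                /*
                 * Only here, while the folio is known to be fresh, do we
                 * punch out the direct map entry. Folios that are already
                 * uptodate are never touched, whatever their current
                 * shared/private state.
                 */
                if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
                        r = set_direct_map_invalid_noflush(folio_page(folio, 0));
                        if (r)
                                goto out_err;
                }

                folio_mark_uptodate(folio);
        }

That way the "existing folio that is currently shared" case never even
reaches the direct map code.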
>>>
>>> What I did could be documented/commented better.
>>
>> No worries, thanks for taking the time to walk me through understanding
>> it!
>>
>>> If ops->accessible() is *not* provided, all guest_memfd allocations
>>> will immediately be removed from the direct map and treated like guest
>>> private (the goal is to match what KVM does today on tip).
>>
>> Ah, so if ops->accessible() is not provided, then there will never be
>> any shared memory inside gmem (like today, where gmem doesn't support
>> shared memory altogether), and thus there are no problems with just
>> unconditionally doing set_direct_map_invalid_noflush in
>> guest_memfd_grab_folio, because all existing folios already have their
>> direct map entries removed. Got it!
>>
>>> If ops->accessible() is provided, then guest_memfd allocations start
>>> as "shared" and KVM/Gunyah need to do the shared->private conversion
>>> when they want the folio to become private. "Shared" is the default
>>> because that is effectively a no-op.
>>> For the non-CoCo case you're interested in, we'd have the
>>> ops->accessible() provided and we wouldn't pull out the direct map
>>> from the gpc.
>>
>> So in pKVM/Gunyah's case, guest memory starts as shared, and at some
>> point the guest will issue a hypercall (or similar) to flip it to
>> private, at which point it'll get removed from the direct map?
>>
>> That isn't really what we want for our case. We consider the folios
>> private straight away, as we do not let the guest control their state
>> at all. Everything is always "accessible" to both KVM and userspace in
>> the sense that they can just flip gfns to shared as they please without
>> the guest having any say in it.
>>
>> I think we should untangle the behavior of guest_memfd_grab_folio from
>> the presence of ops->accessible. E.g. instead of direct map removal
>> being dependent on ops->accessible, we should have some
>> GRAB_FOLIO_RETURN_SHARED flag in gmem_flags, which is set for y'all and
>> not set for us. (I don't think we should have a "call
>> set_direct_map_invalid_noflush unconditionally in
>> guest_memfd_grab_folio" mode at all: if sharing gmem is supported, that
>> is broken, and if sharing gmem is not supported, then only removing
>> direct map entries for freshly allocated folios gets us the same result
>> of "all folios never in the direct map" while avoiding some no-op
>> direct map operations.)
>>
>> We would still use ->accessible, albeit for us that would be more for
>> bookkeeping along the lines of "which gfns does userspace currently
>> require to be in the direct map?". I haven't completely thought it
>> through, but what I could see working for us would be a pair of ioctls
>> for marking ranges accessible/inaccessible, with "accessibility" stored
>> in some xarray (somewhat like Fuad's patches, I guess? [1]).
>>
>> In a world where we have a "sharing refcount", the "make accessible"
>> ioctl reinserts the folios into the direct map (if needed), lifts the
>> sharing refcount for each folio in the given gfn range, and marks the
>> range as accessible. The "make inaccessible" ioctl would first check
>> that userspace has unmapped all those gfns again, and if yes, mark them
>> as inaccessible, drop the sharing refcount by 1 for each, and remove
>> them from the direct map again if it held the last reference (if
>> userspace still has some gfns mapped, the ioctl would just fail).
>>
>
> I am warming up to the sharing refcount idea. How does the sharing
> refcount look for kvm gpc?

I've come up with the below rough draft (written as a new commit on top
of my RFC series [1], with some bits from your patch copied in). With
this, I was able to actually boot a Firecracker VM with multiple vCPUs
(which previously didn't work, because different vCPUs put their
kvm-clock structures into the same guest page).
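
To give the gist before the patch: the sharing refcount ends up as a
per-folio count, along the lines of the below (a hand-wavy sketch with
made-up helper names, locking elided, and the count assumed to be
stashed in folio->private when the folio is first grabbed; the actual
draft differs in the details):

static int gmem_folio_share(struct folio *folio)
{
        refcount_t *sharing_count = folio_get_private(folio);

        /* 0 -> 1 transition: reinsert the folio into the direct map. */
        if (!refcount_inc_not_zero(sharing_count)) {
                int r = set_direct_map_default_noflush(folio_page(folio, 0));

                if (r)
                        return r;
                refcount_set(sharing_count, 1);
        }

        return 0;
}

static int gmem_folio_unshare(struct folio *folio)
{
        refcount_t *sharing_count = folio_get_private(folio);

        /* 1 -> 0 transition: last sharer gone, remove from the direct map. */
        if (refcount_dec_and_test(sharing_count))
                return set_direct_map_invalid_noflush(folio_page(folio, 0));

        return 0;
}

For the gpc side, the idea would be that the gpc map path calls
gmem_folio_share() on gmem pfns and the unmap/invalidate path calls
gmem_folio_unshare(), with the accessible/inaccessible ioctls using the
same pair, so that direct map entries only ever flip on the 0<->1
transitions.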
Best,
Patrick

[1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@xxxxxxxxxxxx/T/#ma44793da6bc000a2c22b1ffe37292b9615881838

---