Vishal Annapurve <vannapurve@xxxxxxxxxx> writes:

> On Wed, Feb 5, 2025 at 9:39 AM Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:
>>
>> On Wed, Feb 5, 2025 at 2:07 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:
>> >
>> > Hi Vishal,
>> >
>> > On Wed, 5 Feb 2025 at 00:42, Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:
>> > >
>> > > On Fri, Jan 17, 2025 at 8:30 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:
>> > > >
>> > > > Before transitioning a guest_memfd folio to unshared, thereby
>> > > > disallowing access by the host and allowing the hypervisor to
>> > > > transition its view of the guest page as private, we need to be
>> > > > sure that the host doesn't have any references to the folio.
>> > > >
>> > > > This patch introduces a new type for guest_memfd folios, and uses
>> > > > that to register a callback that informs the guest_memfd
>> > > > subsystem when the last reference is dropped, therefore knowing
>> > > > that the host doesn't have any remaining references.
>> > > >
>> > > > Signed-off-by: Fuad Tabba <tabba@xxxxxxxxxx>
>> > > > ---
>> > > > The function kvm_slot_gmem_register_callback() isn't used in this
>> > > > series. It will be used later in code that performs unsharing of
>> > > > memory. I have tested it with pKVM, based on downstream code [*].
>> > > > It's included in this RFC since it demonstrates the plan to
>> > > > handle unsharing of private folios.
>> > > >
>> > > > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
>> > >
>> > > Should the invocation of kvm_slot_gmem_register_callback() happen in
>> > > the same critical block as setting the guest memfd range mappability
>> > > to NONE? Otherwise, conversion/truncation could race with registration
>> > > of the callback.
>> >
>> > I don't think it needs to, at least not as far as potential races are
>> > concerned.
>> > First, because kvm_slot_gmem_register_callback() grabs the
>> > mapping's invalidate_lock as well as the folio lock, and
>> > gmem_clear_mappable() grabs the mapping lock and the folio lock if a
>> > folio has been allocated before.
>>
>> I was hinting towards such a scenario:
>> Core1
>> Shared to private conversion
>> -> Results in mappability attributes
>> being set to NONE
>> ...
>> Trigger private to shared conversion/truncation for
>> ...
>> overlapping ranges
>> ...
>> kvm_slot_gmem_register_callback() on
>> the guest_memfd ranges converted
>> above (This will end up registering a callback
>> for guest_memfd ranges which possibly don't
>> carry *_MAPPABILITY_NONE)
>>
>
> Sorry for the format mess above.
>
> I was hinting towards such a scenario:
>
> Core1:
> Shared to private conversion -> results in mappability attributes
> being set to NONE
> ...
> Core2:
> Trigger private to shared conversion/truncation for overlapping ranges
> ...
> Core1:
> kvm_slot_gmem_register_callback() on the guest_memfd ranges converted
> above (this will end up registering a callback for guest_memfd ranges
> which possibly don't carry *_MAPPABILITY_NONE)
>

In my model (I'm working through internal processes to open source this),
I set up the folio_put() callback to be registered on truncation
regardless of mappability state.

The folio_put() callback has multiple purposes; see slide 5 of this
deck [1]:

1. Transitioning mappability from NONE to GUEST
2. Merging the folio if it is ready for merging
3. Keeping a subfolio around (even if refcount == 0) until the folio is
   ready for merging, or returning it to hugetlb

So it is okay, and in fact better, to have the callback registered:

1. Folios with mappability == NONE can be transitioned to GUEST
2. Folios with mappability == GUEST/ALL can be merged if the other
   subfolios are ready for merging
3. And no matter the mappability, if subfolios are not yet merged, they
   have to be kept around, even with refcount 0, until they are merged.
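To make the lifecycle concrete, here is a toy userspace model of purpose 1
above (Python; all names are hypothetical and this is in no way the kernel
code): a folio whose mappability was set to NONE transitions to GUEST only
when the registered callback fires on the last reference drop.

```python
from enum import Enum

class Mappability(Enum):
    NONE = 0   # neither host nor guest may map (transient, mid-conversion)
    GUEST = 1  # guest-private
    ALL = 2    # shared with the host

class Folio:
    """Toy model of a guest_memfd folio with a last-reference callback."""
    def __init__(self, mappability=Mappability.ALL):
        self.mappability = mappability
        self.refcount = 1
        self.callback_registered = False

    def register_callback(self):
        # Models kvm_slot_gmem_register_callback(): arm the folio_put()
        # notification (the kernel does this by setting a page type).
        self.callback_registered = True

    def get(self):
        self.refcount += 1

    def put(self):
        self.refcount -= 1
        if self.refcount == 0 and self.callback_registered:
            self.folio_put_callback()

    def folio_put_callback(self):
        # Purpose 1 from the thread: with no host references left, a
        # NONE folio can safely complete the transition to guest-private.
        if self.mappability is Mappability.NONE:
            self.mappability = Mappability.GUEST
        self.callback_registered = False

# Shared -> private conversion: mappability goes to NONE, the callback is
# registered, and the transition finishes only on the last reference drop.
folio = Folio(Mappability.ALL)
folio.get()                       # an extra host reference is outstanding
folio.mappability = Mappability.NONE
folio.register_callback()
folio.put()                       # one reference remains: no transition yet
assert folio.mappability is Mappability.NONE
folio.put()                       # last reference dropped: callback fires
assert folio.mappability is Mappability.GUEST
```

The point of the sketch is that the transition is driven by the refcount
reaching zero, not by the conversion request itself, which is why it is
harmless to have the callback armed for ranges that are no longer NONE.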
The model doesn't cover locking, so I'll have to code it up for real to
verify this, but for now I think we should take a mappability lock around
mappability reads/writes, and do any necessary callback (un)registration
while holding the lock. There's no concern of nested locking here, since
callback registration purely (un)sets PGTY_guest_memfd and does not
add/drop refcounts.

With callback registration locked together with mappability updates, the
refcounting and the folio_put() callback should keep guest_memfd in a
consistent state.

>> >
>> > Second, __gmem_register_callback() checks before returning whether all
>> > references have been dropped, and adjusts the mappability/shareability
>> > if needed.
>> >
>> > Cheers,
>> > /fuad

[1] https://lpc.events/event/18/contributions/1764/attachments/1409/3704/guest-memfd-1g-page-support-2025-02-06.pdf
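As an aside on the locking proposal above, the invariant it buys can be
shown with a toy userspace sketch (Python; hypothetical names, not the
kernel implementation): if mappability updates and callback
(un)registration happen under one lock, no interleaving can leave a
callback armed for a range that is not NONE.

```python
import threading

class GmemRange:
    """Toy model: one "mappability lock" serializes mappability updates
    with callback (un)registration, as proposed in the thread."""
    def __init__(self):
        self._lock = threading.Lock()
        self.mappability = "ALL"
        self.callback_registered = False

    def set_mappability(self, value):
        # Both fields change under the same lock, so a racing
        # conversion/truncation can never observe a callback registered
        # for a range whose mappability is no longer NONE.
        with self._lock:
            self.mappability = value
            self.callback_registered = (value == "NONE")

    def snapshot(self):
        with self._lock:
            return self.mappability, self.callback_registered

# Hammer the range from two threads doing opposing conversions; whichever
# update lands last, the invariant below holds.
r = GmemRange()
t1 = threading.Thread(target=lambda: [r.set_mappability("NONE") for _ in range(1000)])
t2 = threading.Thread(target=lambda: [r.set_mappability("ALL") for _ in range(1000)])
t1.start(); t2.start(); t1.join(); t2.join()
m, cb = r.snapshot()
assert cb == (m == "NONE")  # callback armed if and only if range is NONE
```

Without the shared lock, the two field updates could interleave between
cores and reproduce exactly the stale-callback scenario Vishal describes.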