Re: [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support

Fuad Tabba <tabba@xxxxxxxxxx> · Thu, 16 Jan 2025 09:19:09 +0000

Hi Ackerley,

On Thu, 16 Jan 2025 at 00:35, Ackerley Tng <ackerleytng@xxxxxxxxxx> wrote:
>
> Fuad Tabba <tabba@xxxxxxxxxx> writes:
>
> > Hi,
> >
> > As mentioned in the guest_memfd sync (2025-01-09), below is the state
> > diagram that uses the new states in this patch series, and how they
> > would interact with sharing/unsharing in pKVM:
> >
> > https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
>
> Thanks Fuad!
>
> I took a look at the state diagram [1] and the branch that this patch is
> on [2], and here's what I understand about the flow:
>
> 1. From state H in the state diagram, the guest can request to unshare a
>    page. When KVM handles this unsharing, KVM marks the folio
>    mappability as NONE (state J).
> 2. The transition from state J to state K or I is independent of KVM -
>    userspace has to do this unmapping
> 3. On the next vcpu_run() from userspace, continuing from userspace's
>    handling of the unshare request, guest_memfd will check and try to
>    register a callback if the folio's mappability is NONE. If the folio
>    is mapped, or if folio is not mapped but refcount is elevated for
>    whatever reason, vcpu_run() fails and exits to userspace. If folio is
>    not mapped and gmem holds the last refcount, set folio mappability to
>    GUEST.
>
> Here's one issue I see based on the above understanding:
>
> Registration of the folio_put() callback only happens if the VMM
> actually tries to do vcpu_run(). For 4K folios I think this is okay
> since the 4K folio can be freed via the transition state K -> state I,
> but for hugetlb folios that have been split for sharing with userspace,
> not getting a folio_put() callback means never putting the hugetlb folio
> together. Hence, relying on vcpu_run() to add the folio_put() callback
> leaves a way that hugetlb pages can be removed from the system.
>
> I think we should try and find a path forward that works for both 4K and
> hugetlb folios.

I agree, this could be an issue, but we could find other ways to
trigger the callback for huge folios. The important thing I was trying
to get to is how to have the callback and be able to register it.

> IIUC page._mapcount and page.page_type works as a union because
> page_type is only set for page types that are never mapped to userspace,
> like PGTY_slab, PGTY_offline, etc.

In the last guest_memfd sync, David Hildenbrand mentioned that that
would be a temporary restriction since the two structures would
eventually be decoupled, work being done by Matthew Wilcox I believe.

> Technically PGTY_guest_memfd is only set once the page can never be
> mapped to userspace, but PGTY_guest_memfd can only be set once mapcount
> reaches 0. Since mapcount is added in the faulting process, could gmem
> perhaps use some kind of .unmap/.unfault callback, so that gmem gets
> notified of all unmaps and will know for sure that the mapcount gets to
> 0?

I'm not sure if there is such a callback. If there were, I'm not sure
what that would buy us really. The main pain point is the refcount
going down to zero. The mapcount part is pretty straightforard and
likely to be only temporary as mentioned, i.e., when it get decoupled,
we could register the callback earlier and simplify the transition
altogether.

> Alternatively, I took a look at the folio_is_zone_device()
> implementation, and page.flags is used to identify the page's type. IIUC
> a ZONE_DEVICE page also falls in the intersection of needing a
> folio_put() callback and can be mapped to userspace. Could we use a
> similar approach, using page.flags to identify a page as a guest_memfd
> page? That way we don't need to know when unmapping happens, and will
> always be able to get a folio_put() callback.

Same as above, with this being temporary, adding a new page flag might
not be something that the rest of the community might be too excited
about :)

Thanks for your comments!
/fuad

> [1] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
> [2] https://android-kvm.googlesource.com/linux/+/764360863785ba16d974253a572c87abdd9fdf0b%5E%21/#F0
>
> > This patch series doesn't necessarily impose all these transitions,
> > many of them would be a matter of policy. This just happens to be the
> > current way I've done it with pKVM/arm64.
> >
> > Cheers,
> > /fuad
> >
> > On Fri, 13 Dec 2024 at 16:48, Fuad Tabba <tabba@xxxxxxxxxx> wrote:
> >>
> >> This series adds restricted mmap() support to guest_memfd, as
> >> well as support for guest_memfd on arm64. It is based on Linux
> >> 6.13-rc2.  Please refer to v3 for the context [1].
> >>
> >> Main changes since v3:
> >> - Added a new folio type for guestmem, used to register a
> >>   callback when a folio's reference count reaches 0 (Matthew
> >>   Wilcox, DavidH) [2]
> >> - Introduce new mappability states for folios, where a folio can
> >> be mappable by the host and the guest, only the guest, or by no
> >> one (transient state)
> >> - Rebased on Linux 6.13-rc2
> >> - Refactoring and tidying up
> >>
> >> Cheers,
> >> /fuad
> >>
> >> [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@xxxxxxxxxx/
> >> [2] https://lore.kernel.org/all/20241108162040.159038-1-tabba@xxxxxxxxxx/
> >>
> >> Ackerley Tng (2):
> >>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of
> >>     anonymous inodes
> >>   KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
> >>
> >> Fuad Tabba (12):
> >>   mm: Consolidate freeing of typed folios on final folio_put()
> >>   KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
> >>     the folio lock
> >>   KVM: guest_memfd: Folio mappability states and functions that manage
> >>     their transition
> >>   KVM: guest_memfd: Handle final folio_put() of guestmem pages
> >>   KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
> >>   KVM: guest_memfd: Add guest_memfd support to
> >>     kvm_(read|/write)_guest_page()
> >>   KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
> >>     mappable
> >>   KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
> >>     mappable
> >>   KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
> >>     allowed
> >>   KVM: arm64: Skip VMA checks for slots without userspace address
> >>   KVM: arm64: Handle guest_memfd()-backed guest page faults
> >>   KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
> >>
> >>  Documentation/virt/kvm/api.rst                |   4 +
> >>  arch/arm64/include/asm/kvm_host.h             |   3 +
> >>  arch/arm64/kvm/Kconfig                        |   1 +
> >>  arch/arm64/kvm/mmu.c                          | 119 +++-
> >>  include/linux/kvm_host.h                      |  75 +++
> >>  include/linux/page-flags.h                    |  22 +
> >>  include/uapi/linux/kvm.h                      |   2 +
> >>  include/uapi/linux/magic.h                    |   1 +
> >>  mm/debug.c                                    |   1 +
> >>  mm/swap.c                                     |  28 +-
> >>  tools/testing/selftests/kvm/Makefile          |   1 +
> >>  .../testing/selftests/kvm/guest_memfd_test.c  |  64 +-
> >>  virt/kvm/Kconfig                              |   4 +
> >>  virt/kvm/guest_memfd.c                        | 579 +++++++++++++++++-
> >>  virt/kvm/kvm_main.c                           | 229 ++++++-
> >>  15 files changed, 1074 insertions(+), 59 deletions(-)
> >>
> >>
> >> base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
> >> --
> >> 2.47.1.613.gc27f4b7a9f-goog
> >>