[RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support

This series adds restricted mmap() support to guest_memfd [1], as
well as support for guest_memfd on pKVM/arm64.

This series is based on Linux 6.8-rc4 + our pKVM core series [2].
The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but
the remainder (patches 7-26) require the pKVM core series. A git
repo with this series applied can be found here [3]. We have a
(WIP) kvmtool port capable of running the code in this series
[4]. For a technical deep dive into pKVM, please refer to Quentin
Perret's KVM Forum Presentation [5, 6].

I've covered some of the issues presented here in my LPC 2023
presentation [7].

We haven't started using this in Android yet, but we aim to move
away from anonymous memory to guest_memfd once we have the
necessary support merged upstream. Others (e.g., Gunyah [8]) are
also looking into guest_memfd for similar reasons.

By design, guest_memfd cannot be mapped, read, or written by the
host userspace. In pKVM, memory shared between a protected guest
and the host is shared in-place, unlike the other confidential
computing solutions that guest_memfd was originally envisaged for
(e.g., TDX). When initializing a guest, as well as when accessing
memory shared by the guest with the host, it would be useful to
support mapping that memory at the host to avoid copying its
contents.

One of the benefits of guest_memfd is that it prevents a
misbehaving host process from crashing the system when attempting
to access (deliberately or accidentally) protected guest memory,
since this memory isn't mapped to begin with. Without
guest_memfd, the hypervisor would still prevent such accesses,
but in certain cases the host kernel wouldn't be able to recover,
causing the system to crash.

Support for mmap() in this patch series maintains the invariant
that only memory shared with the host, either explicitly by the
guest or implicitly before the guest has started running (in
order to populate its memory), is allowed to be mapped. At no time
should private memory be mapped at the host.

This patch series is divided into two parts:

The first part applies to the KVM core code (patches 1-6), and is
based on guest_memfd as of Linux 6.8-rc4. It adds opt-in support
for mapping guest memory only as long as it is shared. For that,
the host needs to know the sharing status of guest memory.
Therefore, the series adds a new KVM memory attribute, accessible
only by the host kernel, that specifies whether the memory is
allowed to be mapped by the host userspace.
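
To make the split between user and kernel attributes concrete, here is
a small userspace model in C. It is only a sketch and not the code from
the patches: the MODEL_* bit values and the model_*() helpers are
hypothetical names invented for illustration. It shows one attribute
that userspace may set and one that only the kernel path may change,
with the mappability check reading the kernel-owned bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute bits for this model; not the series' real values. */
#define MODEL_ATTR_PRIVATE      (1ULL << 0) /* settable by host userspace */
#define MODEL_ATTR_NOT_MAPPABLE (1ULL << 1) /* settable by the host kernel only */

#define MODEL_USER_ATTRS   (MODEL_ATTR_PRIVATE)
#define MODEL_KERNEL_ATTRS (MODEL_ATTR_NOT_MAPPABLE)

/* Userspace path: reject any request that touches a kernel-only bit. */
int model_user_set_attrs(uint64_t *attrs, uint64_t requested)
{
	assert(attrs != NULL);
	if (requested & ~MODEL_USER_ATTRS)
		return -1;
	/* Replace the user-owned bits, leave kernel-owned bits untouched. */
	*attrs = (*attrs & ~MODEL_USER_ATTRS) | requested;
	return 0;
}

/* Kernel path: replace only the kernel-owned bits. */
void model_kernel_set_attrs(uint64_t *attrs, uint64_t requested)
{
	assert(attrs != NULL);
	*attrs = (*attrs & ~MODEL_KERNEL_ATTRS) |
		 (requested & MODEL_KERNEL_ATTRS);
}

/* The host may map the memory only while the kernel-only bit is clear. */
bool model_host_may_map(uint64_t attrs)
{
	return !(attrs & MODEL_ATTR_NOT_MAPPABLE);
}
```

In this model, a call such as model_user_set_attrs(&attrs,
MODEL_ATTR_NOT_MAPPABLE) fails, so userspace cannot make memory
mappable on its own; only model_kernel_set_attrs() can flip that bit.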

The second part of the series (patches 7-26) adds guest_memfd
support for pKVM/arm64, and is based on the latest version of our
pKVM series [2]. It uses guest_memfd instead of the current
approach in Android (not upstreamed) of maintaining a long-term
GUP on anonymous memory donated to the guest. These patches
handle faulting in guest memory for a guest, as well as handling
sharing and unsharing of guest memory while maintaining the
invariant mentioned earlier.

In addition to general feedback, we would like feedback on how we
handle mmap() and faulting in guest pages at the host (KVM: Add
restricted support for mapping guest_memfd by the host).

We don't enforce the invariant that only memory shared with the
host can be mapped by the host userspace in
file_operations::mmap(), but in vm_operations_struct::fault(). On
vm_operations_struct::fault(), we check whether the page is
shared with the host. If not, we deliver a SIGBUS to the current
task. The reason for enforcing this at fault() is that mmap()
does not elevate the page's reference count; it's the faulting in of the
page which does. Even if we were to check at mmap() whether an
address can be mapped, we would still need to check again on
fault(), since between mmap() and fault() the status of the page
can change.
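
The reasoning above can be sketched with a small userspace model in C.
This is not kernel code and not taken from the patches: struct
model_gmem, model_mmap() and model_fault() are hypothetical names, and
the "SIGBUS" here is just a return value. The point it illustrates is
that the mapping step checks nothing per page, while the fault step
checks the sharing status at the moment of access and is the only place
a reference is taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NPAGES 4

enum model_fault_result { MODEL_FAULT_OK, MODEL_FAULT_SIGBUS };

/* Hypothetical per-page state for a guest_memfd-like object. */
struct model_gmem {
	bool shared[MODEL_NPAGES];  /* true while shared with the host */
	int refcount[MODEL_NPAGES]; /* elevated only by a successful fault */
};

/* Mapping only validates the range; it does not look at sharing status. */
bool model_mmap(const struct model_gmem *g, size_t first, size_t npages)
{
	assert(g != NULL);
	return first <= MODEL_NPAGES && npages <= MODEL_NPAGES - first;
}

/* Fault: the sharing check happens here, at access time. */
enum model_fault_result model_fault(struct model_gmem *g, size_t idx)
{
	assert(g != NULL);
	if (idx >= MODEL_NPAGES || !g->shared[idx])
		return MODEL_FAULT_SIGBUS;
	g->refcount[idx]++;
	return MODEL_FAULT_OK;
}
```

With this model, a page that was shared when the range was mapped but
became private before the first access still yields
MODEL_FAULT_SIGBUS, which is why a check at mmap() time alone would not
be sufficient.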

This creates the situation where access to successfully mmap()'d
memory might SIGBUS at page fault. There is precedent for
similar behavior in the kernel, I believe, with MADV_HWPOISON and
the hugetlbfs cgroups controller, which could SIGBUS at page
fault time depending on the accounting limit.

Another pKVM-specific aspect we would like feedback on is how to
handle memory mapped by the host being unshared by a guest. The
approach we've taken is that on an unshare call from the guest,
the host userspace is notified that the memory has been unshared,
in order to allow it to unmap it and mark it as PRIVATE as
acknowledgment. If the host does not unmap the memory, the
unshare call issued by the guest fails, which the guest is
informed about on return.
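
The unshare flow described above can be modelled in a few lines of
userspace C. This is a sketch under stated assumptions rather than the
implementation in the series: the model_*() names, the callback-style
notification and the 0 / -1 return values are all hypothetical. It
captures the two rules from the text: the host can only mark memory
PRIVATE once it has unmapped it, and the guest's unshare call fails if
the host did not do so.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_SHARED, MODEL_PRIVATE };

/* Hypothetical state of one guest page as seen by the host. */
struct model_page {
	enum model_state state;
	bool host_mapped; /* true while host userspace has it mapped */
};

/* Stand-in for the notification delivered to host userspace. */
typedef void (*model_unshare_handler)(struct model_page *p);

/* Marking a page PRIVATE is refused while the host still maps it. */
int model_host_set_private(struct model_page *p)
{
	assert(p != NULL);
	if (p->host_mapped)
		return -1;
	p->state = MODEL_PRIVATE;
	return 0;
}

/* A host that unmaps first, then acknowledges by marking PRIVATE. */
void model_cooperative_host(struct model_page *p)
{
	p->host_mapped = false;
	(void)model_host_set_private(p);
}

/* A host that tries to acknowledge without unmapping; this is refused. */
void model_negligent_host(struct model_page *p)
{
	(void)model_host_set_private(p);
}

/* Guest-side unshare: returns 0 on success, -1 if the host did not comply. */
int model_guest_unshare(struct model_page *p, model_unshare_handler notify_host)
{
	assert(p != NULL && notify_host != NULL);
	if (p->state != MODEL_SHARED)
		return -1;
	notify_host(p);
	if (p->host_mapped || p->state != MODEL_PRIVATE)
		return -1;
	return 0;
}
```

Calling model_guest_unshare() with model_cooperative_host leaves the
page PRIVATE and unmapped and returns 0; with model_negligent_host the
page stays SHARED and mapped, and the guest sees -1.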


[1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@xxxxxxxxxx/

[2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core

[3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1

[4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8

[5] Protected KVM on arm64 (slides)

[6] Protected KVM on arm64 (video)

[7] Supporting guest private memory in Protected KVM on Android (presentation)

[8] Drivers for Gunyah (patch series)

Fuad Tabba (20):
  KVM: Split KVM memory attributes into user and kernel attributes
  KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock
  KVM: Add restricted support for mapping guestmem by the host
  KVM: Don't allow private attribute to be set if mapped by host
  KVM: Don't allow private attribute to be removed for unmappable memory
  KVM: Implement kvm_(read|/write)_guest_page for private memory slots
  KVM: arm64: Create hypercall return handler
  KVM: arm64: Refactor code around handling return from host to guest
  KVM: arm64: Rename kvm_pinned_page to kvm_guest_page
  KVM: arm64: Add a field to indicate whether the guest page was pinned
  KVM: arm64: Do not allow changes to private memory slots
  KVM: arm64: Skip VMA checks for slots without userspace address
  KVM: arm64: Handle guest_memfd()-backed guest page faults
  KVM: arm64: Track sharing of memory from protected guest to host
  KVM: arm64: Mark a protected VM's memory as unmappable at
  KVM: arm64: Handle unshare on way back to guest entry rather than exit
  KVM: arm64: Check that host unmaps memory unshared by guest
  KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes()
  KVM: arm64: Enable private memory support when pKVM is enabled
  KVM: arm64: Enable private memory kconfig for arm64

Keir Fraser (3):
  KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall
  KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall
  KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall

Quentin Perret (1):
  KVM: arm64: Turn llist of pinned pages into an rb-tree

Will Deacon (2):
  KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL
  KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications

 arch/arm64/include/asm/kvm_host.h             |  17 +-
 arch/arm64/include/asm/kvm_pkvm.h             |   1 +
 arch/arm64/kvm/Kconfig                        |   2 +
 arch/arm64/kvm/arm.c                          |  32 ++-
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
 arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |   1 +
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  24 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         |  67 +++++
 arch/arm64/kvm/hyp/nvhe/pkvm.c                |  89 +++++-
 arch/arm64/kvm/hyp/nvhe/switch.c              |   1 +
 arch/arm64/kvm/hypercalls.c                   | 117 +++++++-
 arch/arm64/kvm/mmu.c                          | 138 +++++++++-
 arch/arm64/kvm/pkvm.c                         |  83 +++++-
 include/linux/arm-smccc.h                     |   7 +
 include/linux/kvm_host.h                      |  34 +++
 include/uapi/linux/kvm.h                      |   4 +
 virt/kvm/Kconfig                              |   4 +
 virt/kvm/guest_memfd.c                        |  89 +++++-
 virt/kvm/kvm_main.c                           | 260 ++++++++++++++++--
 19 files changed, 904 insertions(+), 68 deletions(-)

