This series adds restricted mmap() support to guest_memfd [1], as well as support guest_memfd on pKVM/arm64. This series is based on Linux 6.8-rc4 + our pKVM core series [2]. The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but the remainder (patches 7-26) require the pKVM core series. A git repo with this series applied can be found here [3]. We have a (WIP) kvmtool port capable of running the code in this series [4]. For a technical deep dive into pKVM, please refer to Quentin Perret's KVM Forum Presentation [5, 6]. I've covered some of the issues presented here in my LPC 2023 presentation [7]. We haven't started using this in Android yet, but we aim to move away from anonymous memory to guest_memfd once we have the necessary support merged upstream. Others (e.g., Gunyah [8]) are also looking into guest_memfd for similar reasons as us. By design, guest_memfd cannot be mapped, read, or written by the host userspace. In pKVM, memory shared between a protected guest and the host is shared in-place, unlike the other confidential computing solutions that guest_memfd was originally envisaged for (e.g, TDX). When initializing a guest, as well as when accessing memory shared by the guest to the host, it would be useful to support mapping that memory at the host to avoid copying its contents. One of the benefits of guest_memfd is that it prevents a misbehaving host process from crashing the system when attempting to access (deliberately or accidentally) protected guest memory, since this memory isn't mapped to begin with. Without guest_memfd, the hypervisor would still prevent such accesses, but in certain cases the host kernel wouldn't be able to recover, causing the system to crash. Support for mmap() in this patch series maintains the invariant that only memory shared with the host, either explicitly by the guest or implicitly before the guest has started running (in order to populate its memory) is allowed to be mapped. At no time should private memory be mapped at the host. This patch series is divided into two parts: The first part is to the KVM core code (patches 1-6), and is based on guest_memfd as of Linux 6.8-rc4. It adds opt-in support for mapping guest memory only as long as it is shared. For that, the host needs to know the sharing status of guest memory. Therefore, the series adds a new KVM memory attribute, accessible only by the host kernel, that specifies whether the memory is allowed to be mapped by the host userspace. The second part of the series (patches 7-26) adds guest_memfd support for pKVM/arm64, and is based on the latest version of our pKVM series [2]. It uses guest_memfd instead of the current approach in Android (not upstreamed) of maintaining a long-term GUP on anonymous memory donated to the guest. These patches handle faulting in guest memory for a guest, as well as handling sharing and unsharing of guest memory while maintaining the invariant mentioned earlier. In addition to general feedback, we would like feedback on how we handle mmap() and faulting-in guest pages at the host (KVM: Add restricted support for mapping guest_memfd by the host). We don't enforce the invariant that only memory shared with the host can be mapped by the host userspace in file_operations::mmap(), but in vm_operations_struct:fault(). On vm_operations_struct::fault(), we check whether the page is shared with the host. If not, we deliver a SIGBUS to the current task. The reason for enforcing this at fault() is that mmap() does not elevate the pagecount(); it's the faulting in of the page which does. Even if we were to check at mmap() whether an address can be mapped, we would still need to check again on fault(), since between mmap() and fault() the status of the page can change. This creates the situation where access to successfully mmap()'d memory might SIGBUS at page fault. There is precedence for similar behavior in the kernel I believe, with MADV_HWPOISON and the hugetlbfs cgroups controller, which could SIGBUS at page fault time depending on the accounting limit. Another pKVM specific aspect we would like feedback on, is how to handle memory mapped by the host being unshared by a guest. The approach we've taken is that on an unshare call from the guest, the host userspace is notified that the memory has been unshared, in order to allow it to unmap it and mark it as PRIVATE as acknowledgment. If the host does not unmap the memory, the unshare call issued by the guest fails, which the guest is informed about on return. Cheers, /fuad [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@xxxxxxxxxx/ [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 [5] Protected KVM on arm64 (slides) https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf [6] Protected KVM on arm64 (video) https://www.youtube.com/watch?v=9npebeVFbFw [7] Supporting guest private memory in Protected KVM on Android (presentation) https://lpc.events/event/17/contributions/1487/ [8] Drivers for Gunyah (patch series) https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@xxxxxxxxxxx/ Fuad Tabba (20): KVM: Split KVM memory attributes into user and kernel attributes KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock KVM: Add restricted support for mapping guestmem by the host KVM: Don't allow private attribute to be set if mapped by host KVM: Don't allow private attribute to be removed for unmappable memory KVM: Implement kvm_(read|/write)_guest_page for private memory slots KVM: arm64: Create hypercall return handler KVM: arm64: Refactor code around handling return from host to guest KVM: arm64: Rename kvm_pinned_page to kvm_guest_page KVM: arm64: Add a field to indicate whether the guest page was pinned KVM: arm64: Do not allow changes to private memory slots KVM: arm64: Skip VMA checks for slots without userspace address KVM: arm64: Handle guest_memfd()-backed guest page faults KVM: arm64: Track sharing of memory from protected guest to host KVM: arm64: Mark a protected VM's memory as unmappable at initialization KVM: arm64: Handle unshare on way back to guest entry rather than exit KVM: arm64: Check that host unmaps memory unshared by guest KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes() KVM: arm64: Enable private memory support when pKVM is enabled KVM: arm64: Enable private memory kconfig for arm64 Keir Fraser (3): KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall Quentin Perret (1): KVM: arm64: Turn llist of pinned pages into an rb-tree Will Deacon (2): KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications arch/arm64/include/asm/kvm_host.h | 17 +- arch/arm64/include/asm/kvm_pkvm.h | 1 + arch/arm64/kvm/Kconfig | 2 + arch/arm64/kvm/arm.c | 32 ++- arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 + arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 1 + arch/arm64/kvm/hyp/nvhe/hyp-main.c | 24 +- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 67 +++++ arch/arm64/kvm/hyp/nvhe/pkvm.c | 89 +++++- arch/arm64/kvm/hyp/nvhe/switch.c | 1 + arch/arm64/kvm/hypercalls.c | 117 +++++++- arch/arm64/kvm/mmu.c | 138 +++++++++- arch/arm64/kvm/pkvm.c | 83 +++++- include/linux/arm-smccc.h | 7 + include/linux/kvm_host.h | 34 +++ include/uapi/linux/kvm.h | 4 + virt/kvm/Kconfig | 4 + virt/kvm/guest_memfd.c | 89 +++++- virt/kvm/kvm_main.c | 260 ++++++++++++++++-- 19 files changed, 904 insertions(+), 68 deletions(-) -- 2.44.0.rc1.240.g4c46232300-goog