On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote: > This series adds restricted mmap() support to guest_memfd [1], as > well as support guest_memfd on pKVM/arm64. > > We haven't started using this in Android yet, but we aim to move > away from anonymous memory to guest_memfd once we have the > necessary support merged upstream. Others (e.g., Gunyah [8]) are > also looking into guest_memfd for similar reasons as us. I'm especially interested to see if we can factor out much of the common implementation bits between KVM and Gunyah. In principle, we're doing same thing: the difference is the exact mechanics to interact with the hypervisor which (I think) could be easily extracted into an ops structure. [...] > In addition to general feedback, we would like feedback on how we > handle mmap() and faulting-in guest pages at the host (KVM: Add > restricted support for mapping guest_memfd by the host). > > We don't enforce the invariant that only memory shared with the > host can be mapped by the host userspace in > file_operations::mmap(), but in vm_operations_struct:fault(). On > vm_operations_struct::fault(), we check whether the page is > shared with the host. If not, we deliver a SIGBUS to the current > task. The reason for enforcing this at fault() is that mmap() > does not elevate the pagecount(); it's the faulting in of the > page which does. Even if we were to check at mmap() whether an > address can be mapped, we would still need to check again on > fault(), since between mmap() and fault() the status of the page > can change. > > This creates the situation where access to successfully mmap()'d > memory might SIGBUS at page fault. There is precedence for > similar behavior in the kernel I believe, with MADV_HWPOISON and > the hugetlbfs cgroups controller, which could SIGBUS at page > fault time depending on the accounting limit. I added a test: folio_mmapped() [1] which checks if there's a vma covering the corresponding offset into the guest_memfd. I use this test before trying to make page private to guest and I've been able to ensure that userspace can't even mmap() private guest memory. If I try to make memory private, I can test that it's not mmapped and not allow memory to become private. In my testing so far, this is enough to prevent SIGBUS from happening. This test probably should be moved outside Gunyah specific code, and was looking for maintainer to suggest the right home for it :) [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@xxxxxxxxxxx/ > > Another pKVM specific aspect we would like feedback on, is how to > handle memory mapped by the host being unshared by a guest. The > approach we've taken is that on an unshare call from the guest, > the host userspace is notified that the memory has been unshared, > in order to allow it to unmap it and mark it as PRIVATE as > acknowledgment. If the host does not unmap the memory, the > unshare call issued by the guest fails, which the guest is > informed about on return. > > Cheers, > /fuad > > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@xxxxxxxxxx/ > > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core > > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 > > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 > > [5] Protected KVM on arm64 (slides) > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf > > [6] Protected KVM on arm64 (video) > https://www.youtube.com/watch?v=9npebeVFbFw > > [7] Supporting guest private memory in Protected KVM on Android (presentation) > https://lpc.events/event/17/contributions/1487/ > > [8] Drivers for Gunyah (patch series) > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@xxxxxxxxxxx/ As of 5 min ago when I send this, there's a v17: https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@xxxxxxxxxxx/ Thanks, Elliot