On 06/07/22 13:50, Chao Peng wrote: > This is the v7 of this series which tries to implement the fd-based KVM > guest private memory. The patches are based on latest kvm/queue branch > commit: > > b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU > split_desc_cache only by default capacity > > Introduction > ------------ > In general this patch series introduce fd-based memslot which provides > guest memory through memory file descriptor fd[offset,size] instead of > hva/size. The fd can be created from a supported memory filesystem > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM > and the the memory backing store exchange callbacks when such memslot > gets created. At runtime KVM will call into callbacks provided by the > backing store to get the pfn with the fd+offset. Memory backing store > will also call into KVM callbacks when userspace punch hole on the fd > to notify KVM to unmap secondary MMU page table entries. > > Comparing to existing hva-based memslot, this new type of memslot allows > guest memory unmapped from host userspace like QEMU and even the kernel > itself, therefore reduce attack surface and prevent bugs. > > Based on this fd-based memslot, we can build guest private memory that > is going to be used in confidential computing environments such as Intel > TDX and AMD SEV. When supported, the memory backing store can provide > more enforcement on the fd and KVM can use a single memslot to hold both > the private and shared part of the guest memory. > > mm extension > --------------------- > Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file > created with these flags cannot read(), write() or mmap() etc via normal > MMU operations. The file content can only be used with the newly > introduced memfile_notifier extension. > > The memfile_notifier extension provides two sets of callbacks for KVM to > interact with the memory backing store: > - memfile_notifier_ops: callbacks for memory backing store to notify > KVM when memory gets invalidated. > - backing store callbacks: callbacks for KVM to call into memory > backing store to request memory pages for guest private memory. > > The memfile_notifier extension also provides APIs for memory backing > store to register/unregister itself and to trigger the notifier when the > bookmarked memory gets invalidated. > > The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to > prevent double allocation caused by unintentional guest when we only > have a single side of the shared/private memfds effective. > > memslot extension > ----------------- > Add the private fd and the fd offset to existing 'shared' memslot so > that both private/shared guest memory can live in one single memslot. > A page in the memslot is either private or shared. Whether a guest page > is private or shared is maintained through reusing existing SEV ioctls > KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. > > Test > ---- > To test the new functionalities of this patch TDX patchset is needed. > Since TDX patchset has not been merged so I did two kinds of test: > > - Regresion test on kvm/queue (this patchset) > Most new code are not covered. Code also in below repo: > https://github.com/chao-p/linux/tree/privmem-v7 > > - New Funational test on latest TDX code > The patch is rebased to latest TDX code and tested the new > funcationalities. See below repos: > Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx > QEMU: https://github.com/chao-p/qemu/tree/privmem-v7 While debugging an issue with SEV+UPM, found that fallocate() returns an error in QEMU which is not handled (EINTR). With the below handling of EINTR subsequent fallocate() succeeds: diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c index af8fb0c957..e8597ed28d 100644 --- a/backends/hostmem-memfd-private.c +++ b/backends/hostmem-memfd-private.c @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) MachineState *machine = MACHINE(qdev_get_machine()); uint32_t ram_flags; char *name; - int fd, priv_fd; + int fd, priv_fd, ret; if (!backend->size) { error_setg(errp, "can't create backend with size 0"); @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) backend->size, ram_flags, fd, 0, errp); g_free(name); - fallocate(priv_fd, 0, 0, backend->size); +again: + ret = fallocate(priv_fd, 0, 0, backend->size); + if (ret) { + perror("Fallocate failed: \n"); + if (errno == EINTR) + goto again; + else + exit(1); + } However, fallocate() preallocates full guest memory before starting the guest. With this behaviour guest memory is *not* demand pinned. Is there a way to prevent fallocate() from reserving full guest memory? > An example QEMU command line for TDX test: > -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \ > -machine confidential-guest-support=tdx \ > -object memory-backend-memfd-private,id=ram1,size=${mem} \ > -machine memory-backend=ram1 > Regards, Nikunj