On 11/08/22 17:00, Gupta, Pankaj wrote:
>
>>> This is the v7 of this series which tries to implement the fd-based KVM
>>> guest private memory. The patches are based on latest kvm/queue branch
>>> commit:
>>>
>>>   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>   split_desc_cache only by default capacity
>>>
>>> Introduction
>>> ------------
>>> In general this patch series introduces an fd-based memslot which provides
>>> guest memory through a memory file descriptor fd[offset,size] instead of
>>> hva/size. The fd can be created from a supported memory filesystem
>>> like tmpfs/hugetlbfs etc., which we refer to as the memory backing store.
>>> KVM and the memory backing store exchange callbacks when such a memslot
>>> gets created. At runtime KVM will call into callbacks provided by the
>>> backing store to get the pfn for the fd+offset. The memory backing store
>>> will also call into KVM callbacks when userspace punches a hole in the fd,
>>> to notify KVM to unmap secondary MMU page table entries.
>>>
>>> Compared to the existing hva-based memslot, this new type of memslot allows
>>> guest memory to be unmapped from host userspace like QEMU and even the
>>> kernel itself, thereby reducing the attack surface and preventing bugs.
>>>
>>> Based on this fd-based memslot, we can build guest private memory that
>>> is going to be used in confidential computing environments such as Intel
>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>> stronger enforcement on the fd and KVM can use a single memslot to hold
>>> both the private and shared parts of the guest memory.
>>>
>>> mm extension
>>> ------------
>>> Introduces a new MFD_INACCESSIBLE flag for memfd_create(); a file
>>> created with this flag cannot be accessed via read(), write() or mmap()
>>> etc. through normal MMU operations. The file content can only be used
>>> with the newly introduced memfile_notifier extension.
>>>
>>> The memfile_notifier extension provides two sets of callbacks for KVM to
>>> interact with the memory backing store:
>>>   - memfile_notifier_ops: callbacks for the memory backing store to notify
>>>     KVM when memory gets invalidated.
>>>   - backing store callbacks: callbacks for KVM to call into the memory
>>>     backing store to request memory pages for guest private memory.
>>>
>>> The memfile_notifier extension also provides APIs for the memory backing
>>> store to register/unregister itself and to trigger the notifier when the
>>> bookmarked memory gets invalidated.
>>>
>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>> prevent double allocation caused by unintentional guest accesses when only
>>> a single side of the shared/private memfds is effective.
>>>
>>> memslot extension
>>> -----------------
>>> Add the private fd and the fd offset to the existing 'shared' memslot so
>>> that both private and shared guest memory can live in one single memslot.
>>> A page in the memslot is either private or shared. Whether a guest page
>>> is private or shared is maintained by reusing the existing SEV ioctls
>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>
>>> Test
>>> ----
>>> To test the new functionality of this patchset, the TDX patchset is needed.
>>> Since the TDX patchset has not been merged, I did two kinds of tests:
>>>
>>>   - Regression test on kvm/queue (this patchset)
>>>     Most of the new code is not covered. Code also in below repo:
>>>     https://github.com/chao-p/linux/tree/privmem-v7
>>>
>>>   - New functional test on latest TDX code
>>>     The patchset is rebased onto the latest TDX code and the new
>>>     functionality is tested. See below repos:
>>>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>     QEMU:  https://github.com/chao-p/qemu/tree/privmem-v7
>>
>> While debugging an issue with SEV+UPM, I found that fallocate() returns
>> an error (EINTR) in QEMU which is not handled. With the below handling
>> of EINTR, the subsequent fallocate() succeeds:
>>
>> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
>> index af8fb0c957..e8597ed28d 100644
>> --- a/backends/hostmem-memfd-private.c
>> +++ b/backends/hostmem-memfd-private.c
>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>      MachineState *machine = MACHINE(qdev_get_machine());
>>      uint32_t ram_flags;
>>      char *name;
>> -    int fd, priv_fd;
>> +    int fd, priv_fd, ret;
>>
>>      if (!backend->size) {
>>          error_setg(errp, "can't create backend with size 0");
>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>                                        backend->size, ram_flags, fd, 0, errp);
>>      g_free(name);
>>
>> -    fallocate(priv_fd, 0, 0, backend->size);
>> +again:
>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>> +    if (ret) {
>> +        perror("Fallocate failed: \n");
>> +        if (errno == EINTR)
>> +            goto again;
>> +        else
>> +            exit(1);
>> +    }
>>
>> However, fallocate() preallocates the full guest memory before starting the
>> guest. With this behaviour guest memory is *not* demand-pinned. Is there a
>> way to prevent fallocate() from reserving the full guest memory?
>
> Isn't the pinning being handled by the corresponding host memory backend with
> mmu notifier and architecture support while doing the memory operations, e.g.
> page migration and swapping/reclaim (not supported currently AFAIU)? But yes,
> we need to allocate the entire guest memory with the new flags
> MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.

That is correct, but the question is when the memory gets allocated: once these
flags are set, the memory is neither moved nor reclaimed. In the current
scenario, if I start a 32GB guest, all 32GB is allocated up front.
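As an aside, here is a minimal, untested sketch of what the userspace flow
described in the cover letter might look like, with the EINTR case handled by a
plain retry loop instead of aborting. MFD_INACCESSIBLE and F_SEAL_AUTO_ALLOCATE
only exist with the uapi headers from this series (and whether they may be
combined with MFD_ALLOW_SEALING like this is also defined by the series); the
function and variable names are purely illustrative:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static int create_private_memfd(size_t size)
{
    int fd, ret;

    /* MFD_INACCESSIBLE is provided by this series' uapi headers. */
    fd = memfd_create("guest-private", MFD_ALLOW_SEALING | MFD_INACCESSIBLE);
    if (fd < 0) {
        perror("memfd_create");
        return -1;
    }

    /* F_SEAL_AUTO_ALLOCATE is likewise provided by this series. */
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_AUTO_ALLOCATE) < 0) {
        perror("F_ADD_SEALS");
        close(fd);
        return -1;
    }

    /* Preallocate the backing memory, retrying if interrupted by a signal. */
    do {
        ret = fallocate(fd, 0, 0, size);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0) {
        perror("fallocate");
        close(fd);
        return -1;
    }

    return fd;
}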
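And on the memslot side, a hypothetical sketch of how a guest range could be
flipped to private by reusing the existing SEV ioctl the cover letter mentions.
vm_fd, addr and size are placeholders, and how 'addr' is interpreted for this
use case is defined by the series itself, not by this sketch:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Mark [addr, addr + size) private; the UNREG variant flips it back to shared. */
static int set_range_private(int vm_fd, uint64_t addr, uint64_t size)
{
    struct kvm_enc_region region = {
        .addr = addr,
        .size = size,
    };

    return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
}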
Regards
Nikunj