On Tuesday 05 Apr 2022 at 10:51:36 (-0700), Andy Lutomirski wrote:
> Let's try actually counting syscalls and mode transitions, at least
> approximately. For non-direct IO (DMA allocation on the guest side, not
> straight to/from pagecache or similar):
>
> Guest writes to shared DMA buffer. Assume the guest is smart and
> reuses the buffer.
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmapped shared memory)
> host does pread() to read the DMA buffer, or reads the mmapped buffer
> host does the IO
> resume guest
> *** host -> hypervisor -> guest ***
>
> This is essentially optimal in terms of transitions. The data is
> copied on the guest side (which may well be mandatory depending on
> what guest userspace did to initiate the IO) and on the host (which
> may well be mandatory depending on what the host is doing with the
> data).
>
> Now let's try straight-from-guest-pagecache or otherwise zero-copy on
> the guest side. Without nondestructive changes, the guest needs a
> bounce buffer and it looks just like the above: one extra copy, zero
> extra mode transitions. With nondestructive changes, it's a bit more
> like physical hardware with an IOMMU:
>
> Guest shares the page.
> *** guest -> hypervisor ***
> Hypervisor adds a PTE. Let's assume we're being very optimal and the
> host is not synchronously notified.
> *** hypervisor -> guest ***
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmapped shared memory)
> mmap *** syscall ***
> host does the IO
> munmap *** syscall, TLBI ***
> resume guest
> *** host -> hypervisor -> guest ***
> Guest unshares the page.
> *** guest -> hypervisor ***
> Hypervisor removes the PTE. TLBI.
> *** hypervisor -> guest ***
>
> This is quite expensive. For small IO, pread() or splice() in the host
> may be a lot faster. Even for large IO, splice() may still win.

Right, that would work nicely for pages that are shared transiently,
but less so for long-term shares. But I guess your proposal below
should do the trick.

> I can imagine clever improvements. First, let's get rid of mmap() +
> munmap(). Instead, use a special device mapping with special
> semantics, not regular memory. (mmap and munmap are expensive even
> ignoring any arch and TLB stuff.) The rule is that, if the page is
> shared, access works, and if private, access doesn't, but it's still
> mapped. The hypervisor and the host cooperate to make it so.

As long as the page can't be GUP'd, I _think_ this shouldn't be a
problem. We can have the hypervisor re-inject the fault into the host,
and the host fault handler will deal with it just fine if the fault was
taken from userspace (inject a SEGV) or from the kernel through the
uaccess macros.

But we do get into issues if the host kernel can be tricked into
accessing the page via e.g. kmap(). I've been able to trigger this by
strace-ing a userspace process which passes a pointer to private memory
to a syscall: strace inspects the syscall argument using
process_vm_readv(), which does pin_user_pages_remote() and accesses the
page via kmap(), and then we're in trouble. But preventing GUP would
prevent this by construction, I think?
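For concreteness, the remote-access path described above can be
exercised from plain userspace. A minimal sketch, using only the
documented process_vm_readv() interface; the buffer here is ordinary
anonymous memory standing in for a guest-private page (in the real
scenario it is the kernel-side pin_user_pages_remote()/kmap() access
to that page that causes the problem):

/* A tracer uses process_vm_readv() -- the same interface strace uses
 * to decode syscall arguments -- to read another task's memory. In the
 * kernel this goes through pin_user_pages_remote() and a kmap()-style
 * access, so if 'buf' were backed by a guest-private page, the access
 * would be made from kernel context, not from the tracee's userspace.
 * The buffer here is ordinary memory, purely for illustration.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	static char buf[64] = "stand-in for a guest-private page";
	pid_t child = fork();

	if (child == 0) {	/* tracee: just sit on the buffer */
		pause();
		_exit(0);
	}

	char out[64] = { 0 };
	struct iovec local  = { .iov_base = out, .iov_len = sizeof(out) };
	struct iovec remote = { .iov_base = buf, .iov_len = sizeof(buf) };

	/* The kernel pins the remote page and copies from it. */
	ssize_t n = process_vm_readv(child, &local, 1, &remote, 1, 0);
	if (n < 0)
		perror("process_vm_readv");
	else
		printf("read %zd bytes: %s\n", n, out);

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}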
FWIW memfd_secret() did look like a good solution to this, but it lacks
the bidirectional notifiers with KVM that this patch series offers,
which are needed to allow KVM to handle guest faults; the series also
offers a good framework to support future extensions (e.g.
hypervisor-assisted page migration, swap, ...). So yes, ideally pKVM
would use a kind of hybrid between memfd_secret and the private fd
proposed here, or something else providing similar properties.

Thanks,
Quentin
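For reference, a minimal sketch of the memfd_secret() usage pattern
discussed above. The syscall landed in Linux 5.14, has no glibc
wrapper, and may need to be enabled at boot (secretmem), so treat this
as an illustration of the properties that made it attractive rather
than as anything from the series itself:

/* Pages backing a memfd_secret() fd are removed from the kernel's
 * direct map and cannot be GUP'd -- the properties that make the
 * kmap()-style host access described earlier impossible by
 * construction.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* The only defined flag is FD_CLOEXEC; none needed here. */
	int fd = syscall(SYS_memfd_secret, 0);
	if (fd < 0) {
		perror("memfd_secret"); /* ENOSYS if unsupported/disabled */
		return 1;
	}

	size_t len = sysconf(_SC_PAGESIZE);
	if (ftruncate(fd, len) < 0) {
		perror("ftruncate");
		return 1;
	}

	/* The owning process can map and touch the memory normally... */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	strcpy(p, "not reachable via the kernel direct map");
	printf("%s\n", p);

	/* ...but GUP-based kernel access to these pages fails by design. */
	munmap(p, len);
	close(fd);
	return 0;
}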