On Tue, Aug 20, 2024, Ilias Stamatis wrote: > The current MMIO coalescing design has a few drawbacks which limit its > usefulness. Currently all coalesced MMIO zones use the same ring buffer. > That means that upon a userspace exit we have to handle potentially > unrelated MMIO writes synchronously. And a VM-wide lock needs to be > taken in the kernel when an MMIO exit occurs. > > Additionally, there is no direct way for userspace to be notified about > coalesced MMIO writes. If the next MMIO exit to userspace is when the > ring buffer has filled then a substantial (and unbounded) amount of time > may have passed since the first coalesced MMIO. > > Add a KVM_CREATE_COALESCED_MMIO_BUFFER ioctl to KVM. This ioctl simply Why allocate the buffer in KVM? It allows reusing coalesced_mmio_write() without needing {get,put}_user() or any form of kmap(), but it adds complexity (quite a bit in fact) to KVM and inherits many of the restrictions and warts of the existing coalesced I/O implementation. E.g. the size of the ring buffer is "fixed", yet variable based on the host kernel page size. Why not go with a BYOB model? E.g. so that userspace can explicitly define the buffer size to best fit the properties of the I/O range, and to avoid additional complexity in KVM. > returns a file descriptor to the caller but does not allocate a ring > buffer. Userspace can then pass this fd to mmap() to actually allocate a > buffer and map it to its address space. > > Subsequent patches will allow userspace to: > > - Associate the fd with a coalescing zone when registering it so that > writes to that zone are accumulated in that specific ring buffer > rather than the VM-wide one. > - Poll for MMIO writes using this fd. Why? I get the desire for a doorbell, but KVM already supports "fast" I/O for doorbells. The only use case I can think of is where the doorbell sits in the same region as the data payload, but that essentially just saves an entry in KVM's MMIO bus (or I suppose two entries if the doorbell is in the middle of the region). > > Signed-off-by: Ilias Stamatis <ilstam@xxxxxxxxxx> > Reviewed-by: Paul Durrant <paul@xxxxxxx> > --- ... > +static void coalesced_mmio_buffer_vma_close(struct vm_area_struct *vma) > +{ > + struct kvm_coalesced_mmio_buffer_dev *dev = vma->vm_private_data; > + > + spin_lock(&dev->ring_lock); > + > + vfree(dev->ring); > + dev->ring = NULL; > + > + spin_unlock(&dev->ring_lock); I doubt this is correct. I don't see VM_DONTCOPY, and so userspace can create a second (or more) VMA by forking, and then KVM will hit a UAF if any of the VMAs is closed. > +} > + > +static const struct vm_operations_struct coalesced_mmio_buffer_vm_ops = { > + .close = coalesced_mmio_buffer_vma_close, > +}; > + > +static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data; > + unsigned long pfn; > + int ret = 0; > + > + spin_lock(&dev->ring_lock); > + > + if (dev->ring) { Restricting userspace to a single mmap() is a very bizarre restriction. > + ret = -EBUSY; > + goto out_unlock; > + }