On Wed, 2024-08-28 at 07:25 -0700, Sean Christopherson wrote:
> On Tue, Aug 20, 2024, Ilias Stamatis wrote:
> > The current MMIO coalescing design has a few drawbacks which limit its
> > usefulness. Currently all coalesced MMIO zones use the same ring buffer.
> > That means that upon a userspace exit we have to handle potentially
> > unrelated MMIO writes synchronously. And a VM-wide lock needs to be
> > taken in the kernel when an MMIO exit occurs.
> > 
> > Additionally, there is no direct way for userspace to be notified about
> > coalesced MMIO writes. If the next MMIO exit to userspace is when the
> > ring buffer has filled then a substantial (and unbounded) amount of time
> > may have passed since the first coalesced MMIO.
> > 
> > Add a KVM_CREATE_COALESCED_MMIO_BUFFER ioctl to KVM. This ioctl simply
> 
> Why allocate the buffer in KVM? It allows reusing coalesced_mmio_write() without
> needing {get,put}_user() or any form of kmap(), but it adds complexity (quite a
> bit in fact) to KVM and inherits many of the restrictions and warts of the existing
> coalesced I/O implementation.
> 
> E.g. the size of the ring buffer is "fixed", yet variable based on the host kernel
> page size.
> 
> Why not go with a BYOB model? E.g. so that userspace can explicitly define the
> buffer size to best fit the properties of the I/O range, and to avoid additional
> complexity in KVM.

Fair enough. I used the original implementation as a guide, but let's do it
Right™ this time.

> 
> > returns a file descriptor to the caller but does not allocate a ring
> > buffer. Userspace can then pass this fd to mmap() to actually allocate a
> > buffer and map it to its address space.
> > 
> > Subsequent patches will allow userspace to:
> > 
> > - Associate the fd with a coalescing zone when registering it so that
> >   writes to that zone are accumulated in that specific ring buffer
> >   rather than the VM-wide one.
> > - Poll for MMIO writes using this fd.
> 
> Why? I get the desire for a doorbell, but KVM already supports "fast" I/O for
> doorbells.

What are you referring to here? ioeventfd? Correct me if I am wrong, but my
understanding is that with an ioeventfd the written value is not made
available to userspace, and that is a problem when that value is a head or
tail pointer.

I also realised that I need to improve my commit messages.

> The only use case I can think of is where the doorbell sits in the
> same region as the data payload, but that essentially just saves an entry in
> KVM's MMIO bus (or I suppose two entries if the doorbell is in the middle of
> the region).
> 
> > 
> > Signed-off-by: Ilias Stamatis <ilstam@xxxxxxxxxx>
> > Reviewed-by: Paul Durrant <paul@xxxxxxx>
> > ---
> 
> ...
> 
> > +static void coalesced_mmio_buffer_vma_close(struct vm_area_struct *vma)
> > +{
> > +	struct kvm_coalesced_mmio_buffer_dev *dev = vma->vm_private_data;
> > +
> > +	spin_lock(&dev->ring_lock);
> > +
> > +	vfree(dev->ring);
> > +	dev->ring = NULL;
> > +
> > +	spin_unlock(&dev->ring_lock);
> 
> I doubt this is correct. I don't see VM_DONTCOPY, and so userspace can create a
> second (or more) VMA by forking, and then KVM will hit a UAF if any of the VMAs
> is closed.

Oops. Will fix.
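
For the fork problem specifically, I suppose the simplest fix is to set
VM_DONTCOPY (plus VM_DONTEXPAND/VM_DONTDUMP for good measure) at mmap() time,
unless you would rather see the ring refcounted. Something along these lines
in the mmap handler, completely untested:

static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;

	/*
	 * Don't copy the VMA on fork(), otherwise a child tearing down its
	 * copy frees the ring underneath the parent's mapping.  VM_DONTEXPAND
	 * also forbids growing the mapping via mremap().
	 */
	vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);

	vma->vm_ops = &coalesced_mmio_buffer_vm_ops;
	vma->vm_private_data = dev;

	/* ... existing ring allocation and remapping here ... */

	return 0;
}
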
> 
> > +}
> > +
> > +static const struct vm_operations_struct coalesced_mmio_buffer_vm_ops = {
> > +	.close = coalesced_mmio_buffer_vma_close,
> > +};
> > +
> > +static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +	struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;
> > +	unsigned long pfn;
> > +	int ret = 0;
> > +
> > +	spin_lock(&dev->ring_lock);
> > +
> > +	if (dev->ring) {
> 
> Restricting userspace to a single mmap() is a very bizarre restriction.
> 

Hmm, you are right. Even if the buffer is allocated in the kernel and one has
already been allocated, we should let the user map it at multiple addresses.
I will fix that too.

> > +		ret = -EBUSY;
> > +		goto out_unlock;
> > +	}
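
Something roughly like the below, I think, so that only the first mmap()
allocates the ring and later calls just map the existing one. Untested, still
assumes the kernel-allocated buffer, uses remap_vmalloc_range() for brevity
instead of the per-pfn remapping in the patch, and with multiple mappings
allowed the ring would then have to be freed on fd release (or refcounted)
rather than in the VMA close handler:

static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;
	struct kvm_coalesced_mmio_ring *ring;

	/* Allocate outside the spinlock, vmalloc_user() may sleep. */
	ring = vmalloc_user(PAGE_SIZE);
	if (!ring)
		return -ENOMEM;

	spin_lock(&dev->ring_lock);
	if (!dev->ring) {
		/* The first mmap() on the fd installs the ring... */
		dev->ring = ring;
		ring = NULL;
	}
	spin_unlock(&dev->ring_lock);

	/* ...subsequent calls drop the spare and map the existing ring. */
	vfree(ring);

	return remap_vmalloc_range(vma, dev->ring, 0);
}

Though if we switch to the BYOB model you suggested, most of this goes away
anyway.

Thanks for the review.

Ilias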