Re: [PATCH v3 2/6] KVM: Add KVM_CREATE_COALESCED_MMIO_BUFFER ioctl

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 28 Aug 2024 07:25:26 -0700

On Tue, Aug 20, 2024, Ilias Stamatis wrote:
> The current MMIO coalescing design has a few drawbacks which limit its
> usefulness. Currently all coalesced MMIO zones use the same ring buffer.
> That means that upon a userspace exit we have to handle potentially
> unrelated MMIO writes synchronously. And a VM-wide lock needs to be
> taken in the kernel when an MMIO exit occurs.
> 
> Additionally, there is no direct way for userspace to be notified about
> coalesced MMIO writes. If the next MMIO exit to userspace is when the
> ring buffer has filled then a substantial (and unbounded) amount of time
> may have passed since the first coalesced MMIO.
> 
> Add a KVM_CREATE_COALESCED_MMIO_BUFFER ioctl to KVM. This ioctl simply

Why allocate the buffer in KVM?  It allows reusing coalesced_mmio_write() without
needing {get,put}_user() or any form of kmap(), but it adds complexity (quite a
bit in fact) to KVM and inherits many of the restrictions and warts of the existing
coalesced I/O implementation.

E.g. the size of the ring buffer is "fixed", yet variable based on the host kernel
page size.

Why not go with a BYOB model?  E.g. so that userspace can explicitly define the
buffer size to best fit the properties of the I/O range, and to avoid additional
complexity in KVM.

> returns a file descriptor to the caller but does not allocate a ring
> buffer. Userspace can then pass this fd to mmap() to actually allocate a
> buffer and map it to its address space.
> 
> Subsequent patches will allow userspace to:
> 
> - Associate the fd with a coalescing zone when registering it so that
>   writes to that zone are accumulated in that specific ring buffer
>   rather than the VM-wide one.
> - Poll for MMIO writes using this fd.

Why?  I get the desire for a doorbell, but KVM already supports "fast" I/O for
doorbells.  The only use case I can think of is where the doorbell sits in the
same region as the data payload, but that essentially just saves an entry in
KVM's MMIO bus (or I suppose two entries if the doorbell is in the middle of
the region).

> 
> Signed-off-by: Ilias Stamatis <ilstam@xxxxxxxxxx>
> Reviewed-by: Paul Durrant <paul@xxxxxxx>
> ---

...

> +static void coalesced_mmio_buffer_vma_close(struct vm_area_struct *vma)
> +{
> +	struct kvm_coalesced_mmio_buffer_dev *dev = vma->vm_private_data;
> +
> +	spin_lock(&dev->ring_lock);
> +
> +	vfree(dev->ring);
> +	dev->ring = NULL;
> +
> +	spin_unlock(&dev->ring_lock);

I doubt this is correct.  I don't see VM_DONTCOPY, and so userspace can create a
second (or more) VMA by forking, and then KVM will hit a UAF if any of the VMAs
is closed.

> +}
> +
> +static const struct vm_operations_struct coalesced_mmio_buffer_vm_ops = {
> +	.close = coalesced_mmio_buffer_vma_close,
> +};
> +
> +static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;
> +	unsigned long pfn;
> +	int ret = 0;
> +
> +	spin_lock(&dev->ring_lock);
> +
> +	if (dev->ring) {

Restricting userspace to a single mmap() is a very bizarre restriction.

> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}