Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Thu, 21 Oct 2010 10:39:07 +0200

On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote:
> On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote:
> > This patch enable per-vector mask for assigned devices using MSI-X.
> 
> The basic idea of kernel and QEmu's responsibilities are:
> 
> 1. Because QEmu owned the irq routing table, so the change of table should still 
> go to the QEmu, like we did in msix_mmio_write().
> 
> 2. And the others things can be done in kernel, for performance. Here we covered 
> the reading(converted entry from routing table and mask bit state of enabled MSI-X 
> entries), and writing the mask bit for enabled MSI-X entries. Originally we only 
> has mask bit handled in kernel, but later we found that Linux kernel would read 
> MSI-X mmio just after every writing to mask bit, in order to flush the writing. So 
> we add reading MSI data/addr as well.
> 
> 3. Disabled entries's mask bit accessing would go to QEmu, because it may result 
> in disable/enable MSI-X. Explained later.
> 
> 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu to 
> decide enable/disable MSI-X for device.
> .

Config space yes, but it's a simple global yes/no after all.

> 5. There is an distinction between enabled entry and disabled entry of MSI-X 
> table.

That's my point. There's no such thing as 'enabled entries'
in the spec. There are only masked and unmasked entries.

Current interface deals with gsi numbers so qemu had to work around
this. The hack used there is removing gsi for masked vector which has 0
address and data.  It works because this is what linux and windows
guests happen to do, but it is out of spec: vector/data value for a
masked entry have no meaning.

Since you are building a new interface, can design it without
constraints...

> The entries we had used for pci_enable_msix()(not necessary in sequence 
> number) are already enabled, the others are disabled. When device's MSI-X is 
> enabled and guest want to enable an disabled entry, we would go back to QEmu  
> because this vector didn't exist in the routing table. Also due to 
> pci_enable_msix() in kernel didn't allow us to enable vectors one by one, but all 
> at once. So we have to disable MSI-X first, then enable it with new entries, which 
> contained the new vector guest want to use. This situation is only happen when 
> device is being initialized. After that, kernel can know and handle the mask bit 
> of the enabled entry.
> 
> I've also considered handle all MMIO operation in kernel, and changing irq routing 
> in kernel directly. But as long as irq routing is owned by QEmu, I think it's 
> better to leave to it...

Yes, this is my suggestion, except we don't need no routing :)
To inject MSI you just need address/data pairs.
Look at kvm_set_msi: given address/data you can just inject
the interrupt. No need for table lookups.

> Notice the mask/unmask bits must be handled together, either in kernel or in 
> userspace. Because if kernel has handled enabled vector's mask bit directly, it 
> would be unsync with QEmu's records. It doesn't matter when QEmu don't access the 
> related record. And the only place QEmu want to consult it's enabled entries' mask 
> bit state is writing to MSI addr/data. The writing should be discarded if the 
> entry is unmasked. This checking has already been done by kernel in this patchset, 
> so we are fine here.
> 
> If we want to access the enabled entries' mask bit in the future, we can directly 
> access device's MMIO.

We really must implement this for correctness, btw. If you do not pass
reads to the device, messages intended for the masked entry
might still be in flight.

> That's the reason why I have followed Michael's advice to use
> mask/unmask directly.
> Hope this would make the patches more clear. I meant to add comments for this 
> changeset, but miss it later.
> 
> --
> regards
> Yang, Sheng
> 
> > 
> > Signed-off-by: Sheng Yang <sheng@xxxxxxxxxxxxxxx>
> > ---
> >  Documentation/kvm/api.txt |   22 ++++++++++++++++
> >  arch/x86/kvm/x86.c        |    6 ++++
> >  include/linux/kvm.h       |    8 +++++-
> >  virt/kvm/assigned-dev.c   |   60
> > +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 95
> > insertions(+), 1 deletions(-)
> > 
> > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > index d82d637..f324a50 100644
> > --- a/Documentation/kvm/api.txt
> > +++ b/Documentation/kvm/api.txt
> > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> >  If any additional field gets added to this structure later on, a bit for
> > that additional piece of information will be set in the flags bitmap.
> > 
> > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > +
> > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_assigned_msix_mmio (in)
> > +Returns: 0 on success, !0 on error
> > +
> > +struct kvm_assigned_msix_mmio {
> > +	/* Assigned device's ID */
> > +	__u32 assigned_dev_id;
> > +	/* MSI-X table MMIO address */
> > +	__u64 base_addr;
> > +	/* Must be 0 */
> > +	__u32 flags;
> > +	/* Must be 0, reserved for future use */
> > +	__u64 reserved;
> > +};
> > +
> > +This ioctl would enable in-kernel MSI-X emulation, which would handle
> > MSI-X +mask bit in the kernel.
> > +
> >  5. The kvm_run structure
> > 
> >  Application code obtains a pointer to the kvm_run structure by
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index fc62546..ba07a2f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> >  	case KVM_CAP_XSAVE:
> >  	case KVM_CAP_ENABLE_CAP:
> > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > +	case KVM_CAP_DEVICE_MSIX_MASK:
> >  		r = 1;
> >  		break;
> >  	case KVM_CAP_COALESCED_MMIO:
> > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu
> > *vcpu, return -EINVAL;
> > 
> >  	switch (cap->cap) {
> > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > +		vcpu->kvm->arch.msix_flags_enabled = true;
> > +		r = 0;
> > +		break;
> >  	default:
> >  		r = -EINVAL;
> >  		break;
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 0a7bd34..1494ed0 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
> >  #endif
> >  #define KVM_CAP_PPC_GET_PVINFO 57
> >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > +#ifdef __KVM_HAVE_MSIX
> > +#define KVM_CAP_DEVICE_MSIX_EXT 59
> > +#define KVM_CAP_DEVICE_MSIX_MASK 60
> > +#endif
> > 
> >  #ifdef KVM_CAP_IRQ_ROUTING
> > 
> > @@ -671,6 +675,8 @@ struct kvm_clock_data {
> >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b,
> > struct kvm_clock_data) #define KVM_GET_CLOCK             _IOR(KVMIO, 
> > 0x7c, struct kvm_clock_data) +#define KVM_ASSIGN_REG_MSIX_MMIO 
> > _IOW(KVMIO,  0x7d, \
> > +					struct kvm_assigned_msix_mmio)
> >  /* Available with KVM_CAP_PIT_STATE2 */
> >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0,
> > struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct kvm_assigned_msix_mmio {
> >  	__u32 assigned_dev_id;
> >  	__u64 base_addr;
> >  	__u32 flags;
> > -	__u32 reserved[2];
> > +	__u64 reserved;
> >  };
> > 
> >  #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index 5d2adc4..9573194 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > @@ -17,6 +17,8 @@
> >  #include <linux/pci.h>
> >  #include <linux/interrupt.h>
> >  #include <linux/slab.h>
> > +#include <linux/irqnr.h>
> > +
> >  #include "irq.h"
> > 
> >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> > list_head *head, @@ -169,6 +171,14 @@ static void deassign_host_irq(struct
> > kvm *kvm, */
> >  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> >  		int i;
> > +#ifdef __KVM_HAVE_MSIX
> > +		if (assigned_dev->msix_mmio_base) {
> > +			mutex_lock(&kvm->slots_lock);
> > +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > +					&assigned_dev->msix_mmio_dev);
> > +			mutex_unlock(&kvm->slots_lock);
> > +		}
> > +#endif
> >  		for (i = 0; i < assigned_dev->entries_nr; i++)
> >  			disable_irq_nosync(assigned_dev->
> >  					   host_msix_entries[i].vector);
> > @@ -318,6 +328,15 @@ static int assigned_device_enable_host_msix(struct kvm
> > *kvm, goto err;
> >  	}
> > 
> > +	if (dev->msix_mmio_base) {
> > +		mutex_lock(&kvm->slots_lock);
> > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > +				&dev->msix_mmio_dev);
> > +		mutex_unlock(&kvm->slots_lock);
> > +		if (r)
> > +			goto err;
> > +	}
> > +
> >  	return 0;
> >  err:
> >  	for (i -= 1; i >= 0; i--)
> > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops msix_mmio_ops =
> > { .write    = msix_mmio_write,
> >  };
> > 
> > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > +{
> > +	int r = 0;
> > +	struct kvm_assigned_dev_kernel *adev;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > +				      msix_mmio->assigned_dev_id);
> > +	if (!adev) {
> > +		r = -EINVAL;
> > +		goto out;
> > +	}
> > +	if (msix_mmio->base_addr == 0) {
> > +		r = -EINVAL;
> > +		goto out;
> > +	}
> > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > +
> > +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > +out:
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	return r;
> > +}
> >  #endif
> > 
> >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm,
> > unsigned ioctl, goto out;
> >  		break;
> >  	}
> > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > +		struct kvm_assigned_msix_mmio msix_mmio;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > +			goto out;
> > +
> > +		r = -EINVAL;
> > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > +			goto out;
> > +
> > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > +		if (r)
> > +			goto out;
> > +		break;
> > +	}
> >  #endif
> >  	}
> >  out:
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html