Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Fri, 22 Oct 2010 12:17:05 +0200

On Fri, Oct 22, 2010 at 12:42:43PM +0800, Sheng Yang wrote:
> On Thursday 21 October 2010 16:39:07 Michael S. Tsirkin wrote:
> > On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote:
> > > On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote:
> > > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > 
> > > The basic idea of kernel and QEmu's responsibilities are:
> > > 
> > > 1. Because QEmu owned the irq routing table, so the change of table
> > > should still go to the QEmu, like we did in msix_mmio_write().
> > > 
> > > 2. And the others things can be done in kernel, for performance. Here we
> > > covered the reading(converted entry from routing table and mask bit
> > > state of enabled MSI-X entries), and writing the mask bit for enabled
> > > MSI-X entries. Originally we only has mask bit handled in kernel, but
> > > later we found that Linux kernel would read MSI-X mmio just after every
> > > writing to mask bit, in order to flush the writing. So we add reading
> > > MSI data/addr as well.
> > > 
> > > 3. Disabled entries's mask bit accessing would go to QEmu, because it may
> > > result in disable/enable MSI-X. Explained later.
> > > 
> > > 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu to
> > > decide enable/disable MSI-X for device.
> > > .
> > 
> > Config space yes, but it's a simple global yes/no after all.
> > 
> > > 5. There is an distinction between enabled entry and disabled entry of
> > > MSI-X table.
> > 
> > That's my point. There's no such thing as 'enabled entries'
> > in the spec. There are only masked and unmasked entries.
> >
> > Current interface deals with gsi numbers so qemu had to work around
> > this. The hack used there is removing gsi for masked vector which has 0
> > address and data.  It works because this is what linux and windows
> > guests happen to do, but it is out of spec: vector/data value for a
> > masked entry have no meaning.
> 
> Well, I just realized something unnatural about the 0 contained data/address 
> entry. So I checked spec again, and found the mask bit should be set after reset. 
> So after fix this, I think unmasked 0 address/data entry shouldn't be there 
> anymore.

You are right that this 0 check is completely out of spec.  But see
below. The issue is the spec does not require  you to mask an entry if
the device does not use it. And we do not want to waste host vectors.
One can imagine a logic where we would detect that an interrupt
has not been used in a long while, mask it and give up
a host vector. Then check the pending bit once in a while to
see whether device started using it.

> > Since you are building a new interface, can design it without
> > constraints...
> 
> A constraint is pci_enable_msix().

It's an internal kernel API. No one prevents us from changing it.

> We have to use it to allocate irq for each 
> entry, as well as program the entry in the real hardware. pci_enable_msix() is 
> only a yes/no choice. We can't add new enabled entries after pci_enable_msix(), 

With the current APIs.

> and we can only enable/disable/mask/unmask one IRQ through kernel API, not the 
> entry in the MSI-X table. And we still have to allocate new IRQ for new entry.  
> When guest unmask "disabled entry", we have to disable and enable MSI-X again in 
> order to use the new entry. That's why "enabled/disabled entry" concept existed.
> 
> So even guest only unmasked one entry, it's a totally different work for KVM 
> underlaying. This logic won't change no matter where the MMIO handling is. And in 
> fact I don't like this kind of tricky things in kernel...

A more fundamental problem is that host vectors are a limited resource,
we don't want to waste them on entries that will end up unused.
One can imagine some kind of logic where we check the pending
bit on a masked entry and after a while give up the host vector.

This is what I said: making it spec compliant is harder as it will
need core kernel changes. Still, it seems silly to design a
kerne/userspace API around an internal API limitation ...

> > > The entries we had used for pci_enable_msix()(not necessary in sequence
> > > number) are already enabled, the others are disabled. When device's MSI-X
> > > is enabled and guest want to enable an disabled entry, we would go back
> > > to QEmu because this vector didn't exist in the routing table. Also due
> > > to pci_enable_msix() in kernel didn't allow us to enable vectors one by
> > > one, but all at once. So we have to disable MSI-X first, then enable it
> > > with new entries, which contained the new vector guest want to use. This
> > > situation is only happen when device is being initialized. After that,
> > > kernel can know and handle the mask bit of the enabled entry.
> > > 
> > > I've also considered handle all MMIO operation in kernel, and changing
> > > irq routing in kernel directly. But as long as irq routing is owned by
> > > QEmu, I think it's better to leave to it...
> > 
> > Yes, this is my suggestion, except we don't need no routing :)
> > To inject MSI you just need address/data pairs.
> > Look at kvm_set_msi: given address/data you can just inject
> > the interrupt. No need for table lookups.
> 
> You still need to look up data/address pair in the guest MSI-X table. The routing 
> table used here is just an replacement for the table, because we can construct the 
> entry according to the routing table. Two choices, using the routing table, or 
> creating an new MSI-X table. 
> 
> Still, the key is about who to own the routing/MSI-X table. If kernel own it, it 
> would be straightforward to intercept all the MMIO in the kernel; but if it's 
> QEmu, we still need go back to QEmu for it.

Looks cleaner to do it in kernel...

> > > Notice the mask/unmask bits must be handled together, either in kernel or
> > > in userspace. Because if kernel has handled enabled vector's mask bit
> > > directly, it would be unsync with QEmu's records. It doesn't matter when
> > > QEmu don't access the related record. And the only place QEmu want to
> > > consult it's enabled entries' mask bit state is writing to MSI
> > > addr/data. The writing should be discarded if the entry is unmasked.
> > > This checking has already been done by kernel in this patchset, so we
> > > are fine here.
> > > 
> > > If we want to access the enabled entries' mask bit in the future, we can
> > > directly access device's MMIO.
> > 
> > We really must implement this for correctness, btw. If you do not pass
> > reads to the device, messages intended for the masked entry
> > might still be in flight.
> 
> Oh, yes, kernel would also mask the device as well. I would take this into 
> consideration.
> 
> --
> regards
> Yang, Sheng
> 
> > 
> > > That's the reason why I have followed Michael's advice to use
> > > mask/unmask directly.
> > > Hope this would make the patches more clear. I meant to add comments for
> > > this changeset, but miss it later.
> > > 
> > > --
> > > regards
> > > Yang, Sheng
> > > 
> > > > Signed-off-by: Sheng Yang <sheng@xxxxxxxxxxxxxxx>
> > > > ---
> > > > 
> > > >  Documentation/kvm/api.txt |   22 ++++++++++++++++
> > > >  arch/x86/kvm/x86.c        |    6 ++++
> > > >  include/linux/kvm.h       |    8 +++++-
> > > >  virt/kvm/assigned-dev.c   |   60
> > > > 
> > > > +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 
> 95
> > > > insertions(+), 1 deletions(-)
> > > > 
> > > > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > > > index d82d637..f324a50 100644
> > > > --- a/Documentation/kvm/api.txt
> > > > +++ b/Documentation/kvm/api.txt
> > > > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> > > > 
> > > >  If any additional field gets added to this structure later on, a bit
> > > >  for
> > > > 
> > > > that additional piece of information will be set in the flags bitmap.
> > > > 
> > > > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > > > +
> > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > +Architectures: x86
> > > > +Type: vm ioctl
> > > > +Parameters: struct kvm_assigned_msix_mmio (in)
> > > > +Returns: 0 on success, !0 on error
> > > > +
> > > > +struct kvm_assigned_msix_mmio {
> > > > +	/* Assigned device's ID */
> > > > +	__u32 assigned_dev_id;
> > > > +	/* MSI-X table MMIO address */
> > > > +	__u64 base_addr;
> > > > +	/* Must be 0 */
> > > > +	__u32 flags;
> > > > +	/* Must be 0, reserved for future use */
> > > > +	__u64 reserved;
> > > > +};
> > > > +
> > > > +This ioctl would enable in-kernel MSI-X emulation, which would handle
> > > > MSI-X +mask bit in the kernel.
> > > > +
> > > > 
> > > >  5. The kvm_run structure
> > > >  
> > > >  Application code obtains a pointer to the kvm_run structure by
> > > > 
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index fc62546..ba07a2f 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > 
> > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > >  	case KVM_CAP_XSAVE:
> > > > 
> > > >  	case KVM_CAP_ENABLE_CAP:
> > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > 
> > > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > > >  		r = 1;
> > > >  		break;
> > > >  	
> > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct
> > > > kvm_vcpu *vcpu, return -EINVAL;
> > > > 
> > > >  	switch (cap->cap) {
> > > > 
> > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > +		vcpu->kvm->arch.msix_flags_enabled = true;
> > > > +		r = 0;
> > > > +		break;
> > > > 
> > > >  	default:
> > > >  		r = -EINVAL;
> > > >  		break;
> > > > 
> > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > index 0a7bd34..1494ed0 100644
> > > > --- a/include/linux/kvm.h
> > > > +++ b/include/linux/kvm.h
> > > > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
> > > > 
> > > >  #endif
> > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > 
> > > > +#ifdef __KVM_HAVE_MSIX
> > > > +#define KVM_CAP_DEVICE_MSIX_EXT 59
> > > > +#define KVM_CAP_DEVICE_MSIX_MASK 60
> > > > +#endif
> > > > 
> > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > 
> > > > @@ -671,6 +675,8 @@ struct kvm_clock_data {
> > > > 
> > > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > > > 
> > > > kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO, 
> > > > 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK            
> > > > _IOR(KVMIO, 0x7c, struct kvm_clock_data) +#define
> > > > KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO,  0x7d, \
> > > > +					struct kvm_assigned_msix_mmio)
> > > > 
> > > >  /* Available with KVM_CAP_PIT_STATE2 */
> > > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > > > 
> > > > kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0,
> > > > struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct
> > > > kvm_assigned_msix_mmio {
> > > > 
> > > >  	__u32 assigned_dev_id;
> > > >  	__u64 base_addr;
> > > >  	__u32 flags;
> > > > 
> > > > -	__u32 reserved[2];
> > > > +	__u64 reserved;
> > > > 
> > > >  };
> > > >  
> > > >  #endif /* __LINUX_KVM_H */
> > > > 
> > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > > index 5d2adc4..9573194 100644
> > > > --- a/virt/kvm/assigned-dev.c
> > > > +++ b/virt/kvm/assigned-dev.c
> > > > @@ -17,6 +17,8 @@
> > > > 
> > > >  #include <linux/pci.h>
> > > >  #include <linux/interrupt.h>
> > > >  #include <linux/slab.h>
> > > > 
> > > > +#include <linux/irqnr.h>
> > > > +
> > > > 
> > > >  #include "irq.h"
> > > >  
> > > >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> > > > 
> > > > list_head *head, @@ -169,6 +171,14 @@ static void
> > > > deassign_host_irq(struct kvm *kvm, */
> > > > 
> > > >  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> > > >  	
> > > >  		int i;
> > > > 
> > > > +#ifdef __KVM_HAVE_MSIX
> > > > +		if (assigned_dev->msix_mmio_base) {
> > > > +			mutex_lock(&kvm->slots_lock);
> > > > +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > > +					&assigned_dev->msix_mmio_dev);
> > > > +			mutex_unlock(&kvm->slots_lock);
> > > > +		}
> > > > +#endif
> > > > 
> > > >  		for (i = 0; i < assigned_dev->entries_nr; i++)
> > > >  		
> > > >  			disable_irq_nosync(assigned_dev->
> > > >  			
> > > >  					   host_msix_entries[i].vector);
> > > > 
> > > > @@ -318,6 +328,15 @@ static int assigned_device_enable_host_msix(struct
> > > > kvm *kvm, goto err;
> > > > 
> > > >  	}
> > > > 
> > > > +	if (dev->msix_mmio_base) {
> > > > +		mutex_lock(&kvm->slots_lock);
> > > > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > +				&dev->msix_mmio_dev);
> > > > +		mutex_unlock(&kvm->slots_lock);
> > > > +		if (r)
> > > > +			goto err;
> > > > +	}
> > > > +
> > > > 
> > > >  	return 0;
> > > >  
> > > >  err:
> > > >  	for (i -= 1; i >= 0; i--)
> > > > 
> > > > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops
> > > > msix_mmio_ops = { .write    = msix_mmio_write,
> > > > 
> > > >  };
> > > > 
> > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > > > +{
> > > > +	int r = 0;
> > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > +
> > > > +	mutex_lock(&kvm->lock);
> > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > +				      msix_mmio->assigned_dev_id);
> > > > +	if (!adev) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +	if (msix_mmio->base_addr == 0) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > > > +
> > > > +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > > +out:
> > > > +	mutex_unlock(&kvm->lock);
> > > > +
> > > > +	return r;
> > > > +}
> > > > 
> > > >  #endif
> > > >  
> > > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > 
> > > > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > > *kvm, unsigned ioctl, goto out;
> > > > 
> > > >  		break;
> > > >  	
> > > >  	}
> > > > 
> > > > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > > +		struct kvm_assigned_msix_mmio msix_mmio;
> > > > +
> > > > +		r = -EFAULT;
> > > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > +			goto out;
> > > > +
> > > > +		r = -EINVAL;
> > > > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > > +			goto out;
> > > > +
> > > > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > > +		if (r)
> > > > +			goto out;
> > > > +		break;
> > > > +	}
> > > > 
> > > >  #endif
> > > >  
> > > >  	}
> > > >  
> > > >  out:
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html