On Fri, Oct 22, 2010 at 09:30:09PM +0800, Sheng Yang wrote: > On Friday 22 October 2010 18:17:05 Michael S. Tsirkin wrote: > > On Fri, Oct 22, 2010 at 12:42:43PM +0800, Sheng Yang wrote: > > > On Thursday 21 October 2010 16:39:07 Michael S. Tsirkin wrote: > > > > On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote: > > > > > On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote: > > > > > > This patch enables per-vector masking for assigned devices using MSI-X. > > > > > > > > > > The basic division of responsibilities between the kernel and QEmu is: > > > > > > > > > > 1. Because QEmu owns the irq routing table, changes to the table > > > > > should still go to QEmu, like we did in msix_mmio_write(). > > > > > > > > > > 2. The other things can be done in the kernel, for performance. Here > > > > > we covered reads (the entry converted from the routing table and the mask > > > > > bit state of enabled MSI-X entries), and writes to the mask bit of > > > > > enabled MSI-X entries. Originally we only had the mask bit handled in the > > > > > kernel, but later we found that the Linux kernel would read MSI-X mmio > > > > > just after every write to the mask bit, in order to flush the write. > > > > > So we added reading MSI data/addr as well. > > > > > > > > > > 3. Accesses to a disabled entry's mask bit would go to QEmu, because they > > > > > may result in disabling/enabling MSI-X. Explained later. > > > > > > > > > > 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu > > > > > that decides whether to enable/disable MSI-X for the device. > > > > > > > > Config space yes, but it's a simple global yes/no after all. > > > > > > > > > 5. There is a distinction between enabled entries and disabled entries > > > > > of the MSI-X table. > > > > > > > > That's my point. There's no such thing as 'enabled entries' > > > > in the spec. There are only masked and unmasked entries. > > > > > > > > The current interface deals with gsi numbers so qemu had to work around > > > > this.
The hack used there is removing the gsi for a masked vector which has 0 > > > > address and data. It works because this is what linux and windows > > > > guests happen to do, but it is out of spec: the vector/data values for a > > > > masked entry have no meaning. > > > Well, I just realized something unnatural about the entry containing 0 > > > data/address. So I checked the spec again, and found the mask bit > > > should be set after reset. So after fixing this, I think an unmasked 0 > > > address/data entry shouldn't be there anymore. > > You are right that this 0 check is completely out of spec. But see > > below. The issue is the spec does not require you to mask an entry if > > the device does not use it. And we do not want to waste host vectors. > > One can imagine logic where we would detect that an interrupt > > has not been used in a long while, mask it and give up > > a host vector. Then check the pending bit once in a while to > > see whether the device started using it. > I don't think introducing this complex and speculative logic makes sense. I > haven't seen any such scenario yet. What's the issue with the current implementation? The main issue I think is that the current implementation assumes that when MSI-X is enabled, all vectors that are masked or have a 0 value will not be used anymore. The spec says no such thing. > > > > Since you are building a new interface, you can design it without > > > > constraints... > > > A constraint is pci_enable_msix(). > > It's an internal kernel API. No one prevents us from changing it. > You can say no one. But I'm afraid if we want to overhaul this kind of core PCI > function, it may take months to get it checked in upstream - also assuming we > can persuade them this overhaul is absolutely needed (I hope I'm wrong on this). > This may also mean we have to find another user for this kind of change. No, usually a single user is enough to add internal APIs :) This is because we do not promise backwards compatibility there.
> The > key issue is I don't know what we can gain for certain from it. The current > disable/enable mechanism still works well. Heh, given that things seem to work without mask support at all, I am not sure what this means. > I don't know why we need to spend a lot of > effort on this just because the spec doesn't say there are "enabled/disabled entries". I would like to see an implementation using 100% architectural features. It's enough that we need to worry about device quirks. Relying on guests to only use them in a specific way means for each bug we'll have to worry whether the guest is doing something we did not expect. If something is just too hard to implement, a temporary workaround would be to still build an interface to allow this, and then add a trace in the kernel making this easy to detect. Then we can fix this fully in the next version without affecting userspace. > Yes it's not that elegant, but we need to carefully think about whether the effort is worth it. What I think we should be careful about is making sure we avoid making the kernel/userspace API depend on an internal kernel one. Supporting such an API long term when the internal one changes will be much more painful. > > > > > We have to use it to allocate an irq for each > > > entry, as well as program the entry in the real hardware. > > > pci_enable_msix() is only a yes/no choice. We can't add newly enabled > > > entries after pci_enable_msix(), > > With the current APIs. > > > and we can only enable/disable/mask/unmask one IRQ through the kernel API, > > not the entry in the MSI-X table. And we still have to allocate a new IRQ > > for a new entry. When the guest unmasks a "disabled entry", we have to disable > > and enable MSI-X again in order to use the new entry. That's why > > the "enabled/disabled entry" concept exists. > > > So even if the guest only unmasked one entry, it's totally different work for > > KVM underneath. This logic won't change no matter where the MMIO > > handling is.
And in fact I don't like this kind of tricky thing in the > > > kernel... > > A more fundamental problem is that host vectors are a limited resource, > > we don't want to waste them on entries that will end up unused. > > One can imagine some kind of logic where we check the pending > > bit on a masked entry and after a while give up the host vector. > For the data/address != 0 entries, I haven't seen any of them left unused in the > end. So I don't understand your point here. I'll try to give another example. Imagine a guest driver loaded and using N vectors. It gets unloaded, then another driver is loaded using N-1 vectors. Vector N is unused but it is != 0 so we will think it is used and will allocate an entry for it. > > > > This is what I said: making it spec compliant is harder as it will > > need core kernel changes. Still, it seems silly to design a > > kernel/userspace API around an internal API limitation ... > I still don't think we violate the spec here - in which case do we fail to comply with the > spec? If some device wants to send a 0 message to a 0 address on x86, it fails to > comply with the x86 spec. The guest can write a non-0 value there and use the vector afterwards. If we allocate vectors upfront we won't be able to support this. In other words, it works but it's just a heuristic. It's nice that it usually works, on x86, but we are just lucky and I suspect it will fail in strange scenarios such as a driver change, because it is not architectural in the spec. > And I know the current implementation is not elegant due to the internal API > limitation, but just saying "change" is not enough. A function to enable a separate > entry is of course good to have, but I still think we don't have enough reason for > doing it. Implementation is one thing, the kernel/user API is another one. I think we should try to make the second one future-proof, but the implementation can be partial and have some TODO items - that's okay, and it's even a good idea to build things step by step.
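To make the masked-vs-zero distinction concrete, here is a minimal C sketch (mine, not from the patch; the constants mirror the MSI-X table entry layout defined in the PCI spec) contrasting the architectural per-vector mask check with the addr/data == 0 heuristic discussed above:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* MSI-X table layout per the PCI spec: 16 bytes per entry; the
 * per-vector mask is bit 0 of the Vector Control dword at offset 12. */
#define MSIX_ENTRY_SIZE          16
#define MSIX_ENTRY_VECTOR_CTRL   12
#define MSIX_ENTRY_CTRL_MASKBIT  0x1

/* Architectural test: an entry is unusable only while its mask bit is set. */
static bool msix_entry_masked(const uint8_t *table, unsigned int nr)
{
    uint32_t ctrl;
    memcpy(&ctrl, table + nr * MSIX_ENTRY_SIZE + MSIX_ENTRY_VECTOR_CTRL,
           sizeof(ctrl));
    return ctrl & MSIX_ENTRY_CTRL_MASKBIT;
}

/* The heuristic under discussion: treat an all-zero addr/data entry as
 * unused.  A stale non-zero entry left behind by an unloaded driver
 * defeats this check. */
static bool msix_entry_zero(const uint8_t *table, unsigned int nr)
{
    uint64_t addr;
    uint32_t data;
    memcpy(&addr, table + nr * MSIX_ENTRY_SIZE, sizeof(addr));
    memcpy(&data, table + nr * MSIX_ENTRY_SIZE + 8, sizeof(data));
    return addr == 0 && data == 0;
}
```

The driver-reload example above is exactly where msix_entry_zero() misclassifies: vector N keeps its old non-zero addr/data, so the heuristic reports it as in use even though the new driver never touches it, while the mask bit remains the architecturally correct signal.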
> -- > regards > Yang, Sheng > > > > > > > The entries we had used for pci_enable_msix() (not necessarily in > > > > > sequence) are already enabled, the others are disabled. When the > > > > > device's MSI-X is enabled and the guest wants to enable a disabled > > > > > entry, we would go back to QEmu because this vector didn't exist in > > > > > the routing table. Also, pci_enable_msix() in the kernel doesn't > > > > > allow us to enable vectors one by one, only all at once. So we have > > > > > to disable MSI-X first, then enable it with new entries, which > > > > > contain the new vector the guest wants to use. This situation only > > > > > happens when the device is being initialized. After that, the kernel can know > > > > > and handle the mask bit of the enabled entry. > > > > > > > > > > I've also considered handling all MMIO operations in the kernel, and > > > > > changing irq routing in the kernel directly. But as long as irq routing > > > > > is owned by QEmu, I think it's better to leave it to QEmu... > > > > Yes, this is my suggestion, except we don't need no routing :) > > > > To inject MSI you just need address/data pairs. > > > > Look at kvm_set_msi: given address/data you can just inject > > > > the interrupt. No need for table lookups. > > > You still need to look up the data/address pair in the guest MSI-X table. The > > > routing table used here is just a replacement for that table, because we > > > can construct the entry according to the routing table. Two choices: > > > use the routing table, or create a new MSI-X table. > > > Still, the key is who owns the routing/MSI-X table. If the kernel owns > > > it, it would be straightforward to intercept all the MMIO in the kernel; > > > but if it's QEmu, we still need to go back to QEmu for it. > > Looks cleaner to do it in the kernel... > > > > > Notice the mask/unmask bits must be handled together, either in > > > > > kernel or in userspace.
Because if the kernel has handled the enabled > > > > > vector's mask bit directly, it would be out of sync with QEmu's records. > > > > > It doesn't matter as long as QEmu doesn't access the related record. And the > > > > > only place QEmu wants to consult its enabled entries' mask bit state > > > > > is when writing to MSI addr/data. The write should be discarded if the > > > > > entry is unmasked. This check has already been done by the kernel in > > > > > this patchset, so we are fine here. > > > > > > > > > > If we want to access the enabled entries' mask bits in the future, we > > > > > can directly access the device's MMIO. > > > > We really must implement this for correctness, btw. If you do not pass > > > > reads to the device, messages intended for the masked entry > > > > might still be in flight. > > > Oh, yes, the kernel would mask the device as well. I will take this > > > into consideration. > > > -- > > > regards > > > Yang, Sheng > > > > > That's the reason why I have followed Michael's advice to use > > > > > mask/unmask directly. > > > > > Hope this makes the patches more clear. I meant to add comments > > > > > for this changeset, but missed it later. > > > > > -- > > > > > regards > > > > > Yang, Sheng > > > > > > Signed-off-by: Sheng Yang <sheng@xxxxxxxxxxxxxxx> > > > > > > --- > > > > > > > > > > > > Documentation/kvm/api.txt | 22 ++++++++++++++++ > > > > > > arch/x86/kvm/x86.c | 6 ++++ > > > > > > include/linux/kvm.h | 8 +++++- > > > > > > virt/kvm/assigned-dev.c | 60 > > > > > > > > > > > > +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, > > > > > > 95 insertions(+), 1 deletions(-) > > > > > > > > > > > > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt > > > > > > index d82d637..f324a50 100644 > > > > > > --- a/Documentation/kvm/api.txt > > > > > > +++ b/Documentation/kvm/api.txt > > > > > > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> > > > > > > > > > > > If any additional field gets added to this structure later on, a > > > > > > bit for > > > > > > > > > > > > that additional piece of information will be set in the flags > > > > > > bitmap. > > > > > > > > > > > > +4.47 KVM_ASSIGN_REG_MSIX_MMIO > > > > > > + > > > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK > > > > > > +Architectures: x86 > > > > > > +Type: vm ioctl > > > > > > +Parameters: struct kvm_assigned_msix_mmio (in) > > > > > > +Returns: 0 on success, !0 on error > > > > > > + > > > > > > +struct kvm_assigned_msix_mmio { > > > > > > + /* Assigned device's ID */ > > > > > > + __u32 assigned_dev_id; > > > > > > + /* MSI-X table MMIO address */ > > > > > > + __u64 base_addr; > > > > > > + /* Must be 0 */ > > > > > > + __u32 flags; > > > > > > + /* Must be 0, reserved for future use */ > > > > > > + __u64 reserved; > > > > > > +}; > > > > > > + > > > > > > +This ioctl would enable in-kernel MSI-X emulation, which would > > > > > > handle MSI-X +mask bit in the kernel. > > > > > > + > > > > > > > > > > > > 5. 
The kvm_run structure > > > > > > > > > > > > Application code obtains a pointer to the kvm_run structure by > > > > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > > index fc62546..ba07a2f 100644 > > > > > > --- a/arch/x86/kvm/x86.c > > > > > > +++ b/arch/x86/kvm/x86.c > > > > > > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext) > > > > > > > > > > > > case KVM_CAP_X86_ROBUST_SINGLESTEP: > > > > > > case KVM_CAP_XSAVE: > > > > > > > > > > > > case KVM_CAP_ENABLE_CAP: > > > > > > + case KVM_CAP_DEVICE_MSIX_EXT: > > > > > > > > > > > > + case KVM_CAP_DEVICE_MSIX_MASK: > > > > > > r = 1; > > > > > > break; > > > > > > > > > > > > case KVM_CAP_COALESCED_MMIO: > > > > > > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct > > > > > > kvm_vcpu *vcpu, return -EINVAL; > > > > > > > > > > > > switch (cap->cap) { > > > > > > > > > > > > + case KVM_CAP_DEVICE_MSIX_EXT: > > > > > > + vcpu->kvm->arch.msix_flags_enabled = true; > > > > > > + r = 0; > > > > > > + break; > > > > > > > > > > > > default: > > > > > > r = -EINVAL; > > > > > > break; > > > > > > > > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h > > > > > > index 0a7bd34..1494ed0 100644 > > > > > > --- a/include/linux/kvm.h > > > > > > +++ b/include/linux/kvm.h > > > > > > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo { > > > > > > > > > > > > #endif > > > > > > #define KVM_CAP_PPC_GET_PVINFO 57 > > > > > > #define KVM_CAP_PPC_IRQ_LEVEL 58 > > > > > > > > > > > > +#ifdef __KVM_HAVE_MSIX > > > > > > +#define KVM_CAP_DEVICE_MSIX_EXT 59 > > > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 60 > > > > > > +#endif > > > > > > > > > > > > #ifdef KVM_CAP_IRQ_ROUTING > > > > > > > > > > > > @@ -671,6 +675,8 @@ struct kvm_clock_data { > > > > > > > > > > > > #define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct > > > > > > > > > > > > kvm_xen_hvm_config) #define KVM_SET_CLOCK _IOW(KVMIO, > > > > > > 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK > > > > > 
> _IOR(KVMIO, 0x7c, struct kvm_clock_data) +#define > > > > > > KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO, 0x7d, \ > > > > > > + struct kvm_assigned_msix_mmio) > > > > > > > > > > > > /* Available with KVM_CAP_PIT_STATE2 */ > > > > > > #define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct > > > > > > > > > > > > kvm_pit_state2) #define KVM_SET_PIT2 _IOW(KVMIO, > > > > > > 0xa0, struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct > > > > > > kvm_assigned_msix_mmio { > > > > > > > > > > > > __u32 assigned_dev_id; > > > > > > __u64 base_addr; > > > > > > __u32 flags; > > > > > > > > > > > > - __u32 reserved[2]; > > > > > > + __u64 reserved; > > > > > > > > > > > > }; > > > > > > > > > > > > #endif /* __LINUX_KVM_H */ > > > > > > > > > > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c > > > > > > index 5d2adc4..9573194 100644 > > > > > > --- a/virt/kvm/assigned-dev.c > > > > > > +++ b/virt/kvm/assigned-dev.c > > > > > > @@ -17,6 +17,8 @@ > > > > > > > > > > > > #include <linux/pci.h> > > > > > > #include <linux/interrupt.h> > > > > > > #include <linux/slab.h> > > > > > > > > > > > > +#include <linux/irqnr.h> > > > > > > + > > > > > > > > > > > > #include "irq.h" > > > > > > > > > > > > static struct kvm_assigned_dev_kernel > > > > > > *kvm_find_assigned_dev(struct > > > > > > > > > > > > list_head *head, @@ -169,6 +171,14 @@ static void > > > > > > deassign_host_irq(struct kvm *kvm, */ > > > > > > > > > > > > if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) { > > > > > > > > > > > > int i; > > > > > > > > > > > > +#ifdef __KVM_HAVE_MSIX > > > > > > + if (assigned_dev->msix_mmio_base) { > > > > > > + mutex_lock(&kvm->slots_lock); > > > > > > + kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, > > > > > > + &assigned_dev->msix_mmio_dev); > > > > > > + mutex_unlock(&kvm->slots_lock); > > > > > > + } > > > > > > +#endif > > > > > > > > > > > > for (i = 0; i < assigned_dev->entries_nr; i++) > > > > > > > > > > > > disable_irq_nosync(assigned_dev-> > > > > > > > > 
> > > > host_msix_entries[i].vector); > > > > > > > > > > > > @@ -318,6 +328,15 @@ static int > > > > > > assigned_device_enable_host_msix(struct kvm *kvm, goto err; > > > > > > > > > > > > } > > > > > > > > > > > > + if (dev->msix_mmio_base) { > > > > > > + mutex_lock(&kvm->slots_lock); > > > > > > + r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, > > > > > > + &dev->msix_mmio_dev); > > > > > > + mutex_unlock(&kvm->slots_lock); > > > > > > + if (r) > > > > > > + goto err; > > > > > > + } > > > > > > + > > > > > > > > > > > > return 0; > > > > > > > > > > > > err: > > > > > > for (i -= 1; i >= 0; i--) > > > > > > > > > > > > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops > > > > > > msix_mmio_ops = { .write = msix_mmio_write, > > > > > > > > > > > > }; > > > > > > > > > > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm, > > > > > > + struct kvm_assigned_msix_mmio *msix_mmio) > > > > > > +{ > > > > > > + int r = 0; > > > > > > + struct kvm_assigned_dev_kernel *adev; > > > > > > + > > > > > > + mutex_lock(&kvm->lock); > > > > > > + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head, > > > > > > + msix_mmio->assigned_dev_id); > > > > > > + if (!adev) { > > > > > > + r = -EINVAL; > > > > > > + goto out; > > > > > > + } > > > > > > + if (msix_mmio->base_addr == 0) { > > > > > > + r = -EINVAL; > > > > > > + goto out; > > > > > > + } > > > > > > + adev->msix_mmio_base = msix_mmio->base_addr; > > > > > > + > > > > > > + kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops); > > > > > > +out: > > > > > > + mutex_unlock(&kvm->lock); > > > > > > + > > > > > > + return r; > > > > > > +} > > > > > > > > > > > > #endif > > > > > > > > > > > > long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl, > > > > > > > > > > > > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm > > > > > > *kvm, unsigned ioctl, goto out; > > > > > > > > > > > > break; > > > > > > > > > > > > } > > > > > > > > > > > > + case 
KVM_ASSIGN_REG_MSIX_MMIO: { > > > > > > + struct kvm_assigned_msix_mmio msix_mmio; > > > > > > + > > > > > > + r = -EFAULT; > > > > > > + if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio))) > > > > > > + goto out; > > > > > > + > > > > > > + r = -EINVAL; > > > > > > + if (msix_mmio.flags != 0 || msix_mmio.reserved != 0) > > > > > > + goto out; > > > > > > + > > > > > > + r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio); > > > > > > + if (r) > > > > > > + goto out; > > > > > > + break; > > > > > > + } > > > > > > > > > > > > #endif > > > > > > } > > > > > > out:
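For reference, here is a small self-contained sketch of the argument validation the new ioctl performs (the struct layout is copied from the patch; check_msix_mmio() is a hypothetical stand-in combining the checks done in kvm_vm_ioctl_assigned_device() and kvm_vm_ioctl_register_msix_mmio(), not a real kernel function):

```c
#include <errno.h>
#include <stdint.h>

/* Mirrors the struct the patch adds to include/linux/kvm.h. */
struct kvm_assigned_msix_mmio {
    uint32_t assigned_dev_id;  /* Assigned device's ID */
    uint64_t base_addr;        /* MSI-X table MMIO address */
    uint32_t flags;            /* Must be 0 */
    uint64_t reserved;         /* Must be 0, reserved for future use */
};

/* Hypothetical stand-in for the kernel-side sanity checks applied
 * before the MSI-X MMIO device is registered. */
static int check_msix_mmio(const struct kvm_assigned_msix_mmio *m)
{
    if (m->flags != 0 || m->reserved != 0)
        return -EINVAL;        /* reserved fields must stay zero */
    if (m->base_addr == 0)
        return -EINVAL;        /* MSI-X table MMIO base must be set */
    return 0;
}
```

Userspace would fill this struct and issue ioctl(vm_fd, KVM_ASSIGN_REG_MSIX_MMIO, &m); rejecting any nonzero flags or reserved bits up front is what lets those fields acquire meaning in later kernel versions without breaking old userspace.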