Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices

Sheng Yang <sheng@xxxxxxxxxxxxxxx> · Fri, 22 Oct 2010 21:30:09 +0800

On Friday 22 October 2010 18:17:05 Michael S. Tsirkin wrote:
> On Fri, Oct 22, 2010 at 12:42:43PM +0800, Sheng Yang wrote:
> > On Thursday 21 October 2010 16:39:07 Michael S. Tsirkin wrote:
> > > On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote:
> > > > On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote:
> > > > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > > 
> > > > The basic idea of kernel and QEmu's responsibilities are:
> > > > 
> > > > 1. Because QEmu owned the irq routing table, so the change of table
> > > > should still go to the QEmu, like we did in msix_mmio_write().
> > > > 
> > > > 2. And the others things can be done in kernel, for performance. Here
> > > > we covered the reading(converted entry from routing table and mask
> > > > bit state of enabled MSI-X entries), and writing the mask bit for
> > > > enabled MSI-X entries. Originally we only has mask bit handled in
> > > > kernel, but later we found that Linux kernel would read MSI-X mmio
> > > > just after every writing to mask bit, in order to flush the writing.
> > > > So we add reading MSI data/addr as well.
> > > > 
> > > > 3. Disabled entries's mask bit accessing would go to QEmu, because it
> > > > may result in disable/enable MSI-X. Explained later.
> > > > 
> > > > 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu
> > > > to decide enable/disable MSI-X for device.
> > > > .
> > > 
> > > Config space yes, but it's a simple global yes/no after all.
> > > 
> > > > 5. There is an distinction between enabled entry and disabled entry
> > > > of MSI-X table.
> > > 
> > > That's my point. There's no such thing as 'enabled entries'
> > > in the spec. There are only masked and unmasked entries.
> > > 
> > > Current interface deals with gsi numbers so qemu had to work around
> > > this. The hack used there is removing gsi for masked vector which has 0
> > > address and data.  It works because this is what linux and windows
> > > guests happen to do, but it is out of spec: vector/data value for a
> > > masked entry have no meaning.
> > 
> > Well, I just realized something unnatural about the 0 contained
> > data/address entry. So I checked spec again, and found the mask bit
> > should be set after reset. So after fix this, I think unmasked 0
> > address/data entry shouldn't be there anymore.
> 
> You are right that this 0 check is completely out of spec.  But see
> below. The issue is the spec does not require  you to mask an entry if
> the device does not use it. And we do not want to waste host vectors.
> One can imagine a logic where we would detect that an interrupt
> has not been used in a long while, mask it and give up
> a host vector. Then check the pending bit once in a while to
> see whether device started using it.

I don't think introducing this complex and speculative logic makes sense. I 
haven't seen any alike scenario yet. What's the issue of current implemenation?

> > > Since you are building a new interface, can design it without
> > > constraints...
> > 
> > A constraint is pci_enable_msix().
> 
> It's an internal kernel API. No one prevents us from changing it.

You can say no one. But I'm afraid if we want to overhaul this kind of core PCI 
functions, it may be take months to get it checked in upstream - also assume we 
can persuade them this overhaul is absolutely needed (I hope I'm wrong on this). 
This may also means we have to find out another user for this kind of change. The 
key issue is I don't know what we can gain for certain from it. Current 
disable/enable mechanism still works well. I don't know why we need spend a lot of 
effort on this just because spec don't say there is "enabled/disabled entries". 

Yes it's not that elegant, but we need carefully think if the effort worth it.
> 
> > We have to use it to allocate irq for each
> > entry, as well as program the entry in the real hardware.
> > pci_enable_msix() is only a yes/no choice. We can't add new enabled
> > entries after pci_enable_msix(),
> 
> With the current APIs.
> 
> > and we can only enable/disable/mask/unmask one IRQ through kernel API,
> > not the entry in the MSI-X table. And we still have to allocate new IRQ
> > for new entry. When guest unmask "disabled entry", we have to disable
> > and enable MSI-X again in order to use the new entry. That's why
> > "enabled/disabled entry" concept existed.
> > 
> > So even guest only unmasked one entry, it's a totally different work for
> > KVM underlaying. This logic won't change no matter where the MMIO
> > handling is. And in fact I don't like this kind of tricky things in
> > kernel...
> 
> A more fundamental problem is that host vectors are a limited resource,
> we don't want to waste them on entries that will end up unused.
> One can imagine some kind of logic where we check the pending
> bit on a masked entry and after a while give up the host vector.

For the data/address != 0 entries, I haven't seen any of them leave unused in the 
end. So I don't understand your point here.
> 
> This is what I said: making it spec compliant is harder as it will
> need core kernel changes. Still, it seems silly to design a
> kerne/userspace API around an internal API limitation ...

I still don't think we violate the spec here, in which case we fail to comply the 
spec? If some devices want to send 0 message to 0 address on x86, it's fail to 
comply the x86 spec.

And I know the current implementation is not elegant due to internal API 
limitation, but only the word "change" is not enough. Function to enable separate 
entry is of course good to have, but I still think we don't have enough reason for 
doing it. 

--
regards
Yang, Sheng
> 
> > > > The entries we had used for pci_enable_msix()(not necessary in
> > > > sequence number) are already enabled, the others are disabled. When
> > > > device's MSI-X is enabled and guest want to enable an disabled
> > > > entry, we would go back to QEmu because this vector didn't exist in
> > > > the routing table. Also due to pci_enable_msix() in kernel didn't
> > > > allow us to enable vectors one by one, but all at once. So we have
> > > > to disable MSI-X first, then enable it with new entries, which
> > > > contained the new vector guest want to use. This situation is only
> > > > happen when device is being initialized. After that, kernel can know
> > > > and handle the mask bit of the enabled entry.
> > > > 
> > > > I've also considered handle all MMIO operation in kernel, and
> > > > changing irq routing in kernel directly. But as long as irq routing
> > > > is owned by QEmu, I think it's better to leave to it...
> > > 
> > > Yes, this is my suggestion, except we don't need no routing :)
> > > To inject MSI you just need address/data pairs.
> > > Look at kvm_set_msi: given address/data you can just inject
> > > the interrupt. No need for table lookups.
> > 
> > You still need to look up data/address pair in the guest MSI-X table. The
> > routing table used here is just an replacement for the table, because we
> > can construct the entry according to the routing table. Two choices,
> > using the routing table, or creating an new MSI-X table.
> > 
> > Still, the key is about who to own the routing/MSI-X table. If kernel own
> > it, it would be straightforward to intercept all the MMIO in the kernel;
> > but if it's QEmu, we still need go back to QEmu for it.
> 
> Looks cleaner to do it in kernel...
> 
> > > > Notice the mask/unmask bits must be handled together, either in
> > > > kernel or in userspace. Because if kernel has handled enabled
> > > > vector's mask bit directly, it would be unsync with QEmu's records.
> > > > It doesn't matter when QEmu don't access the related record. And the
> > > > only place QEmu want to consult it's enabled entries' mask bit state
> > > > is writing to MSI addr/data. The writing should be discarded if the
> > > > entry is unmasked. This checking has already been done by kernel in
> > > > this patchset, so we are fine here.
> > > > 
> > > > If we want to access the enabled entries' mask bit in the future, we
> > > > can directly access device's MMIO.
> > > 
> > > We really must implement this for correctness, btw. If you do not pass
> > > reads to the device, messages intended for the masked entry
> > > might still be in flight.
> > 
> > Oh, yes, kernel would also mask the device as well. I would take this
> > into consideration.
> > 
> > --
> > regards
> > Yang, Sheng
> > 
> > > > That's the reason why I have followed Michael's advice to use
> > > > mask/unmask directly.
> > > > Hope this would make the patches more clear. I meant to add comments
> > > > for this changeset, but miss it later.
> > > > 
> > > > --
> > > > regards
> > > > Yang, Sheng
> > > > 
> > > > > Signed-off-by: Sheng Yang <sheng@xxxxxxxxxxxxxxx>
> > > > > ---
> > > > > 
> > > > >  Documentation/kvm/api.txt |   22 ++++++++++++++++
> > > > >  arch/x86/kvm/x86.c        |    6 ++++
> > > > >  include/linux/kvm.h       |    8 +++++-
> > > > >  virt/kvm/assigned-dev.c   |   60
> > > > > 
> > > > > +++++++++++++++++++++++++++++++++++++++++++++ 4 files 
changed,
> > 
> > 95
> > 
> > > > > insertions(+), 1 deletions(-)
> > > > > 
> > > > > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > > > > index d82d637..f324a50 100644
> > > > > --- a/Documentation/kvm/api.txt
> > > > > +++ b/Documentation/kvm/api.txt
> > > > > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> > > > > 
> > > > >  If any additional field gets added to this structure later on, a
> > > > >  bit for
> > > > > 
> > > > > that additional piece of information will be set in the flags
> > > > > bitmap.
> > > > > 
> > > > > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > > > > +
> > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > > +Architectures: x86
> > > > > +Type: vm ioctl
> > > > > +Parameters: struct kvm_assigned_msix_mmio (in)
> > > > > +Returns: 0 on success, !0 on error
> > > > > +
> > > > > +struct kvm_assigned_msix_mmio {
> > > > > +	/* Assigned device's ID */
> > > > > +	__u32 assigned_dev_id;
> > > > > +	/* MSI-X table MMIO address */
> > > > > +	__u64 base_addr;
> > > > > +	/* Must be 0 */
> > > > > +	__u32 flags;
> > > > > +	/* Must be 0, reserved for future use */
> > > > > +	__u64 reserved;
> > > > > +};
> > > > > +
> > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > handle MSI-X +mask bit in the kernel.
> > > > > +
> > > > > 
> > > > >  5. The kvm_run structure
> > > > >  
> > > > >  Application code obtains a pointer to the kvm_run structure by
> > > > > 
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index fc62546..ba07a2f 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > 
> > > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > >  	case KVM_CAP_XSAVE:
> > > > > 
> > > > >  	case KVM_CAP_ENABLE_CAP:
> > > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > > 
> > > > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > > > >  		r = 1;
> > > > >  		break;
> > > > >  	
> > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct
> > > > > kvm_vcpu *vcpu, return -EINVAL;
> > > > > 
> > > > >  	switch (cap->cap) {
> > > > > 
> > > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > > +		vcpu->kvm->arch.msix_flags_enabled = true;
> > > > > +		r = 0;
> > > > > +		break;
> > > > > 
> > > > >  	default:
> > > > >  		r = -EINVAL;
> > > > >  		break;
> > > > > 
> > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > index 0a7bd34..1494ed0 100644
> > > > > --- a/include/linux/kvm.h
> > > > > +++ b/include/linux/kvm.h
> > > > > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
> > > > > 
> > > > >  #endif
> > > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > 
> > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > +#define KVM_CAP_DEVICE_MSIX_EXT 59
> > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 60
> > > > > +#endif
> > > > > 
> > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > > 
> > > > > @@ -671,6 +675,8 @@ struct kvm_clock_data {
> > > > > 
> > > > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > > > > 
> > > > > kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO,
> > > > > 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK
> > > > > _IOR(KVMIO, 0x7c, struct kvm_clock_data) +#define
> > > > > KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO,  0x7d, \
> > > > > +					struct kvm_assigned_msix_mmio)
> > > > > 
> > > > >  /* Available with KVM_CAP_PIT_STATE2 */
> > > > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > > > > 
> > > > > kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO, 
> > > > > 0xa0, struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct
> > > > > kvm_assigned_msix_mmio {
> > > > > 
> > > > >  	__u32 assigned_dev_id;
> > > > >  	__u64 base_addr;
> > > > >  	__u32 flags;
> > > > > 
> > > > > -	__u32 reserved[2];
> > > > > +	__u64 reserved;
> > > > > 
> > > > >  };
> > > > >  
> > > > >  #endif /* __LINUX_KVM_H */
> > > > > 
> > > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > > > index 5d2adc4..9573194 100644
> > > > > --- a/virt/kvm/assigned-dev.c
> > > > > +++ b/virt/kvm/assigned-dev.c
> > > > > @@ -17,6 +17,8 @@
> > > > > 
> > > > >  #include <linux/pci.h>
> > > > >  #include <linux/interrupt.h>
> > > > >  #include <linux/slab.h>
> > > > > 
> > > > > +#include <linux/irqnr.h>
> > > > > +
> > > > > 
> > > > >  #include "irq.h"
> > > > >  
> > > > >  static struct kvm_assigned_dev_kernel
> > > > >  *kvm_find_assigned_dev(struct
> > > > > 
> > > > > list_head *head, @@ -169,6 +171,14 @@ static void
> > > > > deassign_host_irq(struct kvm *kvm, */
> > > > > 
> > > > >  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> > > > >  	
> > > > >  		int i;
> > > > > 
> > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > +		if (assigned_dev->msix_mmio_base) {
> > > > > +			mutex_lock(&kvm->slots_lock);
> > > > > +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > > > +					&assigned_dev->msix_mmio_dev);
> > > > > +			mutex_unlock(&kvm->slots_lock);
> > > > > +		}
> > > > > +#endif
> > > > > 
> > > > >  		for (i = 0; i < assigned_dev->entries_nr; i++)
> > > > >  		
> > > > >  			disable_irq_nosync(assigned_dev->
> > > > >  			
> > > > >  					   host_msix_entries[i].vector);
> > > > > 
> > > > > @@ -318,6 +328,15 @@ static int
> > > > > assigned_device_enable_host_msix(struct kvm *kvm, goto err;
> > > > > 
> > > > >  	}
> > > > > 
> > > > > +	if (dev->msix_mmio_base) {
> > > > > +		mutex_lock(&kvm->slots_lock);
> > > > > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > > +				&dev->msix_mmio_dev);
> > > > > +		mutex_unlock(&kvm->slots_lock);
> > > > > +		if (r)
> > > > > +			goto err;
> > > > > +	}
> > > > > +
> > > > > 
> > > > >  	return 0;
> > > > >  
> > > > >  err:
> > > > >  	for (i -= 1; i >= 0; i--)
> > > > > 
> > > > > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops
> > > > > msix_mmio_ops = { .write    = msix_mmio_write,
> > > > > 
> > > > >  };
> > > > > 
> > > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > > > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > > > > +{
> > > > > +	int r = 0;
> > > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > > +
> > > > > +	mutex_lock(&kvm->lock);
> > > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > > +				      msix_mmio->assigned_dev_id);
> > > > > +	if (!adev) {
> > > > > +		r = -EINVAL;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	if (msix_mmio->base_addr == 0) {
> > > > > +		r = -EINVAL;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > > > > +
> > > > > +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > > > +out:
> > > > > +	mutex_unlock(&kvm->lock);
> > > > > +
> > > > > +	return r;
> > > > > +}
> > > > > 
> > > > >  #endif
> > > > >  
> > > > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > > 
> > > > > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > > > *kvm, unsigned ioctl, goto out;
> > > > > 
> > > > >  		break;
> > > > >  	
> > > > >  	}
> > > > > 
> > > > > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > > > +		struct kvm_assigned_msix_mmio msix_mmio;
> > > > > +
> > > > > +		r = -EFAULT;
> > > > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > > +			goto out;
> > > > > +
> > > > > +		r = -EINVAL;
> > > > > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > > > +			goto out;
> > > > > +
> > > > > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > > > +		if (r)
> > > > > +			goto out;
> > > > > +		break;
> > > > > +	}
> > > > > 
> > > > >  #endif
> > > > >  
> > > > >  	}
> > > > >  
> > > > >  out:
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html