Re: [PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Wed, 13 Oct 2010 12:35:22 +0200

On Wed, Oct 13, 2010 at 09:03:11AM +0800, Sheng Yang wrote:
> On Wednesday 13 October 2010 02:28:59 Michael S. Tsirkin wrote:
> > On Tue, Oct 12, 2010 at 02:49:58PM +0800, Sheng Yang wrote:
> > > On Monday 11 October 2010 18:01:00 Michael S. Tsirkin wrote:
> > > > On Mon, Oct 11, 2010 at 05:28:30PM +0800, Sheng Yang wrote:
> > > > > On Sunday 03 October 2010 19:12:47 Michael S. Tsirkin wrote:
> > > > > > On Tue, Sep 28, 2010 at 05:44:10PM +0800, Sheng Yang wrote:
> > > > > > > This patch enable per-vector mask for assigned devices using
> > > > > > > MSI-X.
> > > > > > > 
> > > > > > > Signed-off-by: Sheng Yang <sheng@xxxxxxxxxxxxxxx>
> > > > > > 
> > > > > > I think I see an issue here, noted below.  Some general comments:
> > > > > > - The mask bit seems broken for injecting interrupts from
> > > > > > 
> > > > > >   userspace (with interrupts/ioctls).
> > > > > >   I think we must test it on injection path.
> > > > > 
> > > > > I am not quite understand how it related to userspace interrupt
> > > > > injection here... This patch only cover assigned devices for now.
> > > > 
> > > > Well, this is a kernel/userspace interface, if it's broken for
> > > > userspace injection now we'll have to go through pain to fix it in a
> > > > compatible way later when we want to use it for userspace injection.
> > > > You might want to ask why we want the kernel to support making
> > > > userspace-injected interrupts when userspace can just avoid injecting
> > > > them, and the answer would be that with irqfd the injection might be
> > > > handled in a separate process.
> > > 
> > > OK, I've understood how it related to userspace interrupt injection. But
> > > I still can't see why the interface is broken...
> > > 
> > > > We currently handle this by destroying irqfd when irq is masked,
> > > > an ioctl instead would be much faster.
> > > > 
> > > > > > - We'll need a way to support the pending bit.
> > > > > > 
> > > > > >   Disabling interrupts will not let us do it.
> > > > > 
> > > > > We may need a way to support pending bit, though I don't know which
> > > > > guest has used it... And we can still know if there is interrupt
> > > > > pending by check the real hardware's pending bit if it's necessary.
> > > > 
> > > > That's what I'm saying: since instead of masking the vector in hardware
> > > > you disable irq in the APIC, the pending bit that we read from hardware
> > > > will not have the correct value.
> > > 
> > > Are you sure? This disable_irq() has nothing to do with APIC. The disable
> > > callback in msi_chip didn't do anything but mark the IRQ status as
> > > IRQ_DISABLED, and the follow interrupt(if there are any) would be acked
> > > and masked, using mask callback in msi_chip.
> > > 
> > > irq_to_desc() need to be exported for my initial version, in order to use
> > > mask callback. But later I think it would be clear and better if we use
> > > general IRQ function to do it. And I don't think the current solution
> > > would prevent us from reading hardware pending bits.
> > 
> > Not sure, I'll try to look into code later,  but just based on this
> > description:
> > 
> > on real hardware:
> > 	mask
> > 	interrupt
> > results in both pending bit being set
> > 
> > on guest with assigned device
> > 	mask
> > 	interrupt
> > results in mask being set but pending bit not set
> > (as interrupt was already sent)
> > 
> > So if we try to look at pending bits looks like we'll miss
> > some interrupts. No?
> 
> Yes, there is one interrupt would fail to set the pending bit. But I don't think 
> it worth export/adding new interface for core irq handling functions now...

Yea, maybe. For assigned devices pending bit is not currently
implemented, so at least it's not a regression, and we can delay
fixing this until we have VFIO.

> Still, 
> I don't think pending bit matters much. And with current code, we can still make 
> pending bit works well on the most condition. I think that's good enough. 

Hmm, my guess is it's probably better to have a constant 0 there than a
subtle race that will only trigger under load.

> --
> regards
> Yang, Sheng
> 
> > 
> > > --
> > > regards
> > > Yang, Sheng
> > > 
> > > > If we fix this, pending bit handling can be done by userspace.
> > > > 
> > > > > (And we haven't seen any problem by
> > > > > leaving the bit 0 so far, and it's not in this patch's scope.)
> > > > 
> > > > I don't know about anyone using this, either, but the PCI spec does
> > > > require support of polling mode where the pending bit is polled instead
> > > > of interrupts. So yes, not a high priority to implement, but let's give
> > > > the way we intend to support this in the future some thought.
> > > > 
> > > > > > > ---
> > > > > > > 
> > > > > > >  arch/x86/kvm/x86.c       |    1 +
> > > > > > >  include/linux/kvm.h      |    9 ++++++++-
> > > > > > >  include/linux/kvm_host.h |    1 +
> > > > > > >  virt/kvm/assigned-dev.c  |   39
> > > > > 
> > > > > +++++++++++++++++++++++++++++++++++++++
> > > > > 
> > > > > > >  4 files changed, 49 insertions(+), 1 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > index 8412c91..e6933e6 100644
> > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > @@ -1927,6 +1927,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > > > 
> > > > > > >  	case KVM_CAP_DEBUGREGS:
> > > > > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > > > > 
> > > > > > >  	case KVM_CAP_XSAVE:
> > > > > > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > > > > > >  		r = 1;
> > > > > > >  		break;
> > > > > > >  	
> > > > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > > > index 919ae53..f2b7cdc 100644
> > > > > > > --- a/include/linux/kvm.h
> > > > > > > +++ b/include/linux/kvm.h
> > > > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > > > > 
> > > > > > >  #endif
> > > > > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > > > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > > > 
> > > > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > > > > +#endif
> > > > > > > 
> > > > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > > > > 
> > > > > > > @@ -787,11 +790,15 @@ struct kvm_assigned_msix_nr {
> > > > > > > 
> > > > > > >  };
> > > > > > >  
> > > > > > >  #define KVM_MAX_MSIX_PER_DEV		256
> > > > > > > 
> > > > > > > +
> > > > > > > +#define KVM_MSIX_FLAG_MASK	1
> > > > > > > +
> > > > > > > 
> > > > > > >  struct kvm_assigned_msix_entry {
> > > > > > >  
> > > > > > >  	__u32 assigned_dev_id;
> > > > > > >  	__u32 gsi;
> > > > > > >  	__u16 entry; /* The index of entry in the MSI-X table */
> > > > > > > 
> > > > > > > -	__u16 padding[3];
> > > > > > > +	__u16 flags;
> > > > > > > +	__u16 padding[2];
> > > > > > > 
> > > > > > >  };
> > > > > > >  
> > > > > > >  #endif /* __LINUX_KVM_H */
> > > > > > > 
> > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > index 0b89d00..a43405c 100644
> > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > @@ -415,6 +415,7 @@ struct kvm_irq_ack_notifier {
> > > > > > > 
> > > > > > >  };
> > > > > > >  
> > > > > > >  #define KVM_ASSIGNED_MSIX_PENDING		0x1
> > > > > > > 
> > > > > > > +#define KVM_ASSIGNED_MSIX_MASK			0x2
> > > > > > > 
> > > > > > >  struct kvm_guest_msix_entry {
> > > > > > >  
> > > > > > >  	u32 vector;
> > > > > > >  	u16 entry;
> > > > > > > 
> > > > > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > > > > > index 7c98928..15b8c32 100644
> > > > > > > --- a/virt/kvm/assigned-dev.c
> > > > > > > +++ b/virt/kvm/assigned-dev.c
> > > > > > > @@ -17,6 +17,8 @@
> > > > > > > 
> > > > > > >  #include <linux/pci.h>
> > > > > > >  #include <linux/interrupt.h>
> > > > > > >  #include <linux/slab.h>
> > > > > > > 
> > > > > > > +#include <linux/irqnr.h>
> > > > > > > +
> > > > > > > 
> > > > > > >  #include "irq.h"
> > > > > > >  
> > > > > > >  static struct kvm_assigned_dev_kernel
> > > > > > >  *kvm_find_assigned_dev(struct list_head *head,
> > > > > > > 
> > > > > > > @@ -666,6 +668,30 @@ msix_nr_out:
> > > > > > >  	return r;
> > > > > > >  
> > > > > > >  }
> > > > > > > 
> > > > > > > +static void update_msix_mask(struct kvm_assigned_dev_kernel
> > > > > > > *assigned_dev, +			     int index)
> > > > > > > +{
> > > > > > > +	int irq;
> > > > > > > +
> > > > > > > +	if (!assigned_dev->dev->msix_enabled ||
> > > > > > > +	    !(assigned_dev->irq_requested_type &
> > > > > > > KVM_DEV_IRQ_HOST_MSIX)) +		return;
> > > > > > > +
> > > > > > > +	irq = assigned_dev->host_msix_entries[index].vector;
> > > > > > > +
> > > > > > > +	ASSERT(irq != 0);
> > > > > > > +
> > > > > > > +	if (assigned_dev->guest_msix_entries[index].flags &
> > > > > > > +			KVM_ASSIGNED_MSIX_MASK)
> > > > > > > +		disable_irq(irq);
> > > > > > > +	else {
> > > > > > > +		enable_irq(irq);
> > > > > > > +		if (assigned_dev->guest_msix_entries[index].flags &
> > > > > > > +				KVM_ASSIGNED_MSIX_PENDING)
> > > > > > > +			schedule_work(&assigned_dev->interrupt_work);
> > > > > > > +	}
> > > > > > > +}
> > > > > > > +
> > > > > > 
> > > > > > What happens if guest masks an entry and then we hot-unplug the
> > > > > > device and remove it from guest? It looks like interrupt
> > > > > > will stay disabled?
> > > > > 
> > > > > I don't think so. pci_disable_msix() which was called in hot-unplug
> > > > > path would recycle all IRQs used by the device. It should be the
> > > > > same as VM shutdown.
> > > > > 
> > > > > Also before the IRQ was recycled, I believe the same dynamic IRQ
> > > > > wouldn't be used by other devices.
> > > > > 
> > > > > --
> > > > > regards
> > > > > Yang, Sheng
> > > > > 
> > > > > > >  static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> > > > > > >  
> > > > > > >  				       struct kvm_assigned_msix_entry *entry)
> > > > > > >  
> > > > > > >  {
> > > > > > > 
> > > > > > > @@ -688,6 +714,19 @@ static int
> > > > > > > kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> > > > > > > 
> > > > > > >  			adev->guest_msix_entries[i].entry = entry->entry;
> > > > > > >  			adev->guest_msix_entries[i].vector = entry->gsi;
> > > > > > >  			adev->host_msix_entries[i].entry = entry->entry;
> > > > > > > 
> > > > > > > +			if ((entry->flags & KVM_MSIX_FLAG_MASK) &&
> > > > > > > +					!(adev->guest_msix_entries[i].flags &
> > > > > > > +					KVM_ASSIGNED_MSIX_MASK)) {
> > > > > > > +				adev->guest_msix_entries[i].flags |=
> > > > > > > +					KVM_ASSIGNED_MSIX_MASK;
> > > > > > > +				update_msix_mask(adev, i);
> > > > > > > +			} else if (!(entry->flags & KVM_MSIX_FLAG_MASK) &&
> > > > > > > +					(adev->guest_msix_entries[i].flags &
> > > > > > > +					KVM_ASSIGNED_MSIX_MASK)) {
> > > > > > > +				adev->guest_msix_entries[i].flags &=
> > > > > > > +					~KVM_ASSIGNED_MSIX_MASK;
> > > > > > > +				update_msix_mask(adev, i);
> > > > > > > +			}
> > > > > > > 
> > > > > > >  			break;
> > > > > > >  		
> > > > > > >  		}
> > > > > > >  	
> > > > > > >  	if (i == adev->entries_nr) {
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html