Re: [PATCH] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices

Jan Kiszka <jan.kiszka@xxxxxxxxxxx> · Tue, 10 Jan 2012 18:29:51 +0100

On 2012-01-10 17:17, Michael S. Tsirkin wrote:
> On Mon, Jan 09, 2012 at 03:03:00PM +0100, Jan Kiszka wrote:
>> PCI 2.3 allows to generically disable IRQ sources at device level. This
>> enables us to share legacy IRQs of such devices with other host devices
>> when passing them to a guest.
>>
>> The new IRQ sharing feature introduced here is optional, user space has
>> to request it explicitly. Moreover, user space can inform us about its
>> view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
>> interrupt and signaling it if the guest masked it via the virtualized
>> PCI config space.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@xxxxxxxxxxx>
>> ---
>>
>> This applies to kvm/master after merging
>>
>>   PCI: Rework config space blocking services
>>   PCI: Introduce INTx check & mask API
>>
>> from current linux-next/master. I suppose those two will make it into
>> 3.3.
>>
>> To recall the history of it: I tried hard to implement an adaptive
>> solution that automatically picks the fastest masking technique whenever
>> possible. However, the changes required to the IRQ core subsystem and
>> the logic of the device assignment code became so complex and partly
>> ugly that I gave up on this. It's simply not worth the pain given that
>> legacy PCI interrupts are rarely raised for performance critical device
>> at such a high rate (KHz...) that you can measure the difference.
>>
>>  Documentation/virtual/kvm/api.txt |   27 +++++
>>  arch/x86/kvm/x86.c                |    1 +
>>  include/linux/kvm.h               |    6 +
>>  include/linux/kvm_host.h          |    2 +
>>  virt/kvm/assigned-dev.c           |  208 +++++++++++++++++++++++++++++++-----
>>  5 files changed, 215 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index e1d94bf..670015a 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -1159,6 +1159,14 @@ following flags are specified:
>>
>>  /* Depends on KVM_CAP_IOMMU */
>>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU  (1 << 0)
>> +/* The following two depend on KVM_CAP_PCI_2_3 */
>> +#define KVM_DEV_ASSIGN_PCI_2_3               (1 << 1)
>> +#define KVM_DEV_ASSIGN_MASK_INTX     (1 << 2)
>> +
>> +If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
>> +via the PCI-2.3-compliant device-level mask, thus enable IRQ sharing with other
>> +assigned devices or host devices. KVM_DEV_ASSIGN_MASK_INTX specifies the
>> +guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
>>
>>  The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
>>  isolation of the device.  Usages not specifying this flag are deprecated.
>> @@ -1399,6 +1407,25 @@ The following flags are defined:
>>  If datamatch flag is set, the event will be signaled only if the written value
>>  to the registered address is equal to datamatch in struct kvm_ioeventfd.
>>
>> +4.59 KVM_ASSIGN_SET_INTX_MASK
>> +
>> +Capability: KVM_CAP_PCI_2_3
>> +Architectures: x86
>> +Type: vm ioctl
>> +Parameters: struct kvm_assigned_pci_dev (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +Informs the kernel about the guest's view on the INTx mask.
> 
> A wild idea: since this is guest view of its IRQ,
> can this be specified per guest IRQ+id then?
> That might be useful to support MSIX mask bit emulation.

I do not yet get the full idea: You want some generic
KVM_ASSIGN_SET_IRQ_MASK? What will be the use case in the MSI[X] area?

> 
>> As long as the
>> +guest masks the legacy INTx, the kernel will refrain from unmasking it at
>> +hardware level and will not assert the guest's IRQ line. User space is still
>> +responsible for applying this state to the assigned device's real config space.
> 
> Can this be made more explicit? You mean writing into 1st
> byte of PCI control, right?

For sure, I can state this.

> 
>> +To avoid that the kernel overwrites the state user space wants to set,
>> +KVM_ASSIGN_SET_INTX_MASK has to be called prior to updating the config space.
> 
> This looks like a strange requirement, could you explain how
> this helps avoid races?

By declaring the target state of the INTx bit first to the kernel,
concurrent changes of the kernel while user space performs a
read-modify-write will not lead to an old mask state being written.

> This also raises questions about
> what should be done to write a bit unrelated to masking.

Just write it, using the INTx state user space maintains. In the worst
case, some masking done by the kernel in the meantime will be
overwritten, leading to a single spurious but harmless IRQ. That event
won't be delivered to the guest unless it is ready to receive it - as we
updated the mask state prior to writing to the config space. The point
is that the kernel mechanism has to deal with crazy user space clearing
the mask for whatever reason again.

> 
>> +
>> +See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
>> +by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
>> +evaluated.
>> +
>>  4.62 KVM_CREATE_SPAPR_TCE
>>
>>  Capability: KVM_CAP_SPAPR_TCE
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1171def..9381806 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -2057,6 +2057,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>>       case KVM_CAP_XSAVE:
>>       case KVM_CAP_ASYNC_PF:
>>       case KVM_CAP_GET_TSC_KHZ:
>> +     case KVM_CAP_PCI_2_3:
>>               r = 1;
>>               break;
>>       case KVM_CAP_COALESCED_MMIO:
>> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
>> index 68e67e5..da5f7b7 100644
>> --- a/include/linux/kvm.h
>> +++ b/include/linux/kvm.h
>> @@ -556,6 +556,7 @@ struct kvm_ppc_pvinfo {
>>  #define KVM_CAP_PPC_RMA      65
>>  #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
>>  #define KVM_CAP_PPC_PAPR 68
>> +#define KVM_CAP_PCI_2_3 69
>>  #define KVM_CAP_S390_GMAP 71
>>  #define KVM_CAP_TSC_DEADLINE_TIMER 72
>>
>> @@ -697,6 +698,9 @@ struct kvm_clock_data {
>>  /* Available with KVM_CAP_TSC_CONTROL */
>>  #define KVM_SET_TSC_KHZ           _IO(KVMIO,  0xa2)
>>  #define KVM_GET_TSC_KHZ           _IO(KVMIO,  0xa3)
>> +/* Available with KVM_CAP_PCI_2_3 */
>> +#define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa4, \
>> +                                    struct kvm_assigned_pci_dev)
>>
>>  /*
>>   * ioctls for vcpu fds
>> @@ -765,6 +769,8 @@ struct kvm_clock_data {
>>  #define KVM_ALLOCATE_RMA       _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
>>
>>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU  (1 << 0)
>> +#define KVM_DEV_ASSIGN_PCI_2_3               (1 << 1)
>> +#define KVM_DEV_ASSIGN_MASK_INTX     (1 << 2)
>>
>>  struct kvm_assigned_pci_dev {
>>       __u32 assigned_dev_id;
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 900c763..07461bd 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -546,6 +546,7 @@ struct kvm_assigned_dev_kernel {
>>       unsigned int entries_nr;
>>       int host_irq;
>>       bool host_irq_disabled;
>> +     bool pci_2_3;
>>       struct msix_entry *host_msix_entries;
>>       int guest_irq;
>>       struct msix_entry *guest_msix_entries;
>> @@ -555,6 +556,7 @@ struct kvm_assigned_dev_kernel {
>>       struct pci_dev *dev;
>>       struct kvm *kvm;
>>       spinlock_t intx_lock;
>> +     struct mutex intx_mask_lock;
> 
> What exactly does this lock protect?
> I see it used sometimes with intx_lock and sometimes without.

The KVM_DEV_ASSIGN_MASK_INTX bit in flags.

> 
>>       char irq_name[32];
>>       struct pci_saved_state *pci_saved_state;
>>  };
>> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
>> index 758e3b3..b35aba9 100644
>> --- a/virt/kvm/assigned-dev.c
>> +++ b/virt/kvm/assigned-dev.c
>> @@ -57,22 +57,66 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
>>       return index;
>>  }
>>
>> -static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
>> +static irqreturn_t kvm_assigned_dev_intx(int irq, void *dev_id)
>>  {
>>       struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
>> +     int ret;
>>
>> -     if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
>> -             spin_lock(&assigned_dev->intx_lock);
>> +     spin_lock(&assigned_dev->intx_lock);
>> +     if (pci_check_and_mask_intx(assigned_dev->dev)) {
>> +             assigned_dev->host_irq_disabled = true;
>> +             ret = IRQ_WAKE_THREAD;
>> +     } else
>> +             ret = IRQ_NONE;
>> +     spin_unlock(&assigned_dev->intx_lock);
>> +
>> +     return ret;
>> +}
>> +
>> +static void
>> +kvm_assigned_dev_raise_guest_irq(struct kvm_assigned_dev_kernel *assigned_dev,
>> +                              int vector)
>> +{
>> +     if (unlikely(assigned_dev->irq_requested_type &
>> +                  KVM_DEV_IRQ_GUEST_INTX)) {
>> +             mutex_lock(&assigned_dev->intx_mask_lock);
>> +             if (!(assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX))
>> +                     kvm_set_irq(assigned_dev->kvm,
>> +                                 assigned_dev->irq_source_id, vector, 1);
>> +             mutex_unlock(&assigned_dev->intx_mask_lock);
>> +     } else
>> +             kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>> +                         vector, 1);
>> +}
>> +
>> +static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id)
>> +{
>> +     struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
>> +
>> +     if (!(assigned_dev->flags & KVM_DEV_ASSIGN_PCI_2_3)) {
>> +             spin_lock_irq(&assigned_dev->intx_lock);
>>               disable_irq_nosync(irq);
>>               assigned_dev->host_irq_disabled = true;
>> -             spin_unlock(&assigned_dev->intx_lock);
>> +             spin_unlock_irq(&assigned_dev->intx_lock);
>>       }
>>
>> -     kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>> -                 assigned_dev->guest_irq, 1);
>> +     kvm_assigned_dev_raise_guest_irq(assigned_dev,
>> +                                      assigned_dev->guest_irq);
>> +
>> +     return IRQ_HANDLED;
>> +}
>> +
>> +#ifdef __KVM_HAVE_MSI
>> +static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
>> +{
>> +     struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
>> +
>> +     kvm_assigned_dev_raise_guest_irq(assigned_dev,
>> +                                      assigned_dev->guest_irq);
>>
>>       return IRQ_HANDLED;
>>  }
>> +#endif
>>
>>  #ifdef __KVM_HAVE_MSIX
>>  static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
>> @@ -83,8 +127,7 @@ static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
>>
>>       if (index >= 0) {
>>               vector = assigned_dev->guest_msix_entries[index].vector;
>> -             kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>> -                         vector, 1);
>> +             kvm_assigned_dev_raise_guest_irq(assigned_dev, vector);
>>       }
>>
>>       return IRQ_HANDLED;
>> @@ -100,15 +143,31 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
>>
>>       kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
>>
>> -     /* The guest irq may be shared so this ack may be
>> -      * from another device.
>> -      */
>> -     spin_lock(&dev->intx_lock);
>> -     if (dev->host_irq_disabled) {
>> -             enable_irq(dev->host_irq);
>> -             dev->host_irq_disabled = false;
>> +     mutex_lock(&dev->intx_mask_lock);
>> +
>> +     if (!(dev->flags & KVM_DEV_ASSIGN_MASK_INTX)) {
>> +             bool reassert = false;
>> +
>> +             spin_lock_irq(&dev->intx_lock);
>> +             /*
>> +              * The guest IRQ may be shared so this ack can come from an
>> +              * IRQ for another guest device.
>> +              */
>> +             if (dev->host_irq_disabled) {
>> +                     if (!(dev->flags & KVM_DEV_ASSIGN_PCI_2_3))
>> +                             enable_irq(dev->host_irq);
>> +                     else if (!pci_check_and_unmask_intx(dev->dev))
>> +                             reassert = true;
>> +                     dev->host_irq_disabled = reassert;
>> +             }
>> +             spin_unlock_irq(&dev->intx_lock);
>> +
>> +             if (reassert)
>> +                     kvm_set_irq(dev->kvm, dev->irq_source_id,
>> +                                 dev->guest_irq, 1);
>>       }
>> -     spin_unlock(&dev->intx_lock);
>> +
>> +     mutex_unlock(&dev->intx_mask_lock);
>>  }
>>
>>  static void deassign_guest_irq(struct kvm *kvm,
>> @@ -156,7 +215,13 @@ static void deassign_host_irq(struct kvm *kvm,
>>               pci_disable_msix(assigned_dev->dev);
>>       } else {
>>               /* Deal with MSI and INTx */
>> -             disable_irq(assigned_dev->host_irq);
>> +             if (assigned_dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
>> +                     spin_lock_irq(&assigned_dev->intx_lock);
> 
> Interesting that the new mutex is not used here.

KVM_DEV_ASSIGN_PCI_2_3 is immutable after setup, and we must disable the
line independent of what user space likes to see.

> 
>> +                     pci_intx(assigned_dev->dev, false);
> 
> Interesting.
> This will leave the device with interrupts masked.
> Do we reset it later?
> Maybe we want a comment explaining why it's done.
> 

Before releasing the device, we first reset it (which is supposed to
deassert the line). And then we restore the config space to the state
the device had before we grabed it. I can add a comment, but the logic
is mandatory (we can't release the line without disabling the source).

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html