Re: [PATCH] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices

Jan Kiszka <jan.kiszka@xxxxxxxxxxx> · Tue, 10 Jan 2012 19:21:01 +0100

On 2012-01-10 19:10, Michael S. Tsirkin wrote:
> On Tue, Jan 10, 2012 at 06:29:51PM +0100, Jan Kiszka wrote:
>> On 2012-01-10 17:17, Michael S. Tsirkin wrote:
>>> On Mon, Jan 09, 2012 at 03:03:00PM +0100, Jan Kiszka wrote:
>>>> PCI 2.3 allows IRQ sources to be disabled generically at device level. This
>>>> enables us to share legacy IRQs of such devices with other host devices
>>>> when passing them to a guest.
>>>>
>>>> The new IRQ sharing feature introduced here is optional, user space has
>>>> to request it explicitly. Moreover, user space can inform us about its
>>>> view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
>>>> interrupt and signaling it if the guest masked it via the virtualized
>>>> PCI config space.
>>>>
>>>> Signed-off-by: Jan Kiszka <jan.kiszka@xxxxxxxxxxx>
>>>> ---
>>>>
>>>> This applies to kvm/master after merging
>>>>
>>>>   PCI: Rework config space blocking services
>>>>   PCI: Introduce INTx check & mask API
>>>>
>>>> from current linux-next/master. I suppose those two will make it into
>>>> 3.3.
>>>>
>>>> To recall the history of it: I tried hard to implement an adaptive
>>>> solution that automatically picks the fastest masking technique whenever
>>>> possible. However, the changes required to the IRQ core subsystem and
>>>> the logic of the device assignment code became so complex and partly
>>>> ugly that I gave up on this. It's simply not worth the pain given that
>>>> legacy PCI interrupts are rarely raised for performance-critical devices
>>>> at such a high rate (kHz...) that you can measure the difference.
>>>>
>>>>  Documentation/virtual/kvm/api.txt |   27 +++++
>>>>  arch/x86/kvm/x86.c                |    1 +
>>>>  include/linux/kvm.h               |    6 +
>>>>  include/linux/kvm_host.h          |    2 +
>>>>  virt/kvm/assigned-dev.c           |  208 +++++++++++++++++++++++++++++++-----
>>>>  5 files changed, 215 insertions(+), 29 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>> index e1d94bf..670015a 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -1159,6 +1159,14 @@ following flags are specified:
>>>>
>>>>  /* Depends on KVM_CAP_IOMMU */
>>>>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU  (1 << 0)
>>>> +/* The following two depend on KVM_CAP_PCI_2_3 */
>>>> +#define KVM_DEV_ASSIGN_PCI_2_3               (1 << 1)
>>>> +#define KVM_DEV_ASSIGN_MASK_INTX     (1 << 2)
>>>> +
>>>> +If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
>>>> +via the PCI-2.3-compliant device-level mask, thus enabling IRQ sharing with other
>>>> +assigned devices or host devices. KVM_DEV_ASSIGN_MASK_INTX specifies the
>>>> +guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
>>>>
>>>>  The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
>>>>  isolation of the device.  Usages not specifying this flag are deprecated.
>>>> @@ -1399,6 +1407,25 @@ The following flags are defined:
>>>>  If datamatch flag is set, the event will be signaled only if the written value
>>>>  to the registered address is equal to datamatch in struct kvm_ioeventfd.
>>>>
>>>> +4.59 KVM_ASSIGN_SET_INTX_MASK
>>>> +
>>>> +Capability: KVM_CAP_PCI_2_3
>>>> +Architectures: x86
>>>> +Type: vm ioctl
>>>> +Parameters: struct kvm_assigned_pci_dev (in)
>>>> +Returns: 0 on success, -1 on error
>>>> +
>>>> +Informs the kernel about the guest's view on the INTx mask.
>>>
>>> A wild idea: since this is guest view of its IRQ,
>>> can this be specified per guest IRQ+id then?
>>> That might be useful to support MSIX mask bit emulation.
>>
>> I do not yet get the full idea: You want some generic
>> KVM_ASSIGN_SET_IRQ_MASK? What will be the use case in the MSI[X] area?
> 
> ATM writes to msi/msix mask bit have no effect for assigned
> devices. For virtio, they are implemented by deassigning irqfd
> which is a very slow operation (rcu write side).
> 
> Instead, when the guest writes to the mask bit, qemu can set/clear it by
> calling this ioctl.

Isn't that effort better invested in proper in-kernel mask emulation for
MSI-X?

> 
>>>
>>>> As long as the
>>>> +guest masks the legacy INTx, the kernel will refrain from unmasking it at
>>>> +hardware level and will not assert the guest's IRQ line. User space is still
>>>> +responsible for applying this state to the assigned device's real config space.
>>>
>>> Can this be made more explicit? You mean writing into 1st
>>> byte of PCI control, right?
>>
>> For sure, I can state this.
>>
>>>
>>>> +To avoid that the kernel overwrites the state user space wants to set,
>>>> +KVM_ASSIGN_SET_INTX_MASK has to be called prior to updating the config space.
>>>
>>> This looks like a strange requirement, could you explain how
>>> this helps avoid races?
>>
>> By declaring the target state of the INTx bit first to the kernel,
>> concurrent changes of the kernel while user space performs a
>> read-modify-write will not lead to an old mask state being written.
> 
> I note you don't require KVM_ASSIGN_SET_INTX_MASK before read though.
> Further, userspace might cache the control byte. If we require
> it not to do it, we probably need to be explicit?

User space can do with the control byte what it wants - the kernel can't
help this anyway. It should just tell the kernel ahead of time what the
next INTx mask state will be. In particular, that prevents the kernel from
setting the mask when user space wants it cleared. The other way around is
actually unproblematic as we check the KVM_ASSIGN_SET_INTX_MASK state before
delivering the IRQ to the guest.
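
As a rough illustration of that ordering (a sketch under assumptions, not
the actual user space code): the backend hooks, struct names and the
logging stubs below are hypothetical. A real VMM would issue the
KVM_ASSIGN_SET_INTX_MASK ioctl in declare_mask() and a config space write
in write_command(). Only the value of PCI_COMMAND_INTX_DISABLE (bit 10 of
the command register) is taken from the PCI headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400 /* bit 10 of the PCI command register */

/* Hypothetical backend hooks, introduced only for this sketch. */
struct intx_backend {
    int (*declare_mask)(void *opaque, bool masked);      /* tell the kernel */
    int (*write_command)(void *opaque, uint16_t value);  /* touch the device */
    void *opaque;
};

/* Handle a guest write to the command register:
 * declare the target INTx mask state first, write config space second. */
static int guest_write_command(const struct intx_backend *be,
                               uint16_t guest_value)
{
    bool masked = (guest_value & PCI_COMMAND_INTX_DISABLE) != 0;
    int ret;

    assert(be != NULL && be->declare_mask != NULL &&
           be->write_command != NULL);

    ret = be->declare_mask(be->opaque, masked);
    if (ret < 0)
        return ret; /* do not touch the device if the kernel was not told */
    return be->write_command(be->opaque, guest_value);
}

/* Recording stubs that stand in for the ioctl and the config write. */
struct call_log {
    int step;          /* running call counter */
    int declare_step;  /* position of the declare call, 0 if never called */
    int write_step;    /* position of the write call, 0 if never called */
    bool masked;       /* mask state that was declared */
    uint16_t written;  /* value that was written */
};

static int log_declare(void *opaque, bool masked)
{
    struct call_log *log = opaque;

    log->declare_step = ++log->step;
    log->masked = masked;
    return 0;
}

static int log_write(void *opaque, uint16_t value)
{
    struct call_log *log = opaque;

    log->write_step = ++log->step;
    log->written = value;
    return 0;
}
```

With the stubs plugged in, the declare call is always recorded before the
write call, which is the ordering requirement discussed above.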

> 
>>> This also raises questions about
>>> what should be done to write a bit unrelated to masking.
>>
>> Just write it, using the INTx state user space maintains. In the worst
>> case, some masking done by the kernel in the meantime will be
>> overwritten, leading to a single spurious but harmless IRQ. That event
>> won't be delivered to the guest unless it is ready to receive it - as we
>> updated the mask state prior to writing to the config space. The point
>> is that the kernel mechanism has to deal with crazy user space clearing
>> the mask for whatever reason again.
> 
> I guess the point is that what we need to avoid is this:
> 
> kernel masks bit
> read
> kernel unmasks bit
> write
> 
> I'm not sure I understand how the text above suggests
> doing this in a race free manner.

User space must not write INTx as read from the hardware but according
to its own view. Then the above is harmless.
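
A minimal sketch of what "according to its own view" could look like,
assuming user space tracks the guest-visible INTx mask in a flag of its
own. The helper name command_to_write() is made up for this example. The
INTx bit read back from the hardware is dropped and replaced by the
guest's view, so a kernel mask/unmask between the read and the write
cannot leak into the written value.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400 /* bit 10 of the PCI command register */

/* Build the command register value for a read-modify-write cycle.
 * hw_value:          command register as read from the device
 * guest_intx_masked: INTx mask state as exposed to the guest */
static uint16_t command_to_write(uint16_t hw_value, bool guest_intx_masked)
{
    /* Discard whatever INTx state the kernel left in the hardware. */
    uint16_t value = hw_value & (uint16_t)~PCI_COMMAND_INTX_DISABLE;

    /* Apply the state user space maintains for the guest instead. */
    if (guest_intx_masked)
        value |= PCI_COMMAND_INTX_DISABLE;
    return value;
}
```

All other command bits pass through unchanged, so this also covers writes
to bits unrelated to masking.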

> 
> 
> A simple way would be to ask userspace to always clear
> this bit on writes. What do you think?

That, or - which sounds more consistent - writing the state that user space
exposes to the guest anyway. That (in addition to the ordering
requirement) should be clearly stated in the doc, I agree.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux