Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips

On 2011-10-24 16:40, Michael S. Tsirkin wrote:
> On Mon, Oct 24, 2011 at 03:43:53PM +0200, Jan Kiszka wrote:
>> On 2011-10-24 15:11, Jan Kiszka wrote:
>>> On 2011-10-24 14:43, Michael S. Tsirkin wrote:
>>>> On Mon, Oct 24, 2011 at 02:06:08PM +0200, Jan Kiszka wrote:
>>>>> On 2011-10-24 13:09, Avi Kivity wrote:
>>>>>> On 10/24/2011 12:19 PM, Jan Kiszka wrote:
>>>>>>>>
>>>>>>>> With the new feature it may be worthwhile, but I'd like to see the whole
>>>>>>>> thing, with numbers attached.
>>>>>>>
>>>>>>> It's not a performance issue, it's a resource limitation issue: With the
>>>>>>> new API we can stop worrying about user space device models consuming
>>>>>>> limited IRQ routes of the KVM subsystem.
>>>>>>>
>>>>>>
>>>>>> Only if those devices are in the same process (or have access to the
>>>>>> vmfd).  Interrupt routing together with irqfd allows you to disaggregate
>>>>>> the device model.  Instead of providing a competing implementation with
>>>>>> new limitations, we need to remove the limitations of the old
>>>>>> implementation.
>>>>>
>>>>> That depends on where we do the cut. Currently we let the IRQ source
>>>>> signal an abstract edge on a pre-allocated pseudo IRQ line. But we
>>>>> cannot build correct MSI-X on top of the current irqfd model as we lack
>>>>> the level information (for PBA emulation). *)
>>>>
>>>>
>>>> I don't agree here. IMO PBA emulation would need to
>>>> clear pending bits on an interrupt status register read.
>>>> So clearing pending bits could be done by an ioctl from QEMU
>>>> while setting them would be done from irqfd.
>>>
>>> How should QEMU know whether the reason for "pending" has been cleared
>>> at the device level if the device is outside the scope of QEMU? This
>>> model only works for PV devices, and only if you agree that spurious
>>> IRQs are OK.
>>>
>>>>
>>>>> So we either need to
>>>>> extend the existing model anyway -- or push per-vector masking back to
>>>>> the IRQ source. In the latter case, it would be a very good chance to
>>>>> give up on limited pseudo GSIs with static routes and do MSI messaging
>>>>> from external IRQ sources to KVM directly.
>>>>> But all those considerations affect different APIs than what I'm
>>>>> proposing here. We will always need a way to inject MSIs in the context
>>>>> of the VM as there will always be scenarios where devices are better run
>>>>> in that very same context, for performance or simplicity or whatever
>>>>> reasons. E.g., I could imagine that one would like to run an
>>>>> emulated IRQ remapper in the hypervisor context rather than
>>>>> "over-microkernelized" in a separate process.
>>>>>
>>>>> Jan
>>>>>
>>>>> *) Realized this while trying to generalize the proposed MSI-X MMIO
>>>>> acceleration for assigned devices to arbitrary device models, vhost-net,
>>>>
>>>> I'm actually working on a QEMU patch to get PBA emulation working correctly.
>>>> I think it's doable with existing irqfd.
>>>
>>> irqfd has no notion of level. You can only communicate a rising edge and
>>> then need a side channel for the state of the edge reason.
>>>
>>>>
>>>>> and specifically vfio.
>>>>
>>>> Interesting. How would you clear the pseudo interrupt level?
>>>
>>> Ideally: not at all (for MSI). If we manage the mask at device level, we
>>> only need to send the message if there is actually something to deliver
>>> to the interrupt controller; masked input events would be lost on real
>>> HW as well.
>>
>> This wouldn't work out nicely either. Rather, we need a combined model:
>>
>> Devices need to maintain the PBA actively, i.e. set & clear the bits
>> themselves rather than relying on the core here (with the core being
>> either QEMU user space or an in-kernel MSI-X MMIO accelerator). The core
>> only checks the PBA if it is about to deliver some message and refrains
>> from doing so if the bit became 0 in the meantime (specifically during
>> the masked period).
>>
>> For QEMU device models, that means no additional IOCTLs, just memory
>> sharing of the PBA, which is required anyway.
> 
> Sorry, I don't understand the above two paragraphs. Maybe I am
> confused by the terminology here. We really only need to check the PBA
> when it's read. Whether the message is delivered only depends on the
> mask bit.

This is what I have in mind:
 - devices set the PBA bit if an MSI message cannot be sent due to the
   mask (*)
 - the core checks & clears the PBA bit on unmask and injects the message
   if the bit was set
 - devices clear the PBA bit if the message reason is resolved before
   unmask (*)

The lines marked (*) differ from the current user space model, where only
the core manipulates the PBA (including clearing bits via a special
function). Basically, the PBA also becomes a communication channel between
device and MSI core. This model works even if core and device run in
different processes, provided they set up the PBA as shared memory.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

