On Mon, Oct 24, 2011 at 05:00:27PM +0200, Jan Kiszka wrote:
> On 2011-10-24 16:40, Michael S. Tsirkin wrote:
> > On Mon, Oct 24, 2011 at 03:43:53PM +0200, Jan Kiszka wrote:
> >> On 2011-10-24 15:11, Jan Kiszka wrote:
> >>> On 2011-10-24 14:43, Michael S. Tsirkin wrote:
> >>>> On Mon, Oct 24, 2011 at 02:06:08PM +0200, Jan Kiszka wrote:
> >>>>> On 2011-10-24 13:09, Avi Kivity wrote:
> >>>>>> On 10/24/2011 12:19 PM, Jan Kiszka wrote:
> >>>>>>>>
> >>>>>>>> With the new feature it may be worthwhile, but I'd like to see
> >>>>>>>> the whole thing, with numbers attached.
> >>>>>>>
> >>>>>>> It's not a performance issue, it's a resource limitation issue:
> >>>>>>> With the new API we can stop worrying about user space device
> >>>>>>> models consuming limited IRQ routes of the KVM subsystem.
> >>>>>>>
> >>>>>>
> >>>>>> Only if those devices are in the same process (or have access to
> >>>>>> the vmfd). Interrupt routing together with irqfd allows you to
> >>>>>> disaggregate the device model. Instead of providing a competing
> >>>>>> implementation with new limitations, we need to remove the
> >>>>>> limitations of the old implementation.
> >>>>>
> >>>>> That depends on where we do the cut. Currently we let the IRQ
> >>>>> source signal an abstract edge on a pre-allocated pseudo IRQ line.
> >>>>> But we cannot build correct MSI-X on top of the current irqfd
> >>>>> model as we lack the level information (for PBA emulation). *)
> >>>>
> >>>> I don't agree here. IMO PBA emulation would need to clear pending
> >>>> bits on interrupt status register read. So clearing pending bits
> >>>> could be done by ioctl from qemu while setting them would be done
> >>>> from irqfd.
> >>>
> >>> How should QEMU know if the reason for "pending" has been cleared
> >>> at device level if the device is outside the scope of QEMU? This
> >>> model only works for PV devices when you agree that spurious IRQs
> >>> are OK.
> >>>
> >>>>
> >>>>> So we either need to extend the existing model anyway -- or push
> >>>>> per-vector masking back to the IRQ source. In the latter case, it
> >>>>> would be a very good chance to give up on limited pseudo GSIs
> >>>>> with static routes and do MSI messaging from external IRQ sources
> >>>>> to KVM directly.
> >>>>> But all those considerations affect different APIs than what I'm
> >>>>> proposing here. We will always need a way to inject MSIs in the
> >>>>> context of the VM as there will always be scenarios where devices
> >>>>> are better run in that very same context, for performance or
> >>>>> simplicity or whatever reasons. E.g., I could imagine that one
> >>>>> would like to execute an emulated IRQ remapper in the hypervisor
> >>>>> context rather than "over-microkernelized" in a separate process.
> >>>>>
> >>>>> Jan
> >>>>>
> >>>>> *) Realized this while trying to generalize the proposed MSI-X
> >>>>> MMIO acceleration for assigned devices to arbitrary device
> >>>>> models, vhost-net,
> >>>>
> >>>> I'm actually working on a qemu patch to get PBA emulation working
> >>>> correctly. I think it's doable with existing irqfd.
> >>>
> >>> irqfd has no notion of level. You can only communicate a rising
> >>> edge and then need a side channel for the state of the edge reason.
> >>>
> >>>>
> >>>>> and specifically vfio.
> >>>>
> >>>> Interesting. How would you clear the pseudo interrupt level?
> >>>
> >>> Ideally: not at all (for MSI).
> >>> If we manage the mask at device level, we only need to send the
> >>> message if there is actually something to deliver to the interrupt
> >>> controller, and masked input events would be lost on real HW as
> >>> well.
> >>
> >> This wouldn't work out nicely either. We rather need a combined
> >> model:
> >>
> >> Devices need to maintain the PBA actively, i.e. set & clear the bits
> >> themselves and not rely on the core here (with the core being either
> >> QEMU user space or an in-kernel MSI-X MMIO accelerator). The core
> >> only checks the PBA if it is about to deliver some message and
> >> refrains from doing so if the bit became 0 in the meantime
> >> (specifically during the masked period).
> >>
> >> For QEMU device models, that means no additional IOCTLs, just memory
> >> sharing of the PBA which is required anyway.
> >
> > Sorry, I don't understand the above two paragraphs. Maybe I am
> > confused by terminology here. We really only need to check PBA when
> > it's read. Whether the message is delivered only depends on the mask
> > bit.
>
> This is what I have in mind:
> - devices set PBA bit if MSI message cannot be sent due to mask (*)
> - core checks & clears PBA bit on unmask, injects message if bit was set
> - devices clear PBA bit if message reason is resolved before unmask (*)

OK, but practically, when exactly does the device clear PBA?

> The marked (*) lines differ from the current user space model where
> only the core does PBA manipulation (including clearance via a special
> function). Basically, the PBA becomes a communication channel also
> between device and MSI core. And this model also works if core and
> device run in different processes, provided they set up the PBA as
> shared memory.
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
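
To make the three steps above concrete, here is a minimal user-space
sketch in C of the shared-PBA protocol. Everything in it is illustrative
only: msix_state, inject_msi and the bit helpers are hypothetical
stand-ins, not existing QEMU or KVM interfaces, and the ordering between
a mask update and the PBA recheck is glossed over (a real implementation
would have to re-check the PBA after publishing a mask change to close
that race).

/*
 * Sketch of the shared-PBA protocol, assuming the PBA is mapped as
 * shared memory between the device model and the MSI core. The lines
 * marked (*) are the device-side manipulations that differ from the
 * current user space model.
 */
#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define BIT_WORD(n)   ((n) / BITS_PER_LONG)
#define BIT_MASK(n)   (1UL << ((n) % BITS_PER_LONG))

struct msix_state {
    unsigned long *pba;   /* pending bit array, shared memory */
    unsigned long *mask;  /* per-vector mask bits from the MSI-X table */
};

/* Hypothetical delivery hook: send the MSI message for this vector. */
static void inject_msi(unsigned int vec)
{
    printf("MSI injected for vector %u\n", vec);
}

/* Device side: raise a vector. While masked, only mark it pending (*). */
static void device_notify(struct msix_state *s, unsigned int vec)
{
    if (__atomic_load_n(&s->mask[BIT_WORD(vec)], __ATOMIC_ACQUIRE) &
        BIT_MASK(vec))
        __atomic_fetch_or(&s->pba[BIT_WORD(vec)], BIT_MASK(vec),
                          __ATOMIC_SEQ_CST);
    else
        inject_msi(vec);
}

/* Device side: the interrupt reason went away before unmask (*). */
static void device_retract(struct msix_state *s, unsigned int vec)
{
    __atomic_fetch_and(&s->pba[BIT_WORD(vec)], ~BIT_MASK(vec),
                       __ATOMIC_SEQ_CST);
}

/* Core side: guest cleared the mask bit; inject iff still pending. */
static void core_unmask(struct msix_state *s, unsigned int vec)
{
    unsigned long old = __atomic_fetch_and(&s->pba[BIT_WORD(vec)],
                                           ~BIT_MASK(vec),
                                           __ATOMIC_SEQ_CST);

    if (old & BIT_MASK(vec))
        inject_msi(vec);
}

int main(void)
{
    unsigned long pba[1] = { 0 }, mask[1] = { 0 };
    struct msix_state s = { pba, mask };

    mask[0] |= BIT_MASK(3);     /* guest masks vector 3 */
    device_notify(&s, 3);       /* nothing delivered, PBA bit 3 set */
    mask[0] &= ~BIT_MASK(3);    /* guest unmasks vector 3 */
    core_unmask(&s, 3);         /* bit was still set -> MSI injected */

    mask[0] |= BIT_MASK(3);
    device_notify(&s, 3);       /* pending again */
    device_retract(&s, 3);      /* reason resolved before unmask (*) */
    mask[0] &= ~BIT_MASK(3);
    core_unmask(&s, 3);         /* bit already 0 -> no spurious MSI */
    return 0;
}

The atomic fetch-and-clear in core_unmask is what implements the
"refrains from doing so if the bit became 0 in the meantime" rule: if
the device retracted the pending bit during the masked period, the old
value is 0 and no spurious message goes out, without any extra ioctl
between device and core.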