Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

Paul Mackerras <paulus@xxxxxxxxx> · Sat, 16 Feb 2013 15:51:16 +1100

On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
> On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
> >I have no particular objection to the device control API per se, but
> >I have two objections to using it as the primary interface to the XICS
> >emulation.
> >
> >First, I dislike the magical side-effect where creating a device of a
> >particular type (e.g. MPIC or XICS) automatically attaches it to the
> >interrupt lines of the vcpus.  I prefer an explicit request to do
> >in-kernel interrupt control.
> 
> OK.  This is device-specific behavior, so you could define it
> differently for XICS than MPIC.  I suppose we could change it for
> MPIC as well, to leave an opening for the unlikely case where we'd
> want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.

> How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic "create device" ioctl that could be for any device.

> >Secondly, it means that we are completely abandoning any attempt to
> >define an abstract or generic interface to in-kernel interrupt
> >controller emulations.  Each device will have its own unique set of
> >attribute groups and its own unique userspace code to drive it, with
> >no commonality between them.
> 
> Yes.  I am unconvinced that such an abstraction is well-advised
> (especially after seeing existing "generic" interfaces that are
> clearly APIC-oriented).  This isn't like normal driver interfaces
> where we're abstracting away hardware differences to let generic
> code use a device.  Userspace knows what kind of device it wants,
> and how it wants it to integrate with the rest of the emulated
> system.  We'd have to go out of our way to apply the abstraction on
> *both* ends.  What do we get from that other than a chance that the
> abstraction leaks?  What significant code actually becomes common?
> kvm_set_irq() is just a thin wrapper around the ioctl.

I'd think there could be some code reduction in the live migration
code, but I'd need a qemu hacker to chime in here.

> >> OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
> >> hoping to avoid going through a standardized interface that forces a
> >> global interrupt numberspace.
> >
> >Why?
> 
> The standardized interface doesn't make things any easier (as noted
> above, the caller is already mpic-specific code), and we'd have to
> come up with a scheme for flattening our interrupt numberspace
> (rather than introduce new attribute groups for things like IPI and
> timer interrupts).  It may still be necessary when it comes to
> irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.

> >> How do MSIs get injected?
> >
> >Just like other interrupts - from the point of view of the interrupt
> >controller they're edge-triggered interrupt sources.
> 
> Ah right, I guess this is all set up via hcalls for XICS.
> 
> With MPIC exposing its registers via the device control api,
> everything just works -- the PCI device generates a write to the
> MPIC's memory region, the QEMU MPIC stub sends the write to the
> kernel as for any other MMIO access (this passthrough is also useful
> for debugging), the in-kernel MPIC sees the write to the "generate
> an MSI" register and does its thing.  Compare that to all special
> the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?  If that's so, then why bother with
in-kernel emulation at all?  Once you're back in userspace, it's just
as fast to do the emulation there as in the kernel.

> >There are plenty of bits free in the 64 bits per source that I have
> >allowed.  We can accommodate those things.
> 
> MPIC vector numbers take up 16 of the bits.  The architected
> interrupt level field is 8 bits, though only a handful of values are
> actually needed.  Add a couple binary flags, and it gets pretty
> tight if a third type of interrupt controller starts wanting
> something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.

> >Yes, the names of the bitfields in the ICP state word are
> >XICS-specific, but the concepts are pretty generic - current processor
> >priority, identifier for interrupt awaiting service, pending IPI
> >request priority, pending interrupt request priority.
> 
> We don't have separate concepts of "pending IPI request priority"
> and "pending interrupt request priority".  There can be multiple

Sorry, I meant "pending interrupt request".  You do have that, it's
what you read from the interrupt acknowledge register.

> interrupts awaiting service (or even in service, if different
> priorities).  We have both "current task priority" (which is a
> user-set mask-by-priority register) and the priority of the
> highest-prio in-service interrupt -- which would "current processor
> priorty" be?  Etc.

It would be the current task priority.  I assume MPIC maintains a
16-bit map of the interrupt priorities in service, so that would need
to be added.  (XICS doesn't maintain the stack in hardware, which
would require 256 bits of state, but instead gets software to do
that.)

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html