Re: [kvmarm] [RFC PATCH 0/3] KVM: ARM: Get rid of hardcoded VGIC addresses

On Fri, 2012-10-26 at 14:39 +0200, Jan Kiszka wrote:

> But we are just talking about sending messages from A to B or soldering
> an input to an output pin. That's pretty generic. Give each output event
> a virtual IRQ number and define where its output "line" should be linked
> to (input pin of target controller). All what will be specific are the
> IDs of those controllers.

Hrm, you seem to be saying something very different from Paolo here,
unless it's just very confused terminology.

So let's look at the powerpc "pseries" case. Things like embedded etc.
might be quite different.

We have essentially two "outputs" here. One is qemu itself shooting
interrupts (emulated devices, virtio, etc.). This is an ioctl that takes
a global interrupt number, so it goes directly to the source controller,
which then uses its internal logic to send it to the presentation
controller in ways that are entirely implementation specific.

The specific source controller is located using the top bits of the
global interrupt number (the BUID). When we create source controllers,
we pass as argument to the ioctl the BUID for that source controller and
the number of interrupts it handles.

The other "output" is irqfd for kernel-originated events. Here I assume
there's an in-kernel way to directly call a function rather than queue
something for qemu to consume later; anything else would be horribly
wasteful. Here too, what we need is a global interrupt number, so we can
find the source controller by BUID and shoot it the interrupt.

So that's the only case I see where we need an association of some kind,
which is irqfd -> global number. I don't see where the "MSIs" that Paolo
keeps talking about come into play. User space (emulated) MSIs are dealt
with entirely within qemu, and MSIs from VFIO end up as irqfd.

Finally there is the "routing" between a given interrupt source (an
entry in the source controller state table) and the target processor
(the corresponding presentation controller).

That routing is purely a field in the source controller state table,
which sits there along with the interrupt priority and a few state bits.
(We don't need to deal with level/edge because of the way the ICS works:
we just say at the time of triggering an interrupt whether it's a level
set, a level reset, or a message, and it will do the right thing.)

This field is accessed (programmed) by the guest using a firmware
interface that is implemented in the kernel part of KVM. It's a platform
specific API and it accesses the source controller (it's implemented
there, really). I don't see where any generic API would make sense here,
other than maybe adding useless bloat.

The only place where qemu might "see" that stuff is for migration where
it needs to save all the state of all the sources and restore it on the
target.

The actual communication between source controllers and presentation
controllers is also entirely platform specific. It follows a somewhat
specified protocol (we mimic what the HW actually does) and here too, I
see no room for anything generic.

> Of course, all that provided you do their emulation in kernel space. For
> x86, that even makes sense when the IRQ sources are in user space as the
> guest may still have to interact during IRQ delivery with IOAPIC, thus
> we save some costly heavy-weight exits when putting it in the kernel.

We have a way to lower that cost. Since the interaction with the
presentation controller is done by hypervisor calls, we handle them
directly in real mode within the guest MMU context unless some
exceptional condition is hit (such as the need to trigger a resend from
one of the source controllers or an interrupt rejection).

> > 
> > Remains the "routing" between source of "events" and actual "inputs" to
> > a source controller.
> > 
> > This too doesn't seem totally obvious to generalize. For example an
> > embedded platform with a bunch of cascaded dumb interrupt controllers
> > doesn't have a concept of a flat number space in HW, an interrupt
> > "input" to be identified properly, needs to identify the controller and
> > the interrupt within that controller. However, within KVM/qemu, it's
> > pretty easy to assign to each controller a number and by collating the
> > two, get some kind of flat space, though it's not arbitrary and the
> > routing is thus fairly constrained if not totally fixed.
> 
> IRQ routing entry:
>  - virq number ("gsi")
>  - type (controller ID, MSI, whatever you like)

What is "controller ID"? That doesn't mean anything to me. In our case,
the specific source controller is known from the virq number (the top
bits of it, basically).

>  - some flags (to extend it)
>  - type-specific data (MSI message, controller input pin, etc.)

I don't really understand that business about MSIs. I suppose it has to
do with the way you do old-style device assignment? Either MSIs come
from virtual/emulated devices, in which case they are a qemu fiction and
qemu just sends us an ioctl with the virq number, or they come from real
devices, in which case they are set up normally by the host kernel using
host kernel MSI addresses, and we catch them via irqfd (or some platform
specific bypass that we might implement in the future).

We do have some per-interrupt data I mentioned earlier: the target
presentation controller (known as the server ID, basically the HW CPU
number of the target) and a few bits of state. We only ever access it
from qemu for migration, though.

> And there can be multiple entries with the same virq, thus you can
> deliver to multiple targets. I bet you can model quite a lot of your
> platform specific routing this way. I'm not saying our generic code will
> work out of the box, but at least the interfaces and concepts are there.

I don't see how we can model anything using that. Qemu doesn't actually
look at or modify any of that state other than during migration anyway.
We do have a concept of delivery to multiple targets via a "link"
mechanism, which allows an interrupt to bounce to another target within
a "ring" if the original target is busy (that's 2 bits of state), but
this too is configured via firmware interfaces that are handled entirely
in the kernel, not in qemu.

> > In the pseries case, the global number is split in two bit fields, the
> > BUID identifying the specific source controller and the source within
> > that controller. Here too it's fairly fixed though. So the ioctl we use
> > to create a source controller in the kernel takes the BUID as an
> > argument, and from there the kernel will "find" the right source
> > controller based solely on the interrupt number.
> > 
> > So basically on one side we have a global interrupt number that
> > identifies an "input", I assume that's what x86 calls a GSI ?
> 
> Right. The virtual IRQ numbers we call "GSI" is partially occupied by
> the actual x86-GSIs (0..n, with n=23 so far), directed to the IOAPIC and
> PIC there, and then followed by IRQs that are mapped on MSI messages.
> But that's just how we _use_ it on x86, not how it has to work for other
> archs.
> 
> > 
> > Remains how to associate the various sources of interrupts to that
> > 'global number'... and that is fairly specific to each source type isn't
> > it ?
> > 
> > In our current powerpc code, the emulated devices toggle the qirq which
> > ends up shooting an ioctl to set/reset or "message" (for MSIs) the
> > corresponding global interrupt. The mapping is established entirely
> > within qemu, we just tell the kernel to trigger a given interrupt.
> > 
> > We haven't really sorted vhost out yet so I'm not sure how that will
> > work out but the idea would be to have an ioctl to associate an eventfd
> > or whatever vhost uses as interrupt "outputs" with a global interrupt
> > number.
> 
> KVM_IRQFD is already there. It associates an irqfd file descriptor with
> a virtual IRQ. Once that triggers, the IRQ routing table is used to
> define the actual interrupt type and destination chip to use, see above.

We only need the irqfd -> virtual irq association. The rest doesn't make
sense to me (the "routing table" bit).

> > For pass-through, currently our VFIO is dumb, interrupts get to qemu
> > which then shoots them back to the kernel using the standard qirq stuff
> > used by emulated devices. Here I suppose we would want something similar
> > to vhost to associate the VFIO irq fd with a "global number".
> > 
> > Is that what the existing ioctl's provide ? Their semantics aren't
> > totally obvious to me.
> 
> Provided you want to trigger a MSI message, you first need to register
> it via kvm_irqchip_add_msi_route (will trigger KVM_SET_GSI_ROUTING).

Why? Again, I don't get it.

> That will give you a virtual IRQ number which can be associate with an
> irqfd file descriptor as explained above (KVM_IRQFD). 

But virq numbers are entirely under qemu control. Qemu creates the
source controllers and assigns the virq numbers to the devices.

I really don't quite get how that concept of "GSI routing" means
anything for us. We certainly don't want to have the kernel return a
virq number to qemu.

That whole business with MSIs makes very little sense to me.

> But you may also
> create a different kind of routing table entry if MSI is not all you
> need to inject via irqfd. Could be a plain IRQ line as well, routed to a
> specific in-kernel IRQ controller model.

Sorry, I must be totally stupid or something but I don't understand.
Doesn't make sense to me.

> > Note that for pass-through at least, and possibly for vhost, we'd like
> > to actually totally bypass the irqfd & eventfd stuff for performance
> > reasons. At least for VFIO, if we are going to get the max performance
> > out of it, we need to take all generic code out of the picture. IE. If
> > the interrupts are routed to the physical CPU where the guest is
> > running, we want to be able to catch and distribute the interrupts to
> > the guest entirely within guest context, ie, with KVM arch specific low
> > level code that runs in "real mode" (ie MMU off) without context
> > switching the MMU back to the host, which on POWER is fairly costly.
> > 
> > That means that at least the association between a guest global
> > interrupt number and a host global interrupt number for pass-through
> > will be something that goes entirely through arch specific code path. We
> > might still be able to use generic APIs to establish it if they are
> > suitable though.
> 
> The same will happen on x86: direct injection to a target VCPU. Maybe
> again a topic for our IRQ routing table, just with specialized target types.

Ben.


