On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
> On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
> >From: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
> >
> >This adds in-kernel emulation of the XICS (eXternal Interrupt
> >Controller Specification) interrupt controller specified by PAPR,
for
> >both HV and PR KVM guests.
> >
> >This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
> >KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
> >should use in-kernel interrupt controller emulation, but also
takes an
> >argument struct that contains the type of interrupt controller
> >architecture and an optional parameter. Currently only one type
value
> >is defined, that which indicates the XICS architecture.
>
> Would the device config API I posted a couple days ago work for you?
I suppose it could be made to work. It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required. The XICS is operated from
the guest side via hcalls, not via emulated MMIO.
I don't think it makes such an assumption. The MPIC device has
physical registers, so it exposes them, but it also exposes things that
are not physical registers (e.g. the per-IRQ input state). The generic
device control layer leaves interpretation of attributes up to the
device.
I think it would be easier to fit XICS into the device control api
model than to fit MPIC into this model, not to mention what would
happen if we later want to emulate some other type of device -- x86
already has at least one non-irqchip emulated device (i8254).
Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting). Your device API seems to be quite different to that.
I wasn't there either. It's fine to have discussions at such events,
but it should not preclude others from having input, or better ideas
from being considered afterward.
> >The XICS emulation supports up to 1048560 interrupt sources.
> >Interrupt source numbers below 16 are reserved; 0 is used to mean
no
> >interrupt and 2 is used for IPIs. Internally these are
represented in
> >blocks of 1024, called ICS (interrupt controller source) entities,
but
> >that is not visible to userspace.
> >
> >Two other new ioctls allow userspace to control the interrupt
> >sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
> >destination cpu, level/edge sensitivity and pending state of a
range
> >of interrupt sources, creating them if they don't already exist.
The
> >KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range
of
> >interrupt sources (they are required to already exist).
>
> Why is it userspace's job to control these? If you use
KVM_IRQ_PENDING
These are primarily there to support live migration. For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.
In that case, the state it describes is insufficient for MPIC.
We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration,
We don't yet, but would prefer not to assume that it'll never happen.
> for interrupt injection, what if there's a race with the user
changing
> other flags via MMIO? Maybe this isn't an issue with XICS, but
this is
> being presented as a generic API.
They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.
OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping
to avoid going through a standardized interface that forces a global
interrupt numberspace.
How do MSIs get injected?
BTW, do you have any plans regarding irqfd?
> >+struct kvm_irq_sources {
> >+ __u32 irq;
> >+ __u32 nr_irqs;
> >+ __u64 __user *irqbuf;
> >+};
>
> Please no pointers in UAPI -- this would require a compat wrapper
with
> 32-bit user and 64-bit kernel.
Hmmm, you're right. I suppose it will have to be a fixed-size buffer,
It doesn't need to be a fixed size buffer. You can still have
pointers, but they need to be represented as a plain "__u64" with users
casting the pointer.
> >+/* irqbuf entries are laid out like this: */
> >+#define KVM_IRQ_SERVER_SHIFT 0
> >+#define KVM_IRQ_SERVER_MASK 0xffffffffULL
> >+#define KVM_IRQ_PRIORITY_SHIFT 32
> >+#define KVM_IRQ_PRIORITY_MASK 0xff
> >+#define KVM_IRQ_LEVEL_SENSITIVE (1ULL << 40)
> >+#define KVM_IRQ_MASKED (1ULL << 41)
> >+#define KVM_IRQ_PENDING (1ULL << 42)
>
> What does "server" mean? Do you mean "laid out like this for XICS"?
Sorry, I should have made that "destination" rather than "server".
You're right, "server" is confusing, but it just means "where the
interrupt is sent to be handled". It has nothing particularly to do
with "server" computers.
Right, I was aware of the IBM terminology here -- just thought it
wasn't appropriate in a generic interface.
What about interrupt controllers that allow multiple destinations?
More than 256 priorities? Different "levels" of output (normal,
critical, machine check)? Programmable vector numbers? Active
high/low control?
I just don't think irqchip state can be sanely made generic.
> Let's please have a clean separation between what is generic and
what is
> implementation-specific.
I believe that the interface is pretty cleanly generic - the model is
a set of interrupt sources and some per-vcpu state, with priorities to
decide which interrupts get delivered when. That describes the basics
of just about any SMP-capable interrupt controller, including MPIC.
The per-vcpu state isn't even part of this AFAICT. It's an
XICS-specific ONE_REG -- which is fine, but all that's left of the
"generic" API is the get/set sources which is an imperfect match to our
per-IRQ state and it's not clear how an implementation should extend it.
It would be straightforward to map it to the device control api by
having XICS declare an attribute group that corresponds to IRQ sources
(like KVM_DEV_MPIC_GRP_IRQ_ACTIVE) and the attribute data be what
you've defined for kvm_irq_sources. Or if you want to preserve the
"set multiple IRQs at once" approach, you could have an attribute that
takes exactly a kvm_irq_sources struct, though it might be better to
have a generic batched attribute set/get ioctl (as has been proposed
for ONE_REG if the need arises).
MPIC would still need an extra interface for userspace to save and
restore things like the timer registers at live migration time, and
for userspace to configure the base address, MPIC version, etc., of
course.
Yes, which pretty much means we need the device control api anyway --
or MPIC-specific ioctls, which I wanted to avoid.
-Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html