On Wed, Jan 23, 2019 at 08:07:33PM +0100, Cédric Le Goater wrote: > On 1/22/19 5:46 AM, Paul Mackerras wrote: > > On Mon, Jan 07, 2019 at 07:43:12PM +0100, Cédric Le Goater wrote: > >> Hello, > >> > >> On the POWER9 processor, the XIVE interrupt controller can control > >> interrupt sources using MMIO to trigger events, to EOI or to turn off > >> the sources. Priority management and interrupt acknowledgment is also > >> controlled by MMIO in the CPU presenter subengine. > >> > >> PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need > >> special support from the hypervisor to do the same. This is called the > >> XIVE native exploitation mode and today, it can be activated under the > >> PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support > >> and still offers the old interrupt mode interface using a > >> XICS-over-XIVE glue which implements the XICS hcalls. > >> > >> The following series is proposal to add the same support under KVM. > >> > >> A new KVM device is introduced for the XIVE native exploitation > >> mode. It reuses most of the XICS-over-XIVE glue implementation > >> structures which are internal to KVM but has a completely different > >> interface. A set of Hypervisor calls configures the sources and the > >> event queues and from there, all control is done by the guest through > >> MMIOs. > >> > >> These MMIO regions (ESB and TIMA) are exposed to guests in QEMU, > >> similarly to VFIO, and the associated VMAs are populated dynamically > >> with the appropriate pages using a fault handler. This is implemented > >> with a couple of KVM device ioctls. > >> > >> On a POWER9 sPAPR machine, the Client Architecture Support (CAS) > >> negotiation process determines whether the guest operates with a > >> interrupt controller using the XICS legacy model, as found on POWER8, > >> or in XIVE exploitation mode. Which means that the KVM interrupt > >> device should be created at runtime, after the machine as started. > >> This requires extra KVM support to create/destroy KVM devices. The > >> last patches are an attempt to solve that problem. > >> > >> Migration has its own specific needs. The patchset provides the > >> necessary routines to quiesce XIVE, to capture and restore the state > >> of the different structures used by KVM, OPAL and HW. Extra OPAL > >> support is required for these. > > > > Thanks for the patchset. It mostly looks good, but there are some > > more things we need to consider, and I think a v2 will be needed. > >> One general comment I have is that there are a lot of acronyms in this > > code and you mostly seem to assume that people will know what they all > > mean. It would make the code more readable if you provide the > > expansion of the acronym on first use in a comment or whatever. For > > example, one of the patches in this series talks about the "EAS" > > Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry) > > All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and > Linux should be adjusted ... > > > without ever expanding it in any comment or in the patch description, > > and I have forgotten just at the moment what EAS stands for (I just > > know that understanding the XIVE is not eas-y. :) > ah ! yes. But we have great documentation :) > > We pushed some high level description of XIVE in QEMU : > > https://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/ppc/xive.h;h=ec23253ba448e25c621356b55a7777119a738f8e;hb=HEAD > > I should do the same for Linux with a KVM section to explain the > interfaces which do not directly expose the underlying XIVE concepts. > It's better to understand a little what is happening under the hood. > > > Another general comment is that you seem to have written all this > > code assuming we are using HV KVM in a host running bare-metal. > > Yes. I didn't look at the other configurations. I thought that we could > use the kernel_irqchip=off option to begin with. A couple of checks > are indeed missing. Using kernel_irqchip=off would mean that we would not be able to use the in-kernel XICS emulation, which would have a performance impact. We need an explicit capability for XIVE exploitation that can be enabled or disabled on the qemu command line, so that we can enforce a uniform set of capabilities across all the hosts in a migration domain. And it's no good to say we have the capability when all attempts to use it will fail. Therefore the kernel needs to say that it doesn't have the capability in a PR KVM guest or in a nested HV guest. > > However, we could be using PR KVM (either in a bare-metal host or in a > > guest), or we could be doing nested HV KVM where we are using the > > kvm_hv module inside a KVM guest and using special hypercalls for > > controlling our guests. > > Yes. > > It would be good to talk a little about the nested support (offline > may be) to make sure that we are not missing some major interface that > would require a lot of change. If we need to prepare ground, I think > the timing is good. > > The size of the IRQ number space might be a problem. It seems we > would need to increase it considerably to support multiple nested > guests. That said I haven't look much how nested is designed. The current design of nested HV is that the entire non-volatile state of all the nested guests is encapsulated within the state and resources of the L1 hypervisor. That means that if the L1 hypervisor gets migrated, all of its guests go across inside it and there is no extra state that L0 needs to be aware of. That would imply that the VP number space for the nested guests would need to come from within the VP number space for L1; but the amount of VP space we allocate to each guest doesn't seem to be large enough for that to be practical. > > It would be perfectly acceptable for now to say that we don't yet > > support XIVE exploitation in those scenarios, as long as we then make > > sure that the new KVM capability reports false in those scenarios, and > > any attempt to use the XIVE exploitation interfaces fails cleanly. > > ok. That looks the best approach for now. > > > I don't see that either of those is true in the patch set as it > > stands, so that is one area that needs to be fixed. > > > > A third general comment is that the new KVM interfaces you have added > > need to be documented in the files under Documentation/virtual/kvm. > > ok. > > Thanks, > > C. > Paul.