On 1/30/19 6:40 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:51:05PM +0100, Cédric Le Goater wrote:
>>>>> Another general comment is that you seem to have written all this
>>>>> code assuming we are using HV KVM in a host running bare-metal.
>>>>
>>>> Yes. I didn't look at the other configurations. I thought that we could
>>>> use the kernel_irqchip=off option to begin with. A couple of checks
>>>> are indeed missing.
>>>
>>> Using kernel_irqchip=off would mean that we would not be able to use
>>> the in-kernel XICS emulation, which would have a performance impact.
>>
>> yes. But it is not supported today. Correct ?
>
> Not correct, it has been working for years, and works in v5.0-rc1 (I
> just tested it), at both L0 and L1.

Please see my other email for the test I did.

>>> We need an explicit capability for XIVE exploitation that can be
>>> enabled or disabled on the qemu command line, so that we can enforce a
>>> uniform set of capabilities across all the hosts in a migration
>>> domain. And it's no good to say we have the capability when all
>>> attempts to use it will fail. Therefore the kernel needs to say that
>>> it doesn't have the capability in a PR KVM guest or in a nested HV
>>> guest.
>>
>> OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability
>> for future use.
>
> That's not what I meant. Why do we need that? I meant that querying
> the new KVM_CAP_PPC_IRQ_XIVE capability should return 0 if we are in a
> guest. It should only return 1 if we are running bare-metal on a P9.

OK. I guess I first need to understand how the nested guest uses the
KVM IRQ device. That is a question in another email thread.

>>>>> However, we could be using PR KVM (either in a bare-metal host or in a
>>>>> guest), or we could be doing nested HV KVM where we are using the
>>>>> kvm_hv module inside a KVM guest and using special hypercalls for
>>>>> controlling our guests.
>>>>
>>>> Yes.
>>>>
>>>> It would be good to talk a little about the nested support (offline
>>>> maybe) to make sure that we are not missing some major interface that
>>>> would require a lot of change. If we need to prepare ground, I think
>>>> the timing is good.
>>>>
>>>> The size of the IRQ number space might be a problem. It seems we
>>>> would need to increase it considerably to support multiple nested
>>>> guests. That said, I haven't looked much at how nested is designed.
>>>
>>> The current design of nested HV is that the entire non-volatile state
>>> of all the nested guests is encapsulated within the state and
>>> resources of the L1 hypervisor. That means that if the L1 hypervisor
>>> gets migrated, all of its guests go across inside it and there is no
>>> extra state that L0 needs to be aware of. That would imply that the
>>> VP number space for the nested guests would need to come from within
>>> the VP number space for L1; but the amount of VP space we allocate to
>>> each guest doesn't seem to be large enough for that to be practical.
>>
>> If the KVM XIVE device had some information on the max number of CPUs
>> provisioned for the guest, we could optimize the VP allocation.
>
> The problem is that we might have 1000 guests running under L0, or we
> might have 1 guest running under L0 and 1000 guests running under it,
> and we have no way to know which situation to optimize for at the
> point where an L1 guest starts. If we had an enormous VP space then
> we could just give each L1 guest a large amount of VP space and solve
> it that way; but we don't.

There are some ideas to increase our VP space size. Using multiple
blocks per XIVE chip in skiboot is one, I think. It's not an obvious
change. Also, XIVE2 will add more bits to the NVT index, so we will be
free to allocate more at once when P10 is available.

On the same topic, maybe we could move the VP allocator from skiboot
to KVM, allocate the full VP space at the KVM level, and let KVM do
the VP segmentation.
Anyhow, I think that if we knew how many VPs we need to provision for
when the KVM XIVE device is created, we would make better use of the
available space. Shouldn't we?

Thanks,

C.