Excerpts from Cédric Le Goater's message of March 23, 2021 2:01 am: > On 3/22/21 2:15 PM, Nicholas Piggin wrote: >> Excerpts from Alexey Kardashevskiy's message of March 22, 2021 5:30 pm: >>> >>> >>> On 06/03/2021 02:06, Nicholas Piggin wrote: >>>> In the interest of minimising the amount of code that is run in>>> "real-mode", don't handle hcalls in real mode in the P9 path. >>>> >>>> POWER8 and earlier are much more expensive to exit from HV real mode >>>> and switch to host mode, because on those processors HV interrupts get >>>> to the hypervisor with the MMU off, and the other threads in the core >>>> need to be pulled out of the guest, and SLBs all need to be saved, >>>> ERATs invalidated, and host SLB reloaded before the MMU is re-enabled >>>> in host mode. Hash guests also require a lot of hcalls to run. The >>>> XICS interrupt controller requires hcalls to run. >>>> >>>> By contrast, POWER9 has independent thread switching, and in radix mode >>>> the hypervisor is already in a host virtual memory mode when the HV >>>> interrupt is taken. Radix + xive guests don't need hcalls to handle >>>> interrupts or manage translations. > > Do we need to handle the host-is-a-P9-without-xive case ? I'm not sure really. Is there an intention for OPAL to be able to provide a fallback layer in the worst case? Maybe microwatt grows HV capability before XIVE? > >>>> So it's much less important to handle hcalls in real mode in P9. >>> >>> So acde25726bc6034b (which added if(kvm_is_radix(vcpu->kvm))return >>> H_TOO_HARD) can be reverted, pretty much? >> >> Yes. Although that calls attention to the fact I missed doing >> a P9 h_random handler in this patch. I'll fix that, then I think >> acde2572 could be reverted entirely. >> >> [...] >> >>>> } else { >>>> kvmppc_xive_push_vcpu(vcpu); >>>> trap = kvmhv_load_hv_regs_and_go(vcpu, time_limit, lpcr); >>>> - kvmppc_xive_pull_vcpu(vcpu); >>>> + /* H_CEDE has to be handled now, not later */ >>>> + /* XICS hcalls must be handled before xive is pulled */ >>>> + if (trap == BOOK3S_INTERRUPT_SYSCALL && >>>> + !(vcpu->arch.shregs.msr & MSR_PR)) { >>>> + unsigned long req = kvmppc_get_gpr(vcpu, 3); >>>> >>>> + if (req == H_CEDE) { >>>> + kvmppc_cede(vcpu); >>>> + kvmppc_xive_cede_vcpu(vcpu); /* may un-cede */ >>>> + kvmppc_set_gpr(vcpu, 3, 0); >>>> + trap = 0; >>>> + } >>>> + if (req == H_EOI || req == H_CPPR || >>> >>> else if (req == H_EOI ... ? >> >> Hummm, sure. > > you could integrate the H_CEDE in the switch statement below. Below is in a different file just for the emulation calls. >> >> [...] >> >>>> +void kvmppc_xive_cede_vcpu(struct kvm_vcpu *vcpu) >>>> +{ >>>> + void __iomem *esc_vaddr = (void __iomem *)vcpu->arch.xive_esc_vaddr; >>>> + >>>> + if (!esc_vaddr) >>>> + return; >>>> + >>>> + /* we are using XIVE with single escalation */ >>>> + >>>> + if (vcpu->arch.xive_esc_on) { >>>> + /* >>>> + * If we still have a pending escalation, abort the cede, >>>> + * and we must set PQ to 10 rather than 00 so that we don't >>>> + * potentially end up with two entries for the escalation >>>> + * interrupt in the XIVE interrupt queue. In that case >>>> + * we also don't want to set xive_esc_on to 1 here in >>>> + * case we race with xive_esc_irq(). >>>> + */ >>>> + vcpu->arch.ceded = 0; >>>> + /* >>>> + * The escalation interrupts are special as we don't EOI them. >>>> + * There is no need to use the load-after-store ordering offset >>>> + * to set PQ to 10 as we won't use StoreEOI. >>>> + */ >>>> + __raw_readq(esc_vaddr + XIVE_ESB_SET_PQ_10); >>>> + } else { >>>> + vcpu->arch.xive_esc_on = true; >>>> + mb(); >>>> + __raw_readq(esc_vaddr + XIVE_ESB_SET_PQ_00); >>>> + } >>>> + mb(); >>> >>> >>> Uff. Thanks for cut-n-pasting the comments, helped a lot to match this c >>> to that asm! >> >> Glad it helped. >>>> +} > > I had to do the PowerNV models in QEMU to start understanding that stuff ... > >>>> +EXPORT_SYMBOL_GPL(kvmppc_xive_cede_vcpu); >>>> + >>>> /* >>>> * This is a simple trigger for a generic XIVE IRQ. This must >>>> * only be called for interrupts that support a trigger page >>>> @@ -2106,6 +2140,32 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type) >>>> return 0; >>>> } >>>> >>>> +int kvmppc_xive_xics_hcall(struct kvm_vcpu *vcpu, u32 req) >>>> +{ >>>> + struct kvmppc_vcore *vc = vcpu->arch.vcore; >>> >>> >>> Can a XIVE enabled guest issue these hcalls? Don't we want if >>> (!kvmppc_xics_enabled(vcpu)) and >>> if (xics_on_xive()) here, as kvmppc_rm_h_xirr() have? Some of these >>> hcalls do write to XIVE registers but some seem to change >>> kvmppc_xive_vcpu. Thanks, >> >> Yes I think you're right, good catch. I'm not completely sure about all >> the xive and xics modes but a guest certainly can make any kind of hcall >> it likes and we have to sanity check it. > > Yes. > >> We want to take the hcall here (in replacement of the real mode hcalls) >> with the same condition. So it would be: >> >> if (!kvmppc_xics_enabled(vcpu)) >> return H_TOO_HARD; > > Yes. > > This test covers the case in which a vCPU does XICS hcalls without QEMU > having connected the vCPU to a XICS ICP. The ICP is the KVM XICS device > on P8 or XICS-on-XIVE on P9. It catches QEMU errors when the interrupt > mode is negotiated, we don't want the OS to do XICS hcalls after having > negotiated the XIVE interrupt mode. Okay. > It's different for the XIVE hcalls (when running under XICS) because they > are all handled in QEMU. XIVE guest hcalls running on XICS host? >> if (!xics_on_xive()) >> return H_TOO_HARD; > > I understand that this code is only called on P9 and with translation on. Yes. > On P9, we could have xics_on_xive() == 0 if XIVE is disabled at compile > time or with "xive=off" at boot time. But guests should be supported. > I don't see a reason to restrict the support even if these scenarios > are rather unusual if not very rare. > > on P10, it's the same but since we don't have the XICS emulation layer > in OPAL, the host will be pretty useless. We don't care. > > Since we are trying to handle hcalls, this is L0 and it can not be called > for nested guests, which would be another case of xics_on_xive() == 0. > We don't care either. Okay so no xics_on_xive() test. I'll change that. Thanks, Nick