On 20/06/2019 09:57, Laurent Vivier wrote:
> On 20/06/2019 03:46, Suraj Jitindar Singh wrote:
>> If we enter an L1 guest with a pending decrementer exception then this
>> is cleared on guest exit if the guest has written a positive value into
>> the decrementer (indicating that it handled the decrementer exception)
>> since there is no other way to detect that the guest has handled the
>> pending exception and that it should be dequeued. In the event that the
>> L1 guest tries to run a nested (L2) guest immediately after this and the
>> L2 guest decrementer is negative (which is loaded by L1 before making
>> the H_ENTER_NESTED hcall), then the pending decrementer exception
>> isn't cleared and the L2 entry is blocked since L1 has a pending
>> exception, even though L1 may have already handled the exception and
>> written a positive value for its decrementer. This results in a loop of
>> L1 trying to enter the L2 guest and L0 blocking the entry since L1 has
>> an interrupt pending with the outcome being that L2 never gets to run
>> and hangs.
>>
>> Fix this by clearing any pending decrementer exceptions when L1 makes
>> the H_ENTER_NESTED hcall since it won't do this if its decrementer has
>> gone negative, and anyway its decrementer has been communicated to L0
>> in the hdec_expires field and L0 will return control to L1 when this
>> goes negative by delivering an H_DECREMENTER exception.
>>
>> Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests"
>>
>> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@xxxxxxxxx>
>> ---
>>  arch/powerpc/kvm/book3s_hv.c | 11 +++++++++--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 719fd2529eec..4a5eb29b952f 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -4128,8 +4128,15 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
>>
>>      preempt_enable();
>>
>> -    /* cancel pending decrementer exception if DEC is now positive */
>> -    if (get_tb() < vcpu->arch.dec_expires && kvmppc_core_pending_dec(vcpu))
>> +    /*
>> +     * cancel pending decrementer exception if DEC is now positive, or if
>> +     * entering a nested guest in which case the decrementer is now owned
>> +     * by L2 and the L1 decrementer is provided in hdec_expires
>> +     */
>> +    if (kvmppc_core_pending_dec(vcpu) &&
>> +            ((get_tb() < vcpu->arch.dec_expires) ||
>> +             (trap == BOOK3S_INTERRUPT_SYSCALL &&
>> +              kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
>>          kvmppc_core_dequeue_dec(vcpu);
>>
>>      trace_kvm_guest_exit(vcpu);
>>
>
> Patches 2 and 3 tested: I can boot and run an L2 nested guest with qemu
> v4.0.0 and caps-large-decr=on in the case where we previously had a hang.
>
> Tested-by: Laurent Vivier <lvivier@xxxxxxxxxx>

You beat me to it. All works fine on L0, L1, L2.

Tested-by: Cédric Le Goater <clg@xxxxxxxx>

Tested with QEMU 4.1. In this configuration, L2 now runs with the XIVE
(emulated) interrupt mode by default (kernel_irqchip=allowed, ic-mode=dual).

Thanks,

C.
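
For anyone reading the thread later, the dequeue condition in the patch can
be modelled in user space as a pure function. This is only a sketch under
stated assumptions: the enums and the function name should_dequeue_dec() are
hypothetical stand-ins, not kernel symbols, and the parameters merely model
what kvmppc_core_pending_dec(), get_tb(), vcpu->arch.dec_expires, trap and
GPR3 provide in the real code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the kernel's constants. */
enum trap_kind { TRAP_OTHER, TRAP_SYSCALL };       /* models "trap" */
enum hcall_nr  { HCALL_OTHER, HCALL_ENTER_NESTED }; /* models GPR3 */

/*
 * Returns true when a pending L1 decrementer exception should be dequeued
 * on guest exit, following the condition in the patch:
 *   - nothing to do unless an exception is actually pending;
 *   - dequeue if DEC is positive again (timebase has not reached expiry);
 *   - dequeue if L1 is exiting via a syscall whose hcall number requests
 *     nested entry, since L2 then owns the decrementer.
 */
bool should_dequeue_dec(bool dec_pending, uint64_t tb_now,
                        uint64_t dec_expires, enum trap_kind trap,
                        enum hcall_nr gpr3)
{
    if (!dec_pending)
        return false;
    if (tb_now < dec_expires)
        return true;
    return trap == TRAP_SYSCALL && gpr3 == HCALL_ENTER_NESTED;
}
```

The case the patch adds is the last line: with the timebase past dec_expires
the old check alone would leave the exception queued, whereas a syscall exit
carrying the nested-entry hcall now clears it.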