Excerpts from Alexey Kardashevskiy's message of March 23, 2021 8:13 pm: > > > On 23/03/2021 12:02, Nicholas Piggin wrote: >> irq_work's use of the DEC SPR is racy with guest<->host switch and guest >> entry which flips the DEC interrupt to guest, which could lose a host >> work interrupt. >> >> This patch closes one race, and attempts to comment another class of >> races. >> >> Signed-off-by: Nicholas Piggin <npiggin@xxxxxxxxx> >> --- >> arch/powerpc/kvm/book3s_hv.c | 15 ++++++++++++++- >> 1 file changed, 14 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c >> index 1f38a0abc611..989a1ff5ad11 100644 >> --- a/arch/powerpc/kvm/book3s_hv.c >> +++ b/arch/powerpc/kvm/book3s_hv.c >> @@ -3745,6 +3745,18 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit, >> if (!(vcpu->arch.ctrl & 1)) >> mtspr(SPRN_CTRLT, mfspr(SPRN_CTRLF) & ~1); >> >> + /* >> + * When setting DEC, we must always deal with irq_work_raise via NMI vs >> + * setting DEC. The problem occurs right as we switch into guest mode >> + * if a NMI hits and sets pending work and sets DEC, then that will >> + * apply to the guest and not bring us back to the host. >> + * >> + * irq_work_raise could check a flag (or possibly LPCR[HDICE] for >> + * example) and set HDEC to 1? That wouldn't solve the nested hv >> + * case which needs to abort the hcall or zero the time limit. >> + * >> + * XXX: Another day's problem. >> + */ >> mtspr(SPRN_DEC, vcpu->arch.dec_expires - tb); >> >> if (kvmhv_on_pseries()) { >> @@ -3879,7 +3891,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit, >> vc->entry_exit_map = 0x101; >> vc->in_guest = 0; >> >> - mtspr(SPRN_DEC, local_paca->kvm_hstate.dec_expires - tb); >> + set_dec_or_work(local_paca->kvm_hstate.dec_expires - tb); > > > set_dec_or_work() will write local_paca->kvm_hstate.dec_expires - tb - 1 > to SPRN_DEC which is not exactly the same, is this still alright? > > I asked in v3 but it is probably lost :) Oh I did see that then forgot. It will write dec_expires - tb, then it will write 1 if it found irq_work was pending. The change is intentional to fixes one of the lost irq_work races. Thanks, Nick