Sorry for sending the reply twice. The former reply got rejected since I forgot to turn on the plain text email setting. On Tue, Sep 12, 2023 at 10:07 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Tue, Sep 12, 2023, Haitao Shan wrote: > > This issue exists in kernel version 6.3-rc1 or above. The issue is > > introduced by the commit 8e6ed96cdd50 ("KVM: x86: fire timer when it is > > migrated and expired, and in oneshot mode"). The issue occurs on Intel > > platform which APIC virtualization and posted interrupt processing. > > I think the bug was actually introduced by: > > 967235d32032 ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC") Thanks for pointing this out. I know commit 8e6ed96cdd50 is only a trigger. But I did not go one step further and find out where the bug is coming from. > > Fixing the "deadline <= 0" handling just made it much easier to be hit. E.g. if > the deadline was '1' during restore, set_target_expiration() would set tscdeadline > to T1+(1*N), where T1 is the current TSC and N is the multipler to get from nanoseconds > to cycles. start_sw_tscdeadline() (or vmx_set_hv_timer()) would then reread the > TSC (call it T2), see T2 > T1+(1*N), and mark the timer as expired. > > > The issue is first discovered when running the Android Emulator which > > is based on QEMU 2.12. I can reproduce the issue with QEMU 8.0.4 in > > Debian 12. > > The above is helpful as extra context, but repeating "This issue" and "The issue" > over and over without ever actually describing what the issue actualy is makes it > quite difficult to understand what is actually being fixed. Got it. I will rewrite the whole commit message for v2. > > > With the above commit, the timer gets fired immediately inside the > > KVM_SET_LAPIC call when loading the snapshot. On such Intel platforms, > > this eventually leads to setting the corresponding PIR bit. However, > > the whole PIR bits get cleared later in the same KVM_SET_LAPIC call. > > Missing lapic timer interrupt freeze the guest kernel. > > Please phrase changelogs as commands and state what is actually being changed. > Again, the context on what is broken is helpful, but the changelog really, really > needs to state what is being changed. Will do. > > > Signed-off-by: Haitao Shan <hshan@xxxxxxxxxx> > > --- > > arch/x86/kvm/lapic.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > > index a983a16163b1..6f73406b875a 100644 > > --- a/arch/x86/kvm/lapic.c > > +++ b/arch/x86/kvm/lapic.c > > @@ -2977,14 +2977,14 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) > > apic_update_lvtt(apic); > > apic_manage_nmi_watchdog(apic, kvm_lapic_get_reg(apic, APIC_LVT0)); > > update_divide_count(apic); > > - __start_apic_timer(apic, APIC_TMCCT); > > - kvm_lapic_set_reg(apic, APIC_TMCCT, 0); > > kvm_apic_update_apicv(vcpu); > > if (apic->apicv_active) { > > static_call_cond(kvm_x86_apicv_post_state_restore)(vcpu); > > static_call_cond(kvm_x86_hwapic_irr_update)(vcpu, apic_find_highest_irr(apic)); > > static_call_cond(kvm_x86_hwapic_isr_update)(apic_find_highest_isr(apic)); > > } > > + __start_apic_timer(apic, APIC_TMCCT); > > + kvm_lapic_set_reg(apic, APIC_TMCCT, 0); > > I don't think this is the ordering we want. It currently works, but it subtly > "relies" on a few things: > > 1. That vmx_deliver_posted_interrupt() never "fails" when APICv is enabled, > i.e. never puts the interrupt in the IRR instead of the PIR. > > 2. The SVM, a.k.a. AVIC, doesn't ever sync from the IRR to a separate "hardware" > virtual APIC, because unlike VMX, SVM does set the bit in the IRR. > > I doubt #2 will ever change simply because that's tied to how AVIC works, and #1 > shouldn't actually break anything since the fallback path in vmx_deliver_interrupt() > needs to be self-sufficient, but I don't like that the code syncs from the IRR and > _then_ potentially modifies the IRR. > > I also don't like doing additional APIC state restoration _after_ invoking the > post_state_restore() hook. Updating APICv in the middle of the restore flow is > going to be brittle and difficult to maintain, e.g. it won't be obvious what > needs to go before and what needs to go after. > > IMO, vmx_apicv_post_state_restore() is blatantly broken. It is most definitely > not doing "post state restore" stuff, it's simply purging state, i.e. belongs in > a "pre state restore" hook. > > So rather than shuffle around the timer code, I think we should instead add yet > another kvm_x86_ops hook, e.g. apicv_pre_state_restore(), and have initialize > the PI descriptor there. > > Aha! And I think the new apicv_pre_state_restore() needs to be invoked even if > APICv is not active, because I don't see anything that purges the PIR when APICv > is enabled. VMX's APICv doesn't have many inhibits that can go away, and I > highly doubt userspace will restore into a vCPU with pending posted interrupts, > so in practice this is _extremely_ unlikely to be problematic. But it's still > wrong. Thanks for sharing what you would like to fix the bug. I will write a v2 for that. Actually, I am sorry that I forgot to add RFC to the title, as I personally did not think the proposed fix looks clean. I am surprised that apic_post_state_restore actually clears the whole PIR which looks like "resetting" instead of "restoring". However, I am not sure whether this is the exact intended behavior and code sequence. If it is intended, __restart_apic_timer should defer its interrupt delivery action after apic_post_state_restore (something like raising a request for updating PIR when vcpu_enter_guest). I will work on v2 following your suggestion. Thanks! -- Haitao @Google