On Tue, Jan 07, 2025, Binbin Wu wrote: > On 1/7/2025 11:24 AM, Chao Gao wrote: > > > Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception > > > being disabled to detect that it's worth trying to post a timer interrupt, but off > > > the top of my head I'm pretty sure that's unnecessary and pointless. The > > Commit 1714a4eb6fb0 gives an explanation: > > > > if only some guests isolated and others not, they would not > > have any benefit from posted timer interrupts, and at the same time lose > > VMX preemption timer fast paths because kvm_can_post_timer_interrupt() > > returns true and therefore forces kvm_can_use_hv_timer() to false. But that's only relevant for setup. Upon expiration, if the target vCPU is in guest mode and APICv is active, then from_timer_fn must be true, which in turn means that hrtimers are in use. > Userspace uses KVM_CAP_X86_DISABLE_EXITS to enable mwait_in_guest or/and > hlt_in_guest for non TDX guest. The code doesn't work for TDX guests. > - Whether mwait in guest is allowed for TDX depends on the cpuid > configuration in TD_PARAMS, not the capability of physical cpu checked by > kvm_can_mwait_in_guest(). > - hlt for TDX is via TDVMCALL, i.e. hlt_in_guest should always be false > for TDX guests. > > For TDX guests, not sure if it worths to fix kvm_{mwait,hlt}_in_guest() > or add special handling (i.e., skip checking mwait/hlt in guest) because > VMX preemption timer can't be used anyway, in order to allow housekeeping > CPU to post timer interrupt for TDX vCPUs when nohz_full is configured > after changing APICv active to true. Yes, but I don't think we need any TDX-specific logic beyond noting that the VMX preemption can't be used. As above, consulting kvm_can_post_timer_interrupt() in the expiration path is silly. And after staring at all of this for a few hours, I think we should ditch the entire lapic_timer.pending approach, because at this point it's doing more harm than good. Tracking pending timer IRQs was primarily done to support delivery of *all* timer interrupts when multiple interrupts were coalesced in the host, e.g. because a vCPU was scheduled out. But that got removed by commit f1ed0450a5fa ("KVM: x86: Remove support for reporting coalesced APIC IRQs"), partly because posted interrupts threw a wrench in things, but also because delivering multiple periodic interrupts in quick succession is problematic in its own right. With the interrupt coalescing logic gone, I can't think of any reason KVM needs to kick the vCPU out to the main vcpu_run() loop just to deliver the interrupt, i.e. pending interrupts before delivering them is unnecessary. E.g. conditioning direct deliverly on apicv_active in the !from_timer_fn case makes no sense, because KVM is going to manually move the interrupt from the PIR to the IRR anyways. I also don't like that the behavior for the posting path is subtly different than !APICv. E.g. if an hrtimer fires while KVM is handling a write to TMICT, KVM will deliver the interrupt if configured to post timer, but not if APICv is disabled, because the latter will increment "pending", and "pending" will be cleared before handling the new TMICT. Ditto for switch APIC timer modes. Unfortunately, there is a lot to disentangle before KVM can directly deliver all APIC timer interupts, as KVM has built up a fair bit of ugliness on top of "pending". E.g. when switching back to the VMX preemption timer (after a blocking vCPU is scheduled back in), KVM leaves the hrtimer active. start_hv_timer() accounts for that by checking lapic_timer.pending, but leaving the hrtimer running in this case is absurd and unnecessarily relies on "pending" being incremented. if (kvm_x86_call(set_hv_timer)(vcpu, ktimer->tscdeadline, &expired)) return false; ktimer->hv_timer_in_use = true; hrtimer_cancel(&ktimer->timer); /* * To simplify handling the periodic timer, leave the hv timer running * even if the deadline timer has expired, i.e. rely on the resulting * VM-Exit to recompute the periodic timer's target expiration. */ if (!apic_lvtt_period(apic)) { /* * Cancel the hv timer if the sw timer fired while the hv timer * was being programmed, or if the hv timer itself expired. */ if (atomic_read(&ktimer->pending)) { cancel_hv_timer(apic); } else if (expired) { apic_timer_expired(apic, false); cancel_hv_timer(apic); } } I'm also not convinced that letting the hrtimer run on a different CPU in the "normal" case is a good idea. KVM explicitly migrates the timer, *except* for the posting case, and so not pinning the hrtimer is likely counter-productive, not to mention confusing. void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) { struct hrtimer *timer; if (!lapic_in_kernel(vcpu) || kvm_can_post_timer_interrupt(vcpu)) return; timer = &vcpu->arch.apic->lapic_timer.timer; if (hrtimer_cancel(timer)) hrtimer_start_expires(timer, HRTIMER_MODE_ABS_HARD); } And if the hrtimer is pinned to the pCPU, then in practice it should be impossible for KVM to post a timer IRQ into a vCPU that's in guest mode. Seapking of which, posting a timer IRQ into a vCPU that's in guest mode is a bit dodgy. I _think_ it's safe, because all of the flows that can be coincident are actually mutually exclusive. E.g. apic_timer_expired()'s call to __kvm_wait_lapic_expire() can't collide with the call from the entry path, as the entry path's invocation happens with IRQs disabled. But all of this makes me wary, so I'd much prefer to clean up, harden, and document the existing code before we try to optimize timer IRQ delivery for more cases. That's especially true for TDX as we're already adding significant compexlity and multiple novel paths for TDX; we don't need more at this time :-)