Oh, kvm_arch_vcpu_runnable(), how I love thee. Let me count the ways: 1) We must confirm the observability of the loads of 'power_off', 'pause', and 'irq_lines', as they are used in the condition to stop waiting. They're OK, because the check is done after prepare_to_swait(), which calls set_current_state(), and all wakers do swake_up(). swake_up() results in a try_to_wake_up(), which is designed to pair with set_current_state(). Actually, the smp_mb() in kvm_psci_vcpu_on() is not even necessary. Note, 'irq_lines' is awaken through a VCPU kick, which calls kvm_vcpu_wake_up(). kvm_vcpu_wake_up() uses barriers correctly since 5e0018b3e39e "kvm: Serialize wq active checks in kvm_vcpu_wake_up()". 2) We should consider replacing the check of 'irq_lines' with a check of the VCPU request, IRQ_PENDING. VCPU request IRQ_PENDING and 'irq_lines' are not equivalent. While IRQ_PENDING is always set when 'irq_lines' changes, runnable() only checks if 'irq_lines' is nonzero, meaning a change from nonzero to zero will not awaken a VCPU. This makes sense, because 'level=1' for KVM_IRQ_LINE always means the IRQ is asserted, and 'level=0' always means deasserted. If we intend to check IRQ_PENDING in runnable() at all, then we need to stop making the IRQ_PENDING request when KVM_IRQ_LINE is used to deassert an IRQ. If we stop making the request on deassert, and start checking IRQ_PENDING in runnable(), then the 'irq_lines' check may be removed. 3) We should consider replacing the call of kvm_vgic_vcpu_pending_irq() with a check of the VCPU request, IRQ_PENDING. kvm_vgic_vcpu_pending_irq() iterates the VCPU's "Active or Pending IRQ" list, looking for any pending and enabled IRQ, returning true if it finds one. So, the questions are if every time there is at least one pending and enabled IRQ on the VCPU's list, the VCPU will also have its IRQ_PENDING request bit set, and, conversely, if IRQ_PENDING will ever be set even when there are no pending and enabled IRQs. It appears IRQ_PENDING will be set every time an IRQ is added to the list, thanks to vgic_queue_irq_unlock(), however, while IRQs may be dropped or migrated from the VCPU's list, the IRQ_PENDING request would remain. That said, it's of little consequence for the VCPU to wake up due to an IRQ_PENDING request when there are no longer pending IRQs. Indeed, the same race of a pending IRQ being dropped/migrated immediately prior to the VCPU flushing it's VGIC hardware state is present with the kvm_vgic_vcpu_pending_irq() check as well. Now, if a host IRQ handler sets IRQ_PENDING to inform the guest to wake up and inject an IRQ, then on wake up, while IRQ_PENDING will be set, the IRQ will not yet be in the list. This means kvm_vgic_vcpu_pending_irq(), alone, may not be sufficient to determine when to wake up. It can be fixed, though, by either checking specific pending state in runnable(), like 'irq_lines', or by adding the state to the checks done in kvm_vgic_vcpu_pending_irq() - as is proposed in the patch "KVM: arm/arm64: GICv4: Use pending_last as a scheduling hint". The barrier analysis done for (1) would apply to either fix, as long as the IRQ handler wakes up the VCPU with kvm_vcpu_wake_up() or kvm_vcpu_kick(), which calls kvm_vcpu_wake_up(). 4) We must confirm the compound condition sufficiently determines when a VCPU should be considered runnable. It does, but it's not obvious, as it depends on the contexts of the two callers. Conceptually, runnable() should be "not waiting, or waiting for interrupts and interrupts are pending": !waiting-uninterruptable && (!waiting-for-interrupts || interrupts-pending) More concisely !W_u && (!W_i || I) But, the implementation is only !W_u && I Context1: Called from kvm_vcpu_block() we know W_i=1 !W_u && (!W_i || I) => !W_u && (!(1) || I) => !W_u && I Context2: Called from kvm_vcpu_on_spin() we know (W_u || W_i) = 1, which implies W_i=1 if W_u=0 !W_u && (!W_i || I) => !(0) && (!(1) || I) = I, W_u=0 !(1) && (!W_i || I) = 0, W_u=1 => !W_u && I 5) We must confirm the compound condition is not missing any conditions; all conditions for waiting-uninterruptable and for interrupts-pending must be present. waiting-uninterruptable is power_off || pause which is correct, as there is currently no other way to request uninterruptable waiting, a.k.a sleep. interrupts-pending is !!irq_lines || kvm_vgic_vcpu_pending_irq() which is mostly correct. It is not correct regarding the vPMU. If a VCPU's perf event overflow handler were to fire after the VCPU started waiting, then the wake up done by the kvm_vcpu_kick() call in the handler would do nothing, as no "pmu overflow" state is checked in runnable(). We can fix this by adding and checking pending state, as described in (3), or by deciding to just check IRQ_PENDING in runnable(), and then fix 'irq_lines' instead, as described in (2). ======== Proposal ======== 1) Add a VCPU request to be used by KVM_IRQ_LINE to deassert the line. The new VCPU request will be flagged with KVM_REQUEST_NO_WAKEUP, and the IRQ_PENDING request will not be set on deasserts. 2) Make runnable() more obvious and future proof by adding an explicit check for the 'waiting-for-interrupts' VCPU state. This can be done with a new 'halted' boolean that gets set when emulating WFI. 3) Just check the IRQ_PENDING VCPU request for runnable()'s 'interrupts-pending' check. 4) Clean up all the 'power_off || pause' checks scattered throughout virt/kvm/arm/arm.c by wrapping the condition in a function. === Sorry for TL;DR mail. If the analysis and proposal look reasonable, then I'll send patches. Thanks, drew _______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/mailman/listinfo/kvmarm