Split the single per-CPU lock protecting the per-CPU wakeup vCPU list into
a sched_in lock and a sched_out lock to break the possible circular locking
dependency reported by lockdep.

Lockdep complains about a "possible circular locking dependency detected":

  Chain exists of:
    &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
                                lock(&rq->__lock);
                                lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
   lock(&p->pi_lock);

   *** DEADLOCK ***

In "path irq", i.e. sysvec_kvm_posted_intr_wakeup_ipi() -->
pi_wakeup_handler() --> kvm_vcpu_wake_up() --> try_to_wake_up(), the lock
order is:

  &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock

In "path sched_out", i.e. vcpu_block() --> schedule() --> kvm_sched_out()
--> vmx_vcpu_put() --> vmx_vcpu_pi_put() --> pi_enable_wakeup_handler(),
the lock order is:

  &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)

However, "path irq" and "path sched_out" cannot actually race: "path irq"
runs in interrupt context, while "path sched_out" runs with IRQs disabled
on the same pCPU as "path irq".

Given that "path sched_out" is the very path that teaches lockdep the
ordering &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu), it is
desirable for "path irq" not to take the same per-CPU lock as
"path sched_out".

So, in this patch, the single wakeup_list lock is split into a sched_in
lock and a sched_out lock, which the three paths use as follows (lock
usage sketched below):

- "path sched_out": adds the vCPU to the list of the current pCPU
  (IRQs disabled).
  It takes only the sched_out lock.

- "path irq": walks the vCPU list of the current pCPU (IRQ context,
  running on the same pCPU as "path sched_out").
  It takes only the sched_in lock.

- "path sched_in": deletes the vCPU from the list of the _previous_ pCPU
  (IRQs disabled, running on the same or a different pCPU than "path irq").
  It takes both the sched_in and the sched_out lock, as it can race with
  the other two paths (though in theory it can never race with
  "path sched_out").
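As a rough illustration, the lock usage of the three paths boils down to
the following (a condensed sketch of the functions touched by the diff
below, not the literal code; error handling and unrelated details omitted):

  /* "path sched_out": pi_enable_wakeup_handler(), IRQs disabled */
  raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock_out, vcpu->cpu));
  list_add_tail(&vmx->pi_wakeup_list, &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
  raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock_out, vcpu->cpu));

  /* "path irq": pi_wakeup_handler(), IRQ context, same pCPU as "path sched_out" */
  raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock_in, cpu));
  list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
          if (pi_test_on(&vmx->pi_desc))
                  kvm_vcpu_wake_up(&vmx->vcpu);   /* may take &p->pi_lock */
  }
  raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock_in, cpu));

  /* "path sched_in": vmx_vcpu_pi_load(), IRQs disabled, same or another pCPU */
  raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock_in, vcpu->cpu));
  raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock_out, vcpu->cpu));
  list_del(&vmx->pi_wakeup_list);
  raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock_out, vcpu->cpu));
  raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock_in, vcpu->cpu));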
The lock orderings after this patch are:

  - &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock_out, cpu)
  - &per_cpu(wakeup_vcpus_on_cpu_lock_in, cpu) --> &per_cpu(wakeup_vcpus_on_cpu_lock_out, cpu)
  - &per_cpu(wakeup_vcpus_on_cpu_lock_in, cpu) --> &p->pi_lock

Currently, &rq->__lock is not held in "path sched_in".  However, if
"path sched_in" ever takes &p->pi_lock or &rq->__lock in the future,
lockdep will be able to detect it and warn.

Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
[sean: path sched_out and path irq do not race, path sched_in does not take &rq->__lock]
Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
---
v2: switch from RCU to two raw_spin_locks, as RCU may not let the IRQ
    handler see the list removal in a timely manner. (sean)
v1: https://lore.kernel.org/all/20230310155955.29652-1-yan.y.zhao@xxxxxxxxx/
---
 arch/x86/kvm/vmx/posted_intr.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 94c38bea60e7..92bbbfd161a0 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -23,13 +23,20 @@
  */
 static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
 /*
- * Protect the per-CPU list with a per-CPU spinlock to handle task migration.
+ * Protect the per-CPU list with two per-CPU spinlocks to handle task migration.
+ * IRQs must be disabled when taking the two locks, otherwise deadlock will
+ * occur if a wakeup IRQ arrives and attempts to acquire the locks.
+ * The ->sched_out() path of a vCPU that is about to block takes the "out
+ * lock", which is never taken by the wakeup IRQ handler running on the same
+ * pCPU as the ->sched_out() path.
  * When a blocking vCPU is awakened _and_ migrated to a different pCPU, the
  * ->sched_in() path will need to take the vCPU off the list of the _previous_
- * CPU.  IRQs must be disabled when taking this lock, otherwise deadlock will
- * occur if a wakeup IRQ arrives and attempts to acquire the lock.
+ * CPU.  It takes both the "in lock" and the "out lock" to guard against
+ * races on the list of the _previous_ CPU.
  */
-static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock_in);
+static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock_out);
+
 
 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
@@ -57,7 +64,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	struct pi_desc old, new;
 	unsigned long flags;
 	unsigned int dest;
-
 	/*
 	 * To simplify hot-plug and dynamic toggling of APICv, keep PI.NDST and
 	 * PI.SN up-to-date even if there is no assigned device or if APICv is
@@ -89,9 +95,11 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	 * current pCPU if the task was migrated.
 	 */
 	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
-		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock_in, vcpu->cpu));
+		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock_out, vcpu->cpu));
 		list_del(&vmx->pi_wakeup_list);
-		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock_out, vcpu->cpu));
+		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock_in, vcpu->cpu));
 	}
 
 	dest = cpu_physical_id(cpu);
@@ -152,10 +160,10 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 
 	local_irq_save(flags);
 
-	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock_out, vcpu->cpu));
 	list_add_tail(&vmx->pi_wakeup_list,
 		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
-	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock_out, vcpu->cpu));
 
 	WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
 
@@ -219,12 +227,11 @@ void pi_wakeup_handler(void)
 {
 	int cpu = smp_processor_id();
 	struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
-	raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
+	raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock_in, cpu);
 	struct vcpu_vmx *vmx;
 
 	raw_spin_lock(spinlock);
 	list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
-
 		if (pi_test_on(&vmx->pi_desc))
 			kvm_vcpu_wake_up(&vmx->vcpu);
 	}
@@ -234,7 +241,8 @@ void pi_wakeup_handler(void)
 void __init pi_init_cpu(int cpu)
 {
 	INIT_LIST_HEAD(&per_cpu(wakeup_vcpus_on_cpu, cpu));
-	raw_spin_lock_init(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
+	raw_spin_lock_init(&per_cpu(wakeup_vcpus_on_cpu_lock_in, cpu));
+	raw_spin_lock_init(&per_cpu(wakeup_vcpus_on_cpu_lock_out, cpu));
 }
 
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)

base-commit: 89400df96a7570b651404bbc3b7afe627c52a192
-- 
2.17.1