Lockdep failure due to 'wierd' per-cpu wakeup_vcpus_on_cpu_lock lock

Maxim Levitsky <mlevitsk@xxxxxxxxxx> · Tue, 18 Mar 2025 22:19:25 -0400

Hi!

I recently came up with an interesting failure in the CI pipeline.

[  592.704446] WARNING: possible circular locking dependency detected 
[  592.710625] 6.12.0-36.el10.x86_64+debug #1 Not tainted 
[  592.715764] ------------------------------------------------------ 
[  592.721946] swapper/19/0 is trying to acquire lock: 
[  592.726823] ff110001b0e64ec0 (&p->pi_lock)\{-.-.}-\{2:2}, at: try_to_wake_up+0xa7/0x15c0 
[  592.734761]  
[  592.734761] but task is already holding lock: 
[  592.740596] ff1100079ec0c058 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu))\{-...}-\{2:2}, at: pi_wakeup_handler+0x60/0x130 [kvm_intel] 
[  592.752185]  
[  592.752185] which lock already depends on the new lock. 
[  592.752185]  
[  592.760357]  
[  592.760357] the existing dependency chain (in reverse order) is: 
[  592.767837]  
[  592.767837] -> #2 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu))\{-...}-\{2:2}: 
[  592.776027]        __lock_acquire+0xcac/0x1d20 
[  592.780489]        lock_acquire.part.0+0x11b/0x360 
[  592.785300]        _raw_spin_lock+0x37/0x80 
[  592.789484]        pi_enable_wakeup_handler+0x11a/0x5a0 [kvm_intel] 
[  592.795776]        vmx_vcpu_pi_put+0x1d3/0x230 [kvm_intel] 
[  592.801282]        vmx_vcpu_put+0x12/0x20 [kvm_intel] 
[  592.806360]        kvm_arch_vcpu_put+0x494/0x7a0 [kvm] 
[  592.811592]        kvm_sched_out+0x161/0x1c0 [kvm] 
[  592.816456]        prepare_task_switch+0x36c/0xe20 
[  592.821247]        __schedule+0x481/0x1a50 
[  592.825346]        schedule+0xd4/0x280 
[  592.829100]        kvm_vcpu_block+0xe4/0x1d0 [kvm] 
[  592.833964]        kvm_vcpu_halt+0x1a2/0x800 [kvm] 
[  592.838824]        vcpu_run+0x53f/0x9e0 [kvm] 
[  592.843263]        kvm_arch_vcpu_ioctl_run+0x371/0x1480 [kvm] 
[  592.849078]        kvm_vcpu_ioctl+0x45e/0xc70 [kvm] 
[  592.854034]        __x64_sys_ioctl+0x12e/0x1a0 
[  592.858481]        do_syscall_64+0x92/0x180 
[  592.862685]        entry_SYSCALL_64_after_hwframe+0x76/0x7e 
[  592.868276]  
[  592.868276] -> #1 (&rq->__lock)\{-.-.}-\{2:2}: 
[  592.874038]        __lock_acquire+0xcac/0x1d20 
[  592.878485]        lock_acquire.part.0+0x11b/0x360 
[  592.883277]        _raw_spin_lock_nested+0x3b/0x70 
[  592.888069]        raw_spin_rq_lock_nested+0x2e/0x130 
[  592.893121]        __task_rq_lock+0xdb/0x570 
[  592.897393]        wake_up_new_task+0x750/0x1050 
[  592.902014]        kernel_clone+0x13e/0x5f0 
[  592.906209]        user_mode_thread+0xa4/0xe0 
[  592.910568]        rest_init+0x1e/0x1c0 
[  592.914407]        start_kernel+0x3ab/0x3b0 
[  592.918592]        x86_64_start_reservations+0x24/0x30 
[  592.923732]        x86_64_start_kernel+0x9c/0xa0 
[  592.928353]        common_startup_64+0x13e/0x141 
[  592.932988]  
[  592.932988] -> #0 (&p->pi_lock)\{-.-.}-\{2:2}: 
[  592.938754]        check_prev_add+0x1b7/0x23e0 
[  592.943200]        validate_chain+0xa8a/0xf00 
[  592.947557]        __lock_acquire+0xcac/0x1d20 
[  592.952005]        lock_acquire.part.0+0x11b/0x360 
[  592.956797]        _raw_spin_lock_irqsave+0x46/0x90 
[  592.961676]        try_to_wake_up+0xa7/0x15c0 
[  592.966035]        rcuwait_wake_up+0x80/0x190 
[  592.970396]        kvm_vcpu_wake_up+0x19/0xa0 [kvm] 
[  592.975353]        pi_wakeup_handler+0xf4/0x130 [kvm_intel] 
[  592.980950]        sysvec_kvm_posted_intr_wakeup_ipi+0x9c/0xd0 
[  592.986783]        asm_sysvec_kvm_posted_intr_wakeup_ipi+0x1a/0x20 
[  592.992964]        finish_task_switch.isra.0+0x228/0x830 
[  592.998275]        __schedule+0x7c7/0x1a50 
[  593.002376]        schedule_idle+0x59/0x90 
[  593.006477]        do_idle+0x155/0x200 
[  593.010244]        cpu_startup_entry+0x54/0x60 
[  593.014693]        start_secondary+0x212/0x290 
[  593.019138]        common_startup_64+0x13e/0x141 
[  593.023758]  
[  593.023758] other info that might help us debug this: 
[  593.023758]  
[  593.031757] Chain exists of: 
[  593.031757]   &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) 
[  593.031757]  
[  593.043932]  Possible unsafe locking scenario: 
[  593.043932]  
[  593.049852]        CPU0                    CPU1 
[  593.054384]        ----                    ---- 
[  593.058918]   lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); 
[  593.064576]                                lock(&rq->__lock);
[  593.070323]                                lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); 
[  593.078496]   lock(&p->pi_lock); 
[  593.081745]  
[  593.081745]  *** DEADLOCK *** 
[  593.081745]  
[  593.087666] 2 locks held by swapper/19/0: 
[  593.091679]  #0: ff1100079ec0c058 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu))\{-...}-\{2:2}, at: pi_wakeup_handler+0x60/0x130 [kvm_intel] 
[  593.103692]  #1: ffffffffa9b3eec0 (rcu_read_lock)\{....}-\{1:2}, at: rcuwait_wake_up+0x22/0x190 
[  593.112227]  

[  593.112227] stack backtrace: 
[  593.116590] CPU: 19 UID: 0 PID: 0 Comm: swapper/19 Kdump: loaded Not tainted 6.12.0-36.el10.x86_64+debug #1 
[  593.126338] Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 2.1.5 03/14/2024 
[  593.133818] Call Trace: 
[  593.136271]  <IRQ> 
[  593.138291]  dump_stack_lvl+0x6f/0xb0 
[  593.141965]  print_circular_bug.cold+0x38/0x48 
[  593.146427]  check_noncircular+0x308/0x3f0 
[  593.150545]  ? __pfx_check_noncircular+0x10/0x10 
[  593.155180]  ? __module_address+0x95/0x240 
[  593.159300]  ? alloc_chain_hlocks+0x33b/0x520 
[  593.163675]  check_prev_add+0x1b7/0x23e0 
[  593.167620]  validate_chain+0xa8a/0xf00 
[  593.171478]  ? __pfx_validate_chain+0x10/0x10 
[  593.175850]  ? mark_lock+0x78/0x860 
[  593.179360]  ? __pfx_lockdep_lock+0x10/0x10 
[  593.183549]  __lock_acquire+0xcac/0x1d20 
[  593.187494]  ? __pfx_validate_chain+0x10/0x10 
[  593.191867]  lock_acquire.part.0+0x11b/0x360 
[  593.196156]  ? try_to_wake_up+0xa7/0x15c0 
[  593.200172]  ? __pfx_lock_acquire.part.0+0x10/0x10 
[  593.204980]  ? rcu_is_watching+0x15/0xb0 
[  593.208907]  ? trace_lock_acquire+0x1b9/0x280 
[  593.213265]  ? try_to_wake_up+0xa7/0x15c0 
[  593.217294]  ? lock_acquire+0x31/0xc0 
[  593.220961]  ? try_to_wake_up+0xa7/0x15c0 
[  593.224975]  _raw_spin_lock_irqsave+0x46/0x90 
[  593.229351]  ? try_to_wake_up+0xa7/0x15c0 
[  593.233362]  try_to_wake_up+0xa7/0x15c0 
[  593.237220]  ? trace_lock_acquire+0x1b9/0x280 
[  593.241581]  ? __pfx_try_to_wake_up+0x10/0x10 
[  593.245958]  rcuwait_wake_up+0x80/0x190 
[  593.249816]  kvm_vcpu_wake_up+0x19/0xa0 [kvm] 
[  593.254260]  pi_wakeup_handler+0xf4/0x130 [kvm_intel] 
[  593.259338]  sysvec_kvm_posted_intr_wakeup_ipi+0x9c/0xd0 
[  593.264668]  </IRQ> 
[  593.266772]  <TASK> 
[  593.268880]  asm_sysvec_kvm_posted_intr_wakeup_ipi+0x1a/0x20 
[  593.274557] RIP: 0010:finish_task_switch.isra.0+0x228/0x830 
[  593.280147] Code: 41 ff 4d 00 0f 84 95 03 00 00 81 7d d0 80 00 00 00 0f 84 bd 03 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc <49> 8d be 00 05 00 00 48 b8 00 00 00 00 00 fc ff df
48 89 fa 48 c1 
[  593.298894] RSP: 0018:ffa0000000747cd8 EFLAGS: 00000206 
[  593.304137] RAX: 000000000016e47d RBX: ff1100079ec08e40 RCX: 0000000000000000 
[  593.311269] RDX: 0000000000000000 RSI: ffffffffa82f4140 RDI: ffffffffa85a2fa0 
[  593.318403] RBP: ffa0000000747d18 R08: 0000000000000001 R09: 0000000000000000 
[  593.325534] R10: 0000000000000001 R11: 0000000000000000 R12: ff110001b0e63f80 
[  593.332667] R13: 0000000000000000 R14: ff11000101dabf80 R15: ff1100079ec09e60 
[  593.339803]  ? finish_task_switch.isra.0+0x1b9/0x830 
[  593.344785]  ? __switch_to+0xc31/0xfa0 
[  593.348537]  ? __switch_to_asm+0x3d/0x70 
[  593.352483]  __schedule+0x7c7/0x1a50 
[  593.356077]  ? __pfx___schedule+0x10/0x10 
[  593.360107]  ? __pfx_sched_ttwu_pending+0x10/0x10 
[  593.364829]  ? rcu_is_watching+0x15/0xb0 
[  593.368777]  schedule_idle+0x59/0x90 
[  593.372370]  do_idle+0x155/0x200 
[  593.375605]  cpu_startup_entry+0x54/0x60 
[  593.379547]  start_secondary+0x212/0x290 
[  593.383483]  ? __pfx_start_secondary+0x10/0x10 
[  593.387945]  ? startup_64_load_idt+0xc1/0xf0 
[  593.392235]  common_startup_64+0x13e/0x141 
[  593.396355]  </TASK> 

I reproduced the failure upstream as well. 
In fact it reproduces well on a Sapphire Rapids system which I use for testing
('Intel(R) Xeon(R) Silver 4410Y')

To reproduce I just need to run the kvm selftests on a lockdep enabled kernel.

If I am not mistaken the following happens:

1. pi_enable_wakeup_handler() is called when the current vCPU is scheduled out

   it will then
	- *disable interrupts*
	- take a per-cpu lock 'wakeup_vcpus_on_cpu_lock' of the *current CPU*
	- add the current vCPU to the per-cpu list which is protected by the lock above.
	- enable interrupts

2. pi_wakeup_handler (IRQ handler) will also
	- take wakeup_vcpus_on_cpu_lock of *current CPU*
	- go over the vCPUs in the wakeup_vcpus_on_cpu list of *current CPU*, and kick them.

AFAIK, (1) and (2) can't race with each other and in fact no locking is needed here to prevent
a race.

The lock is required to prevent race with another place with needs access to the 'wakeup_vcpus_on_cpu' list:

(3)

When a sleeping vCPU is waken up it is scheduled to run (e.g. due to the kick above or other reason)
CPU, then  *vmx_vcpu_pi_load* will

	- disable interrupts (to avoid race with (1))
	- take wakeup_vcpus_on_cpu_lock (maybe of a different cpu)
	- remove this vCPU from the wakeup_vcpus_on_cpu list of that different CPU
	- release the lock and re-enable interrupts.

As far as I see, there is no race, but lockdep doesn't understand this.

It thinks that:

1. pi_enable_wakeup_handler is called from schedule() which holds rq->lock, and it itself takes wakeup_vcpus_on_cpu_lock lock

2. pi_wakeup_handler takes wakeup_vcpus_on_cpu_lock and then calls try_to_wake_up which can eventually take rq->lock
(at the start of the function there is a list of cases when it takes it)

I don't know lockdep well yet, but maybe a lockdep annotation will help, 
if we can tell it that there are multiple 'wakeup_vcpus_on_cpu_lock' locks.

What do you think?

Best regards,
	Maxim Levitsky