Hi, we're facing an interesting issue with a BPF program that writes into a
bpf_ringbuf from different CPUs on an RT kernel. Here is my attempt to
reproduce it on QEMU:

======================================================
WARNING: possible circular locking dependency detected
6.9.0-rt5-g66834e17536e #3 Not tainted
------------------------------------------------------
swapper/4/0 is trying to acquire lock:
ffffc90006b4d118 (&lock->wait_lock){....}-{2:2}, at: rt_spin_lock+0x6d/0x100

but task is already holding lock:
ffffc90006b4d158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&rb->spinlock){....}-{2:2}:
       lock_acquire+0xc5/0x300
       rt_spin_lock+0x2a/0x100
       __bpf_ringbuf_reserve+0x5a/0xf0
       bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
       bpf_trace_run4+0xae/0x1e0
       __schedule+0x42c/0xca0
       preempt_schedule_notrace+0x37/0x60
       preempt_schedule_notrace_thunk+0x1a/0x30
       rcu_is_watching+0x32/0x40
       __flush_work+0x30b/0x480
       n_tty_poll+0x131/0x1d0
       tty_poll+0x54/0x90
       do_select+0x490/0x9b0
       core_sys_select+0x238/0x620
       kern_select+0x101/0x190
       __x64_sys_select+0x21/0x30
       do_syscall_64+0xbc/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #2 (&rq->__lock){-...}-{2:2}:
       lock_acquire+0xc5/0x300
       _raw_spin_lock_nested+0x2e/0x40
       raw_spin_rq_lock_nested+0x15/0x30
       task_fork_fair+0x3e/0xb0
       sched_cgroup_fork+0xe9/0x110
       copy_process+0x1b76/0x2fd0
       kernel_clone+0xab/0x3e0
       user_mode_thread+0x5f/0x90
       rest_init+0x1e/0x160
       start_kernel+0x61d/0x620
       x86_64_start_reservations+0x24/0x30
       x86_64_start_kernel+0x8c/0x90
       common_startup_64+0x13e/0x148

-> #1 (&p->pi_lock){-...}-{2:2}:
       lock_acquire+0xc5/0x300
       _raw_spin_lock+0x30/0x40
       rtlock_slowlock_locked+0x130/0x1c70
       rt_spin_lock+0x78/0x100
       prepare_to_wait_event+0x1a/0x140
       wake_up_and_wait_for_irq_thread_ready+0xc3/0xe0
       __setup_irq+0x374/0x660
       request_threaded_irq+0xe5/0x180
       acpi_os_install_interrupt_handler+0xb7/0xe0
       acpi_ev_install_xrupt_handlers+0x22/0x90
       acpi_init+0x8f/0x4d0
       do_one_initcall+0x73/0x2d0
       kernel_init_freeable+0x24a/0x290
       kernel_init+0x1a/0x130
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x1a/0x30

-> #0 (&lock->wait_lock){....}-{2:2}:
       check_prev_add+0xeb/0xd80
       __lock_acquire+0x113e/0x15b0
       lock_acquire+0xc5/0x300
       _raw_spin_lock_irqsave+0x3c/0x60
       rt_spin_lock+0x6d/0x100
       __bpf_ringbuf_reserve+0x5a/0xf0
       bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
       bpf_trace_run4+0xae/0x1e0
       __schedule+0x42c/0xca0
       schedule_idle+0x20/0x40
       cpu_startup_entry+0x29/0x30
       start_secondary+0xfa/0x100
       common_startup_64+0x13e/0x148

other info that might help us debug this:

Chain exists of:
  &lock->wait_lock --> &rq->__lock --> &rb->spinlock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&rb->spinlock);
                               lock(&rq->__lock);
                               lock(&rb->spinlock);
  lock(&lock->wait_lock);

 *** DEADLOCK ***

3 locks held by swapper/4/0:
 #0: ffff88813bd32558 (&rq->__lock){-...}-{2:2}, at: __schedule+0xc4/0xca0
 #1: ffffffff83590540 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run4+0x6c/0x1e0
 #2: ffffc90006b4d158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0

stack backtrace:
CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.9.0-rt5-g66834e17536e #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x6f/0xb0
 print_circular_bug.cold+0x178/0x1be
 check_noncircular+0x14e/0x170
 check_prev_add+0xeb/0xd80
 __lock_acquire+0x113e/0x15b0
 lock_acquire+0xc5/0x300
 ? rt_spin_lock+0x6d/0x100
 _raw_spin_lock_irqsave+0x3c/0x60
 ? rt_spin_lock+0x6d/0x100
 rt_spin_lock+0x6d/0x100
 __bpf_ringbuf_reserve+0x5a/0xf0
 bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
 bpf_trace_run4+0xae/0x1e0
 __schedule+0x42c/0xca0
 schedule_idle+0x20/0x40
 cpu_startup_entry+0x29/0x30
 start_secondary+0xfa/0x100
 common_startup_64+0x13e/0x148
 </TASK>

CPU: 1 PID: 160 Comm: screen Not tainted 6.9.0-rt5-g66834e17536e #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x6f/0xb0
 __might_resched.cold+0xcc/0xdf
 rt_spin_lock+0x4c/0x100
 ? __bpf_ringbuf_reserve+0x5a/0xf0
 __bpf_ringbuf_reserve+0x5a/0xf0
 bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
 bpf_trace_run4+0xae/0x1e0
 __schedule+0x42c/0xca0
 preempt_schedule_notrace+0x37/0x60
 preempt_schedule_notrace_thunk+0x1a/0x30
 ? __flush_work+0x84/0x480
 rcu_is_watching+0x32/0x40
 __flush_work+0x30b/0x480
 n_tty_poll+0x131/0x1d0
 tty_poll+0x54/0x90
 do_select+0x490/0x9b0
 ? __bfs+0x136/0x230
 ? do_select+0x26d/0x9b0
 ? __pfx_pollwake+0x10/0x10
 ? __pfx_pollwake+0x10/0x10
 ? core_sys_select+0x238/0x620
 core_sys_select+0x238/0x620
 kern_select+0x101/0x190
 __x64_sys_select+0x21/0x30
 do_syscall_64+0xbc/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The BPF program in question is attached to sched_switch. The issue looks
similar to a couple of syzkaller reports [1], [2], although the latter is
about nested progs, which does not seem to be the case here. Speaking of
nested progs, applying an approach similar to the one in [3], reworked for
bpf_ringbuf, eliminates the issue. Am I missing anything, or is this a
known issue? Any ideas on how to address it?

[1]: https://lore.kernel.org/all/0000000000000656bf061a429057@xxxxxxxxxx/
[2]: https://lore.kernel.org/lkml/0000000000004aa700061379547e@xxxxxxxxxx/
[3]: https://lore.kernel.org/bpf/20240514124052.1240266-2-sidchintamaneni@xxxxxxxxx/
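For completeness, below is a minimal sketch of the kind of prog described
above. It is not the exact program from the report: the event layout, the
ringbuf size, and the tp_btf attach type are assumptions for illustration.
The point is only that bpf_ringbuf_reserve() runs from the sched_switch
path on every CPU:

// Minimal sketch, not the exact prog from the report: event layout,
// ringbuf size and tp_btf attach type are illustrative assumptions.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} rb SEC(".maps");

struct event {
	u32 prev_pid;
	u32 next_pid;
};

SEC("tp_btf/sched_switch")
int BPF_PROG(handle_sched_switch, bool preempt,
	     struct task_struct *prev, struct task_struct *next)
{
	struct event *e;

	/* bpf_ringbuf_reserve() takes rb->spinlock, which on PREEMPT_RT is
	 * the rt_spin_lock that shows up in the splat above. */
	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
	if (!e)
		return 0;

	e->prev_pid = prev->pid;
	e->next_pid = next->pid;
	bpf_ringbuf_submit(e, 0);
	return 0;
}

The exact event contents do not matter; what matters is that the ringbuf
reservation happens under the scheduler's locks, concurrently from
different CPUs.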