Hi, we're facing an interesting issue with a BPF program that writes into a
bpf_ringbuf from different CPUs on an RT kernel. Here is my attempt to
reproduce it on QEMU:

======================================================
WARNING: possible circular locking dependency detected
6.9.0-rt5-g66834e17536e #3 Not tainted
------------------------------------------------------
swapper/4/0 is trying to acquire lock:
ffffc90006b4d118 (&lock->wait_lock){....}-{2:2}, at: rt_spin_lock+0x6d/0x100

but task is already holding lock:
ffffc90006b4d158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&rb->spinlock){....}-{2:2}:
       lock_acquire+0xc5/0x300
       rt_spin_lock+0x2a/0x100
       __bpf_ringbuf_reserve+0x5a/0xf0
       bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
       bpf_trace_run4+0xae/0x1e0
       __schedule+0x42c/0xca0
       preempt_schedule_notrace+0x37/0x60
       preempt_schedule_notrace_thunk+0x1a/0x30
       rcu_is_watching+0x32/0x40
       __flush_work+0x30b/0x480
       n_tty_poll+0x131/0x1d0
       tty_poll+0x54/0x90
       do_select+0x490/0x9b0
       core_sys_select+0x238/0x620
       kern_select+0x101/0x190
       __x64_sys_select+0x21/0x30
       do_syscall_64+0xbc/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #2 (&rq->__lock){-...}-{2:2}:
       lock_acquire+0xc5/0x300
       _raw_spin_lock_nested+0x2e/0x40
       raw_spin_rq_lock_nested+0x15/0x30
       task_fork_fair+0x3e/0xb0
       sched_cgroup_fork+0xe9/0x110
       copy_process+0x1b76/0x2fd0
       kernel_clone+0xab/0x3e0
       user_mode_thread+0x5f/0x90
       rest_init+0x1e/0x160
       start_kernel+0x61d/0x620
       x86_64_start_reservations+0x24/0x30
       x86_64_start_kernel+0x8c/0x90
       common_startup_64+0x13e/0x148

-> #1 (&p->pi_lock){-...}-{2:2}:
       lock_acquire+0xc5/0x300
       _raw_spin_lock+0x30/0x40
       rtlock_slowlock_locked+0x130/0x1c70
       rt_spin_lock+0x78/0x100
       prepare_to_wait_event+0x1a/0x140
       wake_up_and_wait_for_irq_thread_ready+0xc3/0xe0
       __setup_irq+0x374/0x660
       request_threaded_irq+0xe5/0x180
       acpi_os_install_interrupt_handler+0xb7/0xe0
       acpi_ev_install_xrupt_handlers+0x22/0x90
       acpi_init+0x8f/0x4d0
       do_one_initcall+0x73/0x2d0
       kernel_init_freeable+0x24a/0x290
       kernel_init+0x1a/0x130
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x1a/0x30

-> #0 (&lock->wait_lock){....}-{2:2}:
       check_prev_add+0xeb/0xd80
       __lock_acquire+0x113e/0x15b0
       lock_acquire+0xc5/0x300
       _raw_spin_lock_irqsave+0x3c/0x60
       rt_spin_lock+0x6d/0x100
       __bpf_ringbuf_reserve+0x5a/0xf0
       bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
       bpf_trace_run4+0xae/0x1e0
       __schedule+0x42c/0xca0
       schedule_idle+0x20/0x40
       cpu_startup_entry+0x29/0x30
       start_secondary+0xfa/0x100
       common_startup_64+0x13e/0x148

other info that might help us debug this:

Chain exists of:
  &lock->wait_lock --> &rq->__lock --> &rb->spinlock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&rb->spinlock);
                               lock(&rq->__lock);
                               lock(&rb->spinlock);
  lock(&lock->wait_lock);

 *** DEADLOCK ***

3 locks held by swapper/4/0:
 #0: ffff88813bd32558 (&rq->__lock){-...}-{2:2}, at: __schedule+0xc4/0xca0
 #1: ffffffff83590540 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run4+0x6c/0x1e0
 #2: ffffc90006b4d158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0

stack backtrace:
CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.9.0-rt5-g66834e17536e #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x6f/0xb0
 print_circular_bug.cold+0x178/0x1be
 check_noncircular+0x14e/0x170
 check_prev_add+0xeb/0xd80
 __lock_acquire+0x113e/0x15b0
 lock_acquire+0xc5/0x300
 ? rt_spin_lock+0x6d/0x100
 _raw_spin_lock_irqsave+0x3c/0x60
 ? rt_spin_lock+0x6d/0x100
 rt_spin_lock+0x6d/0x100
 __bpf_ringbuf_reserve+0x5a/0xf0
 bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
 bpf_trace_run4+0xae/0x1e0
 __schedule+0x42c/0xca0
 schedule_idle+0x20/0x40
 cpu_startup_entry+0x29/0x30
 start_secondary+0xfa/0x100
 common_startup_64+0x13e/0x148
 </TASK>

CPU: 1 PID: 160 Comm: screen Not tainted 6.9.0-rt5-g66834e17536e #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x6f/0xb0
 __might_resched.cold+0xcc/0xdf
 rt_spin_lock+0x4c/0x100
 ? __bpf_ringbuf_reserve+0x5a/0xf0
 __bpf_ringbuf_reserve+0x5a/0xf0
 bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
 bpf_trace_run4+0xae/0x1e0
 __schedule+0x42c/0xca0
 preempt_schedule_notrace+0x37/0x60
 preempt_schedule_notrace_thunk+0x1a/0x30
 ? __flush_work+0x84/0x480
 rcu_is_watching+0x32/0x40
 __flush_work+0x30b/0x480
 n_tty_poll+0x131/0x1d0
 tty_poll+0x54/0x90
 do_select+0x490/0x9b0
 ? __bfs+0x136/0x230
 ? do_select+0x26d/0x9b0
 ? __pfx_pollwake+0x10/0x10
 ? __pfx_pollwake+0x10/0x10
 ? core_sys_select+0x238/0x620
 core_sys_select+0x238/0x620
 kern_select+0x101/0x190
 __x64_sys_select+0x21/0x30
 do_syscall_64+0xbc/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The BPF program in question is attached to sched_switch. The issue looks
similar to a couple of syzkaller reports [1], [2], although the latter is
about nested progs, which does not seem to be the case here. Speaking of
nested progs, applying an approach similar to the one in [3], reworked for
bpf_ringbuf, eliminates the issue. Am I missing anything, or is this a
known issue? Any ideas on how to address it?

[1]: https://lore.kernel.org/all/0000000000000656bf061a429057@xxxxxxxxxx/
[2]: https://lore.kernel.org/lkml/0000000000004aa700061379547e@xxxxxxxxxx/
[3]: https://lore.kernel.org/bpf/20240514124052.1240266-2-sidchintamaneni@xxxxxxxxx/
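For completeness, below is a minimal sketch of the kind of prog described
above. It is not the exact program from the report: the event layout, the
ringbuf size, and the tp_btf attach type are assumptions for illustration.
The point is only that bpf_ringbuf_reserve() runs from the sched_switch
path on every CPU:

// Minimal sketch, not the exact prog from the report: event layout,
// ringbuf size and tp_btf attach type are illustrative assumptions.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} rb SEC(".maps");

struct event {
	u32 prev_pid;
	u32 next_pid;
};

SEC("tp_btf/sched_switch")
int BPF_PROG(handle_sched_switch, bool preempt,
	     struct task_struct *prev, struct task_struct *next)
{
	struct event *e;

	/* bpf_ringbuf_reserve() takes rb->spinlock, which on PREEMPT_RT is
	 * the rt_spin_lock that shows up in the splat above. */
	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
	if (!e)
		return 0;

	e->prev_pid = prev->pid;
	e->next_pid = next->pid;
	bpf_ringbuf_submit(e, 0);
	return 0;
}

The exact event contents do not matter; what matters is that the ringbuf
reservation happens under the scheduler's locks, concurrently from
different CPUs.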