On Wed, Jan 8, 2025 at 10:07 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> From: Hou Tao <houtao1@xxxxxxxxxx>
>
> During the update procedure, when overwriting an element in a
> pre-allocated htab, the freeing of old_element is protected by the
> bucket lock. The bucket lock is necessary because the old_element has
> already been stashed in htab->extra_elems after alloc_htab_elem()
> returns. If the old_element were freed after the bucket lock is
> unlocked, the stashed element could be reused by a concurrent update
> procedure, and the freeing of old_element would run concurrently with
> the reuse of the old_element. However, the invocation of
> check_and_free_fields() may acquire a spin-lock, which violates the
> lockdep rule because its caller already holds a raw-spin-lock (the
> bucket lock). The following warning is reported when such a race
> happens:
>
>   BUG: scheduling while atomic: test_progs/676/0x00000003
>   3 locks held by test_progs/676:
>   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
>   #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
>   #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
>   Modules linked in: bpf_testmod(O)
>   Preemption disabled at:
>   [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
>   CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
>   Tainted: [W]=WARN, [O]=OOT_MODULE
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
>   Call Trace:
>   <TASK>
>   dump_stack_lvl+0x57/0x70
>   dump_stack+0x10/0x20
>   __schedule_bug+0x120/0x170
>   __schedule+0x300c/0x4800
>   schedule_rtlock+0x37/0x60
>   rtlock_slowlock_locked+0x6d9/0x54c0
>   rt_spin_lock+0x168/0x230
>   hrtimer_cancel_wait_running+0xe9/0x1b0

Since this issue is limited to RT:

#ifdef CONFIG_PREEMPT_RT
void hrtimer_cancel_wait_running(const struct hrtimer *timer);
#else
static inline void hrtimer_cancel_wait_running(struct hrtimer *timer)
{
	cpu_relax();
}
#endif

let's avoid overloading system_unbound_wq on !RT.

> -	if (this_cpu_read(hrtimer_running))
> -		queue_work(system_unbound_wq, &t->cb.delete_work);
> -	else
> -		bpf_timer_delete_work(&t->cb.delete_work);

Keep this sync call on !RT.

> +	if (!this_cpu_read(hrtimer_running) && hrtimer_try_to_cancel(&t->timer) >= 0) {
> +		kfree_rcu(t, cb.rcu);
> +		return;
> +	}
> +
> +	/* The timer is running on current or other CPU. Use a kworker to wait
> +	 * for the completion of the timer instead of spinning on current CPU
> +	 * or trying to acquire a sleepable lock to wait for its completion.
> +	 */
> +	queue_work(system_unbound_wq, &t->cb.delete_work);

Overloading system_unbound_wq even on !RT when hrtimer_running is set is
an issue to solve as well, but let's not double down on the problem and
make things worse.

The rest looks good.

pw-bot: cr