On Mon, Mar 25, 2024 at 1:50 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Mar 22, 2024 at 7:56 AM Benjamin Tissoires <bentiss@xxxxxxxxxx> wrote:
> >
> > They are implemented as a workqueue, which means that there are no
> > guarantees of timing or ordering.
> >
> > Signed-off-by: Benjamin Tissoires <bentiss@xxxxxxxxxx>
> >
> > ---
> >
> > no changes in v5
> >
> > changes in v4:
> > - dropped __bpf_timer_compute_key()
> > - use a spin_lock instead of a semaphore
> > - ensure bpf_timer_cancel_and_free is not complaining about
> >   non sleepable context and use cancel_work() instead of
> >   cancel_work_sync()
> > - return -EINVAL if a delay is given to bpf_timer_start() with
> >   BPF_F_TIMER_SLEEPABLE
> >
> > changes in v3:
> > - extracted the implementation in bpf_timer only, without
> >   bpf_timer_set_sleepable_cb()
> > - rely on schedule_work() only, from bpf_timer_start()
> > - add semaphore to ensure bpf_timer_work_cb() is accessing
> >   consistent data
> >
> > changes in v2 (compared to the one attached to v1 0/9):
> > - make use of a kfunc
> > - add a (non-used) BPF_F_TIMER_SLEEPABLE
> > - the callback is *not* called, it makes the kernel crash
> > ---
> >  include/uapi/linux/bpf.h |  4 +++
> >  kernel/bpf/helpers.c     | 86 ++++++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 88 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 3c42b9f1bada..b90def29d796 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7461,10 +7461,14 @@ struct bpf_core_relo {
> >   *     - BPF_F_TIMER_ABS: Timeout passed is absolute time, by default it is
> >   *       relative to current time.
> >   *     - BPF_F_TIMER_CPU_PIN: Timer will be pinned to the CPU of the caller.
> > + *     - BPF_F_TIMER_SLEEPABLE: Timer will run in a sleepable context, with
> > + *       no guarantees of ordering nor timing (consider this as being just
> > + *       offloaded immediately).
> >   */
> >  enum {
> >         BPF_F_TIMER_ABS = (1ULL << 0),
> >         BPF_F_TIMER_CPU_PIN = (1ULL << 1),
> > +       BPF_F_TIMER_SLEEPABLE = (1ULL << 2),
> >  };
> >
> >  /* BPF numbers iterator state */
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index a89587859571..38de73a9df83 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1094,14 +1094,20 @@ const struct bpf_func_proto bpf_snprintf_proto = {
> >   * bpf_timer_cancel() cancels the timer and decrements prog's refcnt.
> >   * Inner maps can contain bpf timers as well. ops->map_release_uref is
> >   * freeing the timers when inner map is replaced or deleted by user space.
> > + *
> > + * sleepable_lock protects only the setup of the workqueue, not the callback
> > + * itself. This is done to ensure we don't run concurrently a free of the
> > + * callback or the associated program.
>
> I recall there was a discussion about this lock earlier,
> but I don't remember what the conclusion was.

There wasn't much of a conclusion TBH.

> The above comment is not enough to understand what it protects.
>
> In general, how is a sleepable cb fundamentally different
> from a non-sleepable one when it comes to races?

I think there are 2 main differences:
- it's sleepable, so classic RCU locking doesn't work (I didn't know
  about rcu_read_lock_trace() up to now)
- when we cancel(_and_free) the program, we cannot afford to wait for
  the program to finish because that program might take ages to do so
  (see the sketch below).
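To make that second point concrete, here is the kind of call we must
not make (illustrative sketch only, cancel_sketch() is not part of the
series):

/* Sketch: why the workqueue variant can not mirror hrtimers.
 * bpf_timer_cancel() may itself run from a non-sleepable BPF program.
 */
static int cancel_sketch(struct bpf_hrtimer *t)
{
	/* fine for hrtimers: the soft IRQ callback can not sleep,
	 * so waiting for it here is bounded
	 */
	hrtimer_cancel(&t->timer);

	/* NOT fine for the sleepable case: the callback may run for
	 * ages and cancel_work_sync() would sleep waiting for it,
	 * while we might be in a non-sleepable context; hence the
	 * switch to cancel_work() in v4
	 */
	cancel_work_sync(&t->work);

	return 0;
}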
OTOH, hrtimer callbacks run in soft IRQ with IRQs disabled, so nothing
can interrupt them AFAICT. We can also wait for the timer cb to finish
in that case because it can't sleep.

> bpf_timer_set_callback() is racy for both sleepable and non-sleepable
> and the latter handles it fine.

I don't think bpf_timer_set_callback() is the problem: in both cases
(sleepable or not) we are under the spinlock from bpf_timer_kern, so
the race is cut short there.

> Note that struct bpf_hrtimer is rcu protected.
> See kfree_rcu(t, rcu); in bpf_timer_cancel_and_free().

Sorry, RCU is still always hard to grasp for me, and even if I think I
get it, I still don't see how this would be sufficient in sleepable
bpf_timer_work_cb() without any lock protecting the very first access
to the struct bpf_hrtimer.

>
> >   */
> >  struct bpf_hrtimer {
> >         struct hrtimer timer;
> > +       struct work_struct work;
> >         struct bpf_map *map;
> >         struct bpf_prog *prog;
> >         void __rcu *callback_fn;
> >         void *value;
> >         struct rcu_head rcu;
> > +       spinlock_t sleepable_lock;
> >  };
> >
> >  /* the actual struct hidden inside uapi struct bpf_timer */
> > @@ -1114,6 +1120,49 @@ struct bpf_timer_kern {
> >         struct bpf_spin_lock lock;
> >  } __attribute__((aligned(8)));
> >
> > +static void bpf_timer_work_cb(struct work_struct *work)
> > +{
> > +       struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
> > +       struct bpf_map *map = t->map;
> > +       bpf_callback_t callback_fn;
> > +       void *value = t->value;
> > +       unsigned long flags;
> > +       void *key;
> > +       u32 idx;
> > +
> > +       BTF_TYPE_EMIT(struct bpf_timer);
> > +
> > +       spin_lock_irqsave(&t->sleepable_lock, flags);
> > +
> > +       callback_fn = READ_ONCE(t->callback_fn);
> > +       if (!callback_fn) {
> > +               spin_unlock_irqrestore(&t->sleepable_lock, flags);
> > +               return;
> > +       }
> > +
> > +       if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> > +               struct bpf_array *array = container_of(map, struct bpf_array, map);
> > +
> > +               /* compute the key */
> > +               idx = ((char *)value - array->value) / array->elem_size;
> > +               key = &idx;
> > +       } else { /* hash or lru */
> > +               key = value - round_up(map->key_size, 8);
> > +       }
> > +
> > +       /* prevent the callback to be freed by bpf_timer_cancel() while running
> > +        * so we can release the sleepable lock
> > +        */
> > +       bpf_prog_inc(t->prog);
> > +
> > +       spin_unlock_irqrestore(&t->sleepable_lock, flags);
>
> why prog_inc ?
> The sleepable progs need rcu_read_lock_trace() + migrate_disable()
> anyway, which are missing here.
> Probably best to call __bpf_prog_enter_sleepable_recur()
> like kern_sys_bpf() does.

Sounds like a good idea. But as I was playing with it, I realized that
t->prog is not RCU protected, so I have no guarantee that the value is
correct while calling __bpf_prog_enter_sleepable_recur(t->prog,
&run_ctx)...

Should I manually call rcu_read_lock_trace() first, before
__bpf_prog_enter_sleepable_recur(t->prog, &run_ctx)?

> Now with that, the bpf_timer_cancel() can drop prog refcnt to zero
> and it's ok, since rcu_read_lock_trace() will protect it.

OK, this is a good step forward, thanks!

>
> > +
> > +       callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
> > +       /* The verifier checked that return value is zero. */
>
> the prog will finish and will be freed after rcu_read_unlock_trace().
> Seems fine to me.

Ack
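So IIUC, bpf_timer_work_cb() would become something like the sketch
below. It is untested: the run_ctx handling is copied from
kern_sys_bpf(), and the explicit rcu_read_lock_trace() around the
t->prog accesses is my (unverified) answer to the question above.

static void bpf_timer_work_cb(struct work_struct *work)
{
	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
	struct bpf_tramp_run_ctx __maybe_unused run_ctx;
	struct bpf_map *map = t->map;
	bpf_callback_t callback_fn;
	void *value = t->value;
	void *key;
	u32 idx;

	BTF_TYPE_EMIT(struct bpf_timer);

	/* keep t->prog and t->callback_fn alive until
	 * rcu_read_unlock_trace()
	 */
	rcu_read_lock_trace();

	callback_fn = READ_ONCE(t->callback_fn);
	if (!callback_fn)
		goto out;

	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
		struct bpf_array *array = container_of(map, struct bpf_array, map);

		/* compute the key */
		idx = ((char *)value - array->value) / array->elem_size;
		key = &idx;
	} else { /* hash or lru */
		key = value - round_up(map->key_size, 8);
	}

	run_ctx.bpf_cookie = 0;
	if (!__bpf_prog_enter_sleepable_recur(t->prog, &run_ctx)) {
		/* recursion detected, give up */
		__bpf_prog_exit_sleepable_recur(t->prog, 0, &run_ctx);
		goto out;
	}

	callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
	/* The verifier checked that return value is zero. */

	__bpf_prog_exit_sleepable_recur(t->prog, 0, &run_ctx);
out:
	rcu_read_unlock_trace();
}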
> > +
> > +       bpf_prog_put(t->prog);
> > +}
> > +
> >  static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
> >
> >  static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
> > @@ -1192,6 +1241,8 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
> >         t->prog = NULL;
> >         rcu_assign_pointer(t->callback_fn, NULL);
> >         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> > +       INIT_WORK(&t->work, bpf_timer_work_cb);
> > +       spin_lock_init(&t->sleepable_lock);
> >         t->timer.function = bpf_timer_cb;
> >         WRITE_ONCE(timer->timer, t);
> >         /* Guarantee the order between timer->timer and map->usercnt. So
> > @@ -1237,6 +1288,7 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
> >                 ret = -EINVAL;
> >                 goto out;
> >         }
> > +       spin_lock(&t->sleepable_lock);
> >         if (!atomic64_read(&t->map->usercnt)) {
> >                 /* maps with timers must be either held by user space
> >                  * or pinned in bpffs. Otherwise timer might still be
> > @@ -1263,6 +1315,8 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
> >         }
> >         rcu_assign_pointer(t->callback_fn, callback_fn);
> >  out:
> > +       if (t)
> > +               spin_unlock(&t->sleepable_lock);
> >         __bpf_spin_unlock_irqrestore(&timer->lock);
>
> If lock is really needed why timer->lock cannot be reused?
> The pattern of two locks in pretty much the same data structure
> is begging for questions about what is going on here.

Agree, but I can't find a way to reuse timer->lock:
- ideally I should add struct work_struct into struct bpf_timer_kern
  directly, but there is a warning about the size of bpf_timer_kern
  which makes me feel like we cannot extend it
- adding a pointer back from struct bpf_hrtimer to bpf_timer_kern is
  also not a solution as we might be freed if we are outside of the
  lock in bpf_timer_kern...

Though if I had reliable access from bpf_timer_work_cb() to the
matching bpf_timer_kern, I could take ->lock while I need to access
->timer, and then everything would be much easier.

>
> >         return ret;
> >  }
> > @@ -1283,8 +1337,12 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
> >
> >         if (in_nmi())
> >                 return -EOPNOTSUPP;
> > -       if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
> > +       if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN | BPF_F_TIMER_SLEEPABLE))
> >                 return -EINVAL;
> > +
> > +       if ((flags & BPF_F_TIMER_SLEEPABLE) && nsecs)
> > +               return -EINVAL;
> > +
> >         __bpf_spin_lock_irqsave(&timer->lock);
> >         t = timer->timer;
> >         if (!t || !t->prog) {
> > @@ -1300,7 +1358,10 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
> >         if (flags & BPF_F_TIMER_CPU_PIN)
> >                 mode |= HRTIMER_MODE_PINNED;
> >
> > -       hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> > +       if (flags & BPF_F_TIMER_SLEEPABLE)
> > +               schedule_work(&t->work);
> > +       else
> > +               hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> >  out:
> >         __bpf_spin_unlock_irqrestore(&timer->lock);
> >         return ret;
> > @@ -1348,13 +1409,22 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> >                 ret = -EDEADLK;
> >                 goto out;
> >         }
> > +       spin_lock(&t->sleepable_lock);
> >         drop_prog_refcnt(t);
> > +       spin_unlock(&t->sleepable_lock);
>
> this also looks odd.

I basically need to protect "t->prog = NULL;" from happening while
bpf_timer_work_cb() is setting up the bpf program to be run.
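To spell out the race I am worried about if that lock goes away
(hand-made timeline, not from an actual trace):

/*
 * without sleepable_lock, with the v5 code above:
 *
 *   bpf_timer_work_cb()                 bpf_timer_cancel()
 *   -------------------                 ------------------
 *   callback_fn = READ_ONCE(t->callback_fn);
 *                                       drop_prog_refcnt(t)
 *                                         -> bpf_prog_put(t->prog)
 *                                            (last ref may be gone)
 *                                         -> t->prog = NULL
 *   bpf_prog_inc(t->prog);              <- NULL pointer dereference
 *   callback_fn(...);                   <- or runs a freed program
 */

With the lock held, bpf_timer_cancel() cannot clear t->prog between
the READ_ONCE() and the bpf_prog_inc() above.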
> >  out:
> >         __bpf_spin_unlock_irqrestore(&timer->lock);
> >         /* Cancel the timer and wait for associated callback to finish
> >          * if it was running.
> >          */
> >         ret = ret ?: hrtimer_cancel(&t->timer);
> > +
> > +       /* also cancel the sleepable work, but *do not* wait for
> > +        * it to finish if it was running as we might not be in a
> > +        * sleepable context
> > +        */
> > +       ret = ret ?: cancel_work(&t->work);
> > +
> >         rcu_read_unlock();
> >         return ret;
> >  }
> > @@ -1383,11 +1453,13 @@ void bpf_timer_cancel_and_free(void *val)
> >         t = timer->timer;
> >         if (!t)
> >                 goto out;
> > +       spin_lock(&t->sleepable_lock);
> >         drop_prog_refcnt(t);
> >         /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
> >          * this timer, since it won't be initialized.
> >          */
> >         WRITE_ONCE(timer->timer, NULL);
> > +       spin_unlock(&t->sleepable_lock);
>
> This one I don't understand either.

Same as above: I do not want t->prog to be set to NULL while
bpf_timer_work_cb() is using it.

Also, side note: if anyone feels like it would go faster to fix those
races themselves instead of teaching me how to properly do it, this is
definitely fine by me :)

Cheers,
Benjamin
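P.S.: in case it helps the review, this is roughly how I exercise the
flag from the BPF side (untested sketch; the map, section and function
names are made up):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define CLOCK_MONOTONIC 1

struct elem {
	struct bpf_timer t;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct elem);
} timer_map SEC(".maps");

static int timer_cb(void *map, int *key, struct bpf_timer *timer)
{
	/* runs from the workqueue, so it will eventually be allowed
	 * to sleep here
	 */
	return 0;
}

SEC("fentry/do_unlinkat")
int arm_sleepable(void *ctx)
{
	int key = 0;
	struct elem *val = bpf_map_lookup_elem(&timer_map, &key);

	if (!val)
		return 0;

	bpf_timer_init(&val->t, &timer_map, CLOCK_MONOTONIC);
	bpf_timer_set_callback(&val->t, timer_cb);
	/* nsecs must be 0 with BPF_F_TIMER_SLEEPABLE */
	bpf_timer_start(&val->t, 0, BPF_F_TIMER_SLEEPABLE);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";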