Re: [PATCH bpf-next v5 1/6] bpf/helpers: introduce sleepable bpf_timers

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Thu, 4 Apr 2024 09:40:48 -0700

On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
<benjamin.tissoires@xxxxxxxxxx> wrote:
>
>
> >
> > So we need something like:
> >
> > struct bpf_hrtimer {
> >   union {
> >     struct hrtimer timer;
> > +   struct work_struct work;
> >   };
> >   struct bpf_map *map;
> >   struct bpf_prog *prog;
> >   void __rcu *callback_fn;
> >   void *value;
> >   union {
>
> Are you sure we need an union here? If we get to call kfree_rcu() we
> need to have both struct rcu_head and sync_work, not one or the other.

why? with an extra flag it's one or the other.
In bpf_timer_cancel_and_free()
if (flag & SLEEPABLE) {
    schedule_work() to cancel_work_sync + kfree_rcu
} else {
   hrtimer_cancel
   kfree_rcu
}

> >     struct rcu_head rcu;
> > +   struct work_struct sync_work;
> >   };
> > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
>
> If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> (like in my v2 or v3 IIRC). But that means that once a timer is
> initialized it needs to be of one or the other type (especially true
> with the first union in this struct).

yes. That's an idea.
The code to support wq vs timer seems to be diverging more
than what we expected initially.
It seems cleaner to set it as init time and enforce in
other helpers.

Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
too far.
It's already at 112 bytes and some people use bpf_timer per flow.
So potentially millions of such timers.
Adding extra sizeof(struct work_struct)=32 * 2 that won't be
used is too much.
Note that sizeof(struct hrtimer)=64, so unions make everything
fit nicely.

> So should we reject during run time bpf_timer_set_callback() for
> sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> those? (and the invert in the other case).

yes.

> This version of the patch allows for one timer to be used as softIRQ
> or WQ, depending on the timer_set_callback that is used. But it might
> be simpler for the kfree_rcu race to define the bpf_timer to be of one
> kind, so we are sure to call the correct kfree method.

I think one or another simplifies the code and makes it easier
to think through combinations.

I'm still contemplating adding new "struct bpf_wq" and new kfuncs
to completely separate wq vs timer.
The code reuse seems to be relatively small.
We can potentially factor out internals of bpf_timer_* into smaller
helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.

One more thing.
bpf_timer_cancel() api turned out to be troublesome.
Not only it cancels the timer, but drops callback too.
It was a surprising behavior for people familiar with
kernel timer api-s.
We should not repeat this mistake with wq.

We can try to fix bpf_timer_cancel() too.
If we drop drop_prog_refcnt() from it it shouldn't affect
existing bpf_timer users who are forced to do:
bpf_timer_cancel()
bpf_timer_set_callback()
bpf_timer_start()
all the time.
If/when bpf_timer_cancel() stops dropping the callback
such bpf prog won't be affected. So low chance of breaking any prog.
wdyt?