Re: [PATCH bpf-next v5 1/6] bpf/helpers: introduce sleepable bpf_timers

Benjamin Tissoires <benjamin.tissoires@xxxxxxxxxx> · Fri, 5 Apr 2024 17:46:29 +0200

On Thu, Apr 4, 2024 at 8:29 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Apr 4, 2024 at 10:56 AM Benjamin Tissoires
> <benjamin.tissoires@xxxxxxxxxx> wrote:
> >
> > On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
> > > <benjamin.tissoires@xxxxxxxxxx> wrote:
> > > >
> > > >
> > > > >
> > > > > So we need something like:
> > > > >
> > > > > struct bpf_hrtimer {
> > > > >   union {
> > > > >     struct hrtimer timer;
> > > > > +   struct work_struct work;
> > > > >   };
> > > > >   struct bpf_map *map;
> > > > >   struct bpf_prog *prog;
> > > > >   void __rcu *callback_fn;
> > > > >   void *value;
> > > > >   union {
> > > >
> > > > Are you sure we need an union here? If we get to call kfree_rcu() we
> > > > need to have both struct rcu_head and sync_work, not one or the other.
> > >
> > > why? with an extra flag it's one or the other.
> > > In bpf_timer_cancel_and_free()
> > > if (flag & SLEEPABLE) {
> > >     schedule_work() to cancel_work_sync + kfree_rcu
> > > } else {
> > >    hrtimer_cancel
> > >    kfree_rcu
> > > }
> >
> > I thought kfree_rcu required struct rcu_head, and given that we need
> > to initialize sync_work it will be poisoned...
>
> yes. It needs rcu_head.
> But where do you see a conflict?
> INIT_WORK + schedule_work() will use that space,
> then cancel_work_sync() will wait on a different work_struct,
> then kfree_rcu() will reuse that space.

Yeah, sorry, I haven't realized that the memory used by kfree_rcu
wasn't initialized.

>
> In case of hrtimers none of the work_structs will be used.
>
> >
> > >
> > > > >     struct rcu_head rcu;
> > > > > +   struct work_struct sync_work;
> > > > >   };
> > > > > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
> > > >
> > > > If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> > > > (like in my v2 or v3 IIRC). But that means that once a timer is
> > > > initialized it needs to be of one or the other type (especially true
> > > > with the first union in this struct).
> > >
> > > yes. That's an idea.
> > > The code to support wq vs timer seems to be diverging more
> > > than what we expected initially.
> > > It seems cleaner to set it as init time and enforce in
> > > other helpers.
> >
> > OK, works for me.
> >
> > >
> > > Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
> > > too far.
> > > It's already at 112 bytes and some people use bpf_timer per flow.
> > > So potentially millions of such timers.
> > > Adding extra sizeof(struct work_struct)=32 * 2 that won't be
> > > used is too much.
> > > Note that sizeof(struct hrtimer)=64, so unions make everything
> > > fit nicely.
> >
> > Maybe we should do
> > union {
> >   struct hrtimer timer;
> >   struct {
> >     struct work_struct work;
> >     struct work_struct sync_work;
> >   }
> > }
>
> It's also ok, but sharing rcu_head and work_struct seems
> cleaner, since it highlights that they're exclusive.
>
> > (not nice to read but at least we don't change the size at the beginning)
> >
> > >
> > > > So should we reject during run time bpf_timer_set_callback() for
> > > > sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> > > > those? (and the invert in the other case).
> > >
> > > yes.
> > >
> > > > This version of the patch allows for one timer to be used as softIRQ
> > > > or WQ, depending on the timer_set_callback that is used. But it might
> > > > be simpler for the kfree_rcu race to define the bpf_timer to be of one
> > > > kind, so we are sure to call the correct kfree method.
> > >
> > > I think one or another simplifies the code and makes it easier
> > > to think through combinations.
> > >
> > > I'm still contemplating adding new "struct bpf_wq" and new kfuncs
> > > to completely separate wq vs timer.
> > > The code reuse seems to be relatively small.
> >
> > There is some code reuse in the verifier, but it can be factored out I think.
> >
> > Though the biggest reuse might be in the map portion of bpf_timer,
> > which I haven't looked much TBH.
>
> Right. It's all the 'case BPF_TIMER:' in various places.
> New 'struct bpf_wq' would need another entry in btf_field_type.
> But that should be a straightforward addition.
>
> >
> > > We can potentially factor out internals of bpf_timer_* into smaller
> > > helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.
> >
> > Yeah, also, given that we are going to enforce delay == 0 for
> > sleepable timers (wq), the user api would be much cleaner if we can
> > have a dedicated bpf_wq (and it would make the flags of bpf_timer_init
> > easier to deal with).
>
> It seems so.
> Kinda hard to judge one way or the other without looking at
> the final code, but it seems separation is worth attempting, at least.
>
> Also if we ever do hrtimer+wq we probably will be using
> 'struct delayed_work' instead of rolling our own
> 'struct hrtimer' + 'struct work_struct' combo.
>
> It seems wq logic already made such a combination special enough
> and thought through the races, so we better just follow that path.
> In that case it might be yet another 'struct bpf_delayed_wq'
> and another set of kfuncs.
> Considering that cancel_work() and cancel_delayed_work()
> are separate api in the kernel.
> Multiplexing all of them under bpf_timer_cancel()
> seems wrong.
> In the past we were somewhat limited in terms of helpers.
> We tried not to add them unless absolutely necessary because
> of uapi considerations.
> Now with kfuncs we can add/tweak/remove them at will.
>
> >
> > >
> > > One more thing.
> > > bpf_timer_cancel() api turned out to be troublesome.
> > > Not only it cancels the timer, but drops callback too.
> > > It was a surprising behavior for people familiar with
> > > kernel timer api-s.
> > > We should not repeat this mistake with wq.
> > >
> > > We can try to fix bpf_timer_cancel() too.
> > > If we drop drop_prog_refcnt() from it it shouldn't affect
> > > existing bpf_timer users who are forced to do:
> > > bpf_timer_cancel()
> > > bpf_timer_set_callback()
> > > bpf_timer_start()
> > > all the time.
> > > If/when bpf_timer_cancel() stops dropping the callback
> > > such bpf prog won't be affected. So low chance of breaking any prog.
> > > wdyt?
> > >
> >
> > How would a program know set_callback() is not required after a
> > cancel() because the kernel kept it around? It seems that it's going
> > to be hard for them to know that (unless by trying first a start()),
> > and it will add more code.
> >
> > timer_cancel() would be hard to change but we can always do the change
> > and add a new kfunc timer_cancel_no_drop() which would clearly allow
>
> that works too.
>
> > for new programs to know that set_callback() is not required to be
> > called. In a few kernel releases we could remove it and say that
> > timer_cancel() is the same (and replaced by a #define)
>
> #define won't work, since mechanics of detecting and calling
> helpers vs kfuncs is quite different.
>
> > Anyway, the more I think of it, the more I think the best API would be
> > a dedicated wq API. It's probably going to need a little bit more
> > work, but it'll be more or less this work plus the new bpf_wq type in
> > the map.
>
> It seems to me as well.
>
> Thanks for brainstorming.
>

Alright, as of today (and I'm about to be AFK for the weekend), I got
your changes in and working (I think). I'll review the series on
Monday and send it back so we have a baseline to compare it to with
bpf_wq.

Cheers,
Benjamin