Re: [PATCH RFC bpf-next v2 02/10] bpf/helpers: introduce sleepable timers

Benjamin Tissoires <bentiss@xxxxxxxxxx> · Fri, 16 Feb 2024 15:58:20 +0100

On Feb 16 2024, Toke Høiland-Jørgensen wrote:
> Benjamin Tissoires <bentiss@xxxxxxxxxx> writes:
> 
> > On Feb 15 2024, Martin KaFai Lau wrote:
> >> On 2/14/24 9:18 AM, Benjamin Tissoires wrote:
> >> > +static void bpf_timer_work_cb(struct work_struct *work)
> >> > +{
> >> > +	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
> >> > +	struct bpf_map *map = t->map;
> >> > +	void *value = t->value;
> >> > +	bpf_callback_t callback_fn;
> >> > +	void *key;
> >> > +	u32 idx;
> >> > +
> >> > +	BTF_TYPE_EMIT(struct bpf_timer);
> >> > +
> >> > +	rcu_read_lock();
> >> > +	callback_fn = rcu_dereference(t->sleepable_cb_fn);
> >> > +	rcu_read_unlock();
> >> 
> >> I took a very brief look at patch 2. One thing that may worth to ask here,
> >> the rcu_read_unlock() seems to be done too early. It is protecting the
> >> t->sleepable_cb_fn (?), so should it be done after finished using the
> >> callback_fn?
> >
> > Probably :)
> >
> > TBH, everytime I work with RCUs I spent countless hours trying to
> > re-understand everything, and in this case I'm currently in the "let's
> > make it work" process than fixing concurrency issues.
> > I still gave it a shot in case it solves my issue, but no, I still have
> > the crash.
> >
> > But given that callback_fn might sleep, isn't it an issue to keep the
> > RCU_reader lock so long? (we don't seem to call synchronize_rcu() so it
> > might be fine, but I'd like the confirmation from someone else).
> 
> You're right, it isn't. From the RCU/checklist.rst doc:
> 
> 13.	Unlike most flavors of RCU, it *is* permissible to block in an
> 	SRCU read-side critical section (demarked by srcu_read_lock()
> 	and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
> 	Please note that if you don't need to sleep in read-side critical
> 	sections, you should be using RCU rather than SRCU, because RCU
> 	is almost always faster and easier to use than is SRCU.
> 
> So we can't use the regular RCU protection for the callback in this
> usage. We'll need to either convert it to SRCU, or add another
> protection mechanism to make sure the callback function is not freed
> from under us (like a refcnt). I suspect the latter may be simpler (from
> reading the rest of that documentation around SRCU.

Currently I'm thinking at also incrementing the ->prog held in the
bpf_hrtimer which should prevent the callback to be freed, if I'm not wrong.
Then I should be able to just release the rcu_read_unlock before calling
the actual callback. And then put the ref on ->prog once done.

But to be able to do that I might need to protect ->prog with an RCU
too.

> 
> >> A high level design question. The intention of the new
> >> bpf_timer_set_sleepable_cb() kfunc is actually to delay work to a workqueue.
> >> It is useful to delay work from the bpf_timer_cb and it may also useful to
> >> delay work from other bpf running context (e.g. the networking hooks like
> >> "tc"). The bpf_timer_set_sleepable_cb() seems to be unnecessary forcing
> >> delay-work must be done in a bpf_timer_cb.
> >
> > Basically I'm just a monkey here. I've been told that I should use
> > bpf_timer[0]. But my implementation is not finished, as Alexei mentioned
> > that we should bypass hrtimer if I'm not wrong [1].
> 
> I don't think getting rid of the hrtimer in favour of
> schedule_delayed_work() makes any sense. schedule_delayed_work() does
> exactly the same as you're doing in this version of the patch: it
> schedules a timer callback, and calls queue_work() from inside that
> timer callback. It just uses "regular" timers instead of hrtimers. So I
> don't think there's any performance benefit from using that facility; on
> the contrary, it would require extra logic to handle cancellation etc;
> might as well just re-use the existing hrtimer-based callback logic we
> already have, and do a schedule_work() from the hrtimer callback like
> you're doing now.

I agree that we can nicely emulate delayed_timer with the current patch
series. However, if I understand Alexei's idea (and Martin's) there are
cases where we just want schedule_work(), without any timer involved.
That makes a weird timer (with a delay always equal to 0), but it would
allow to satisfy those latency issues.

So (and this also answers your second email today) I'm thinking at:
- have multiple flags to control the timer (with dedicated timer_cb
  kernel functions):
  - BPF_F_TIMER_HRTIMER (default)
  - BPF_F_TIMER_WORKER (no timer, just workqueue)
  - BPF_F_TIMER_DELAYED_WORKER (hrtimer + workqueue, or actual
    delayed_work, but that's re-implementing stuffs)
- have bpf_timer_set_callback() and bpf_timer_set_sleepable_cb() strictly
  equivalent in terms of processing, but the latter just instructs the
  verifier that the callback can be sleepable (so calling
  bpf_timer_set_callback() on a BPF_F_TIMER_DELAYED_WORKER is fine as
  long as the target callback is using non sleepable kfuncs).
- ensure that bpf_timer_set_sleepable_cb() is invalid when called with
  a non sleepable timer flag.
- ensure that when the bpf_timer has no hrtimer we also return -EINVAL
  when a delay is given in bpf_timer_start().

Actually, BPF_F_TIMER_DELAYED_WORKER could be just a combination of
(BPF_F_TIMER_HRTIMER | BPF_F_TIMER_WORKER) which would reflect the
reality of how things are implemented.

Thinking through this a little bit more, maybe we should have
BPF_F_TIMER_SLEEPABLE instead of BPF_F_TIMER_WORKER. We can still have
BPF_F_TIMER_SLEEPABLE_BH when WQ_BH gets merged. _SLEEPABLE allows to
not bind the flag to the implementation...

Cheers,
Benjamin