On Feb 16 2024, Toke Høiland-Jørgensen wrote: > Benjamin Tissoires <bentiss@xxxxxxxxxx> writes: > > > On Feb 15 2024, Martin KaFai Lau wrote: > >> On 2/14/24 9:18 AM, Benjamin Tissoires wrote: > >> > +static void bpf_timer_work_cb(struct work_struct *work) > >> > +{ > >> > + struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work); > >> > + struct bpf_map *map = t->map; > >> > + void *value = t->value; > >> > + bpf_callback_t callback_fn; > >> > + void *key; > >> > + u32 idx; > >> > + > >> > + BTF_TYPE_EMIT(struct bpf_timer); > >> > + > >> > + rcu_read_lock(); > >> > + callback_fn = rcu_dereference(t->sleepable_cb_fn); > >> > + rcu_read_unlock(); > >> > >> I took a very brief look at patch 2. One thing that may worth to ask here, > >> the rcu_read_unlock() seems to be done too early. It is protecting the > >> t->sleepable_cb_fn (?), so should it be done after finished using the > >> callback_fn? > > > > Probably :) > > > > TBH, everytime I work with RCUs I spent countless hours trying to > > re-understand everything, and in this case I'm currently in the "let's > > make it work" process than fixing concurrency issues. > > I still gave it a shot in case it solves my issue, but no, I still have > > the crash. > > > > But given that callback_fn might sleep, isn't it an issue to keep the > > RCU_reader lock so long? (we don't seem to call synchronize_rcu() so it > > might be fine, but I'd like the confirmation from someone else). > > You're right, it isn't. From the RCU/checklist.rst doc: > > 13. Unlike most flavors of RCU, it *is* permissible to block in an > SRCU read-side critical section (demarked by srcu_read_lock() > and srcu_read_unlock()), hence the "SRCU": "sleepable RCU". > Please note that if you don't need to sleep in read-side critical > sections, you should be using RCU rather than SRCU, because RCU > is almost always faster and easier to use than is SRCU. > > So we can't use the regular RCU protection for the callback in this > usage. We'll need to either convert it to SRCU, or add another > protection mechanism to make sure the callback function is not freed > from under us (like a refcnt). I suspect the latter may be simpler (from > reading the rest of that documentation around SRCU. Currently I'm thinking at also incrementing the ->prog held in the bpf_hrtimer which should prevent the callback to be freed, if I'm not wrong. Then I should be able to just release the rcu_read_unlock before calling the actual callback. And then put the ref on ->prog once done. But to be able to do that I might need to protect ->prog with an RCU too. > > >> A high level design question. The intention of the new > >> bpf_timer_set_sleepable_cb() kfunc is actually to delay work to a workqueue. > >> It is useful to delay work from the bpf_timer_cb and it may also useful to > >> delay work from other bpf running context (e.g. the networking hooks like > >> "tc"). The bpf_timer_set_sleepable_cb() seems to be unnecessary forcing > >> delay-work must be done in a bpf_timer_cb. > > > > Basically I'm just a monkey here. I've been told that I should use > > bpf_timer[0]. But my implementation is not finished, as Alexei mentioned > > that we should bypass hrtimer if I'm not wrong [1]. > > I don't think getting rid of the hrtimer in favour of > schedule_delayed_work() makes any sense. schedule_delayed_work() does > exactly the same as you're doing in this version of the patch: it > schedules a timer callback, and calls queue_work() from inside that > timer callback. It just uses "regular" timers instead of hrtimers. So I > don't think there's any performance benefit from using that facility; on > the contrary, it would require extra logic to handle cancellation etc; > might as well just re-use the existing hrtimer-based callback logic we > already have, and do a schedule_work() from the hrtimer callback like > you're doing now. I agree that we can nicely emulate delayed_timer with the current patch series. However, if I understand Alexei's idea (and Martin's) there are cases where we just want schedule_work(), without any timer involved. That makes a weird timer (with a delay always equal to 0), but it would allow to satisfy those latency issues. So (and this also answers your second email today) I'm thinking at: - have multiple flags to control the timer (with dedicated timer_cb kernel functions): - BPF_F_TIMER_HRTIMER (default) - BPF_F_TIMER_WORKER (no timer, just workqueue) - BPF_F_TIMER_DELAYED_WORKER (hrtimer + workqueue, or actual delayed_work, but that's re-implementing stuffs) - have bpf_timer_set_callback() and bpf_timer_set_sleepable_cb() strictly equivalent in terms of processing, but the latter just instructs the verifier that the callback can be sleepable (so calling bpf_timer_set_callback() on a BPF_F_TIMER_DELAYED_WORKER is fine as long as the target callback is using non sleepable kfuncs). - ensure that bpf_timer_set_sleepable_cb() is invalid when called with a non sleepable timer flag. - ensure that when the bpf_timer has no hrtimer we also return -EINVAL when a delay is given in bpf_timer_start(). Actually, BPF_F_TIMER_DELAYED_WORKER could be just a combination of (BPF_F_TIMER_HRTIMER | BPF_F_TIMER_WORKER) which would reflect the reality of how things are implemented. Thinking through this a little bit more, maybe we should have BPF_F_TIMER_SLEEPABLE instead of BPF_F_TIMER_WORKER. We can still have BPF_F_TIMER_SLEEPABLE_BH when WQ_BH gets merged. _SLEEPABLE allows to not bind the flag to the implementation... Cheers, Benjamin