On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@xxxxxx> wrote:
> > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > +                                         (u64)(long)key,
> > > +                                         (u64)(long)t->value, 0, 0);
> > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> >
> > I didn't find that next patch disallows callback return value 1 in the
> > verifier. If we indeed disallows return value 1 in the verifier. We
> > don't need WARN_ON here. Did I miss anything?
>
> Ohh. I forgot to address this bit in the verifier. Will fix.
>
> > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > +             /* If the timer wasn't active or callback already executing
> > > +              * bump the prog refcnt to keep it alive until
> > > +              * callback is invoked (again).
> > > +              */
> > > +             bpf_prog_inc(t->prog);
> >
> > I am not 100% sure. But could we have race condition here?
> >    cpu 1: running bpf_timer_start() helper call
> >    cpu 2: doing hrtimer work (calling callback etc.)
> >
> > Is it possible that
> >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > true? If hrtimer_callback_running() is called, it is possible that
> > callback function could have dropped the reference count for t->prog,
> > so we could already go into the body of the function
> > __bpf_prog_put()?
>
> you're correct. Indeed there is a race.
> Circular dependency is a never ending headache.
> That's the same design mistake as with tail_calls.
> It felt that this case would be simpler than tail_calls and a bpf program
> pinning itself with bpf_prog_inc can be made to work... nope.
> I'll get rid of this and switch to something 'obviously correct'.
> Probably a link list with a lock to keep a set of init-ed timers and
> auto-cancel them on prog refcnt going to zero.
> To do 'bpf daemon' the prog would need to be pinned.

Hm..
wouldn't this eliminate that race:

    switch (hrtimer_try_to_cancel(&t->timer)) {
    case 0:
        /* nothing was queued */
        bpf_prog_inc(t->prog);
        break;
    case 1:
        /* already have refcnt and it won't be bpf_prog_put by callback */
        break;
    case -1:
        /* callback is running and will bpf_prog_put, so we need to take
         * another refcnt
         */
        bpf_prog_inc(t->prog);
        break;
    }
    hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);

So instead of guessing (racily) whether there is a queued callback or
not, try to cancel just in case there is. Then rely on the nice
guarantees that the hrtimer cancellation API provides.

Reading a bit more of the hrtimer API, I'm more concerned now with the
per-cpu variable (hrtimer_running). Seems like the timer can get
migrated from one CPU to another, so all the auxiliary per-CPU state
might get invalidated without us knowing about that.

But it's getting late, I'll think about all this a bit more tomorrow
with a fresh head.

>
> > > +     if (val) {
> > > +             /* This restriction will be removed in the next patch */
> > > +             verbose(env, "bpf_timer field can only be first in the map value element\n");
> > > +             return -EINVAL;
> > > +     }
> > > +     WARN_ON(meta->map_ptr);
> >
> > Could you explain when this could happen?
>
> Only if there is a verifier bug or new helper is added with arg to timer
> and arg to map. I'll switch to verbose() + efault instead.
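
To make the refcount accounting in the hrtimer_try_to_cancel() switch earlier in this reply easier to check, here is a minimal user-space sketch. It is not kernel code: struct model_timer and every model_* function are hypothetical names invented for illustration. The only things taken from the real API are the three return values of hrtimer_try_to_cancel() (0 = not active, 1 = was queued and got cancelled, -1 = callback currently running) and the assumption stated in the switch comments that a running callback drops one prog reference when it finishes. It is single-threaded, so it checks only the bookkeeping, not the locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of one bpf timer; not kernel code. */
struct model_timer {
    bool queued;      /* an expiry is pending */
    bool cb_running;  /* the callback is executing right now */
    int prog_refcnt;  /* stands in for the references held on t->prog */
};

/* Mimics the return values of hrtimer_try_to_cancel(). */
int model_try_to_cancel(struct model_timer *t)
{
    if (t->cb_running)
        return -1;          /* callback running, cannot be stopped */
    if (t->queued) {
        t->queued = false;  /* dequeued before it could fire */
        return 1;
    }
    return 0;               /* timer was not active */
}

/* The proposed start sequence: cancel first, then fix up the refcount. */
void model_timer_start(struct model_timer *t)
{
    switch (model_try_to_cancel(t)) {
    case 0:   /* nothing was queued, so nobody holds a reference yet */
        t->prog_refcnt++;
        break;
    case 1:   /* the cancelled expiry's reference carries over */
        break;
    case -1:  /* running callback will drop its reference; take another */
        t->prog_refcnt++;
        break;
    }
    t->queued = true;       /* stands in for hrtimer_start() */
}

/* The queued expiry fires and the callback begins executing. */
void model_timer_fire(struct model_timer *t)
{
    assert(t->queued && !t->cb_running);
    t->queued = false;
    t->cb_running = true;
}

/* The callback returns and drops the reference it was holding. */
void model_callback_done(struct model_timer *t)
{
    assert(t->cb_running);
    t->cb_running = false;
    t->prog_refcnt--;
}

/* References that are still owed a future put. */
int model_refs_owed(const struct model_timer *t)
{
    return (t->queued ? 1 : 0) + (t->cb_running ? 1 : 0);
}
```

Starting once from each of the three states (idle, queued, callback running) and then letting every expiry run to completion should bring prog_refcnt back to zero, with prog_refcnt equal to model_refs_owed() after each step. The model deliberately leaves out concurrency between CPUs, timer migration, and repeated starts while a callback is still running, so it says nothing about those cases.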