Re: [RFC Patch bpf-next] bpf: introduce bpf timer

Cong Wang <xiyou.wangcong@xxxxxxxxx> · Fri, 2 Apr 2021 13:57:29 -0700

On Fri, Apr 2, 2021 at 12:45 PM Song Liu <songliubraving@xxxxxx> wrote:
>
>
>
> > On Apr 2, 2021, at 12:08 PM, Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > On Fri, Apr 2, 2021 at 10:57 AM Song Liu <songliubraving@xxxxxx> wrote:
> >>
> >>
> >>
> >>> On Apr 2, 2021, at 10:34 AM, Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >>>
> >>> On Thu, Apr 1, 2021 at 1:17 PM Song Liu <songliubraving@xxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Apr 1, 2021, at 10:28 AM, Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >>>>>
> >>>>> On Wed, Mar 31, 2021 at 11:38 PM Song Liu <songliubraving@xxxxxx> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Mar 31, 2021, at 9:26 PM, Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> >>>>>>>
> >>>>>>> (This patch is still in early stage and obviously incomplete. I am sending
> >>>>>>> it out to get some high-level feedback. Please kindly ignore any coding
> >>>>>>> details for now and focus on the design.)
> >>>>>>
> >>>>>> Could you please explain the use case of the timer? Is it the same as
> >>>>>> the earlier proposal of BPF_MAP_TYPE_TIMEOUT_HASH?
> >>>>>>
> >>>>>> Assuming that is the case, I guess the use case is to assign an expiry
> >>>>>> time to each element in a hash map, and periodically remove expired
> >>>>>> elements from the map.
> >>>>>>
> >>>>>> If this is still correct, my next question is: how does this compare
> >>>>>> against a user space timer? Will the user space timer be too slow?
> >>>>>
> >>>>> Yes, as I explained in the timeout hashmap patchset, doing it in user-space
> >>>>> would require a lot of syscalls (without batching) or copying (with batching).
> >>>>> I will add the explanation here, in case people miss why we need a timer.
> >>>>
> >>>> How about we use a user space timer to trigger a BPF program (e.g. use
> >>>> BPF_PROG_TEST_RUN on a raw_tp program); then, in the BPF program, we can
> >>>> use bpf_for_each_map_elem and bpf_map_delete_elem to scan and update the
> >>>> map? With this approach, we only need one syscall per period.
> >>>
> >>> Interesting, I didn't know we can explicitly trigger a BPF program to run
> >>> from user-space. Is it for testing purposes only?
> >>
> >> This is not only for testing. We will use this in perf (starting in 5.13).
> >>
> >> /* currently in Arnaldo's tree, tools/perf/util/bpf_counter.c: */
> >>
> >> /* trigger the leader program on a cpu */
> >> static int bperf_trigger_reading(int prog_fd, int cpu)
> >> {
> >>        DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
> >>                            .ctx_in = NULL,
> >>                            .ctx_size_in = 0,
> >>                            .flags = BPF_F_TEST_RUN_ON_CPU,
> >>                            .cpu = cpu,
> >>                            .retval = 0,
> >>                );
> >>
> >>        return bpf_prog_test_run_opts(prog_fd, &opts);
> >> }
> >>
> >> test_run also passes the return value (retval) back to user space, so we can
> >> adjust the timer interval based on retval.
> >
> > This is really odd, every name here contains a "test" but it is not for testing
> > purposes. You probably need to rename/alias it. ;)
> >
> > So, with this we have to get a user-space daemon running just to keep
> > this "timer" alive. If I want to run it every 1ms, it means I have to issue
> > a syscall BPF_PROG_TEST_RUN every 1ms. Even with a timer fd, we
> > still need poll() and timerfd_settime(). This is a considerable overhead
> > for just a single timer.
>
> sys_bpf() takes about 0.5us. I would expect poll() and timerfd_settime() to
> be slightly faster. So the overhead is less than 0.2% of a single core
> (0.5us x 3 / 1ms). Do we need many counters for conntrack?

This is just for one timer. The whole system may end up with many timers
as we get more and more eBPF programs, so managing the timers in
user-space would become a problem too someday. Clearly, one daemon
per timer does not scale.
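
As a back-of-the-envelope sketch of how this adds up: the helper below just
generalizes the arithmetic quoted above to N timers. It assumes the 0.5us
estimate holds for all three syscalls and that every timer is driven by its
own poll() + timerfd_settime() + BPF_PROG_TEST_RUN cycle. The function name
is made up for illustration.

```c
#include <assert.h>

/*
 * Rough share of one core, in percent, spent on syscalls by timers
 * that are driven from user-space. Assumes each timer issues
 * syscalls_per_tick syscalls per period and that every syscall costs
 * syscall_us microseconds (an estimate, not a measurement).
 */
static double timer_syscall_overhead_pct(unsigned int nr_timers,
					 unsigned int syscalls_per_tick,
					 double syscall_us,
					 double period_us)
{
	assert(period_us > 0.0);
	return 100.0 * nr_timers * syscalls_per_tick * syscall_us / period_us;
}
```

With one 1ms timer this gives the 0.15% figure from above. The cost is
linear in the number of timers, so 100 such timers would already be around
15% of a core, and that is before counting wakeups and context switches.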

>
> >
> > With the current design, user-space can just exit after installing the
> > timer; either the timer can adjust itself or other eBPF code can adjust
> > it, so the per-timer overhead is the same as a kernel timer.
>
> I guess we still need to hold a fd to the prog/map? Alternatively, we can
> pin the prog/map, but then the user needs to clean it up.

Yes, but I don't see how holding a fd could bring any overhead after
the initial setup.

>
> >
> > The visibility to other BPF code is important for the conntrack case,
> > because each time we get an expired item during a lookup, we may
> > want to schedule the GC timer to run sooner. At least this would give
> > users more freedom to decide when to reschedule the timer.
>
> Do we plan to share the timer program among multiple processes (which can
> start and terminate in arbitrary orders)? If that is the case, I can imagine
> a timer program is better than a user space timer.

I mean I want other eBPF programs to manage the timers in kernel-space,
as conntrack is mostly in kernel-space. If the timer is only manageable
in user-space, it would seriously limit its use cases.
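
To illustrate the conntrack case, here is a minimal model in plain C of the
logic I have in mind: a lookup that hits an expired entry pulls the GC
deadline forward. This is only a sketch of the idea, not the API of this
patch or of any existing kernel interface; every struct and function name
below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model only: none of these names come from the patch. */
struct gc_timer {
	uint64_t next_run_ns;	/* when the GC callback is due to fire */
};

struct ct_entry {
	uint64_t expires_ns;	/* absolute expiry time of this entry */
};

/* Move the GC deadline earlier, never later. Returns true if it moved. */
static bool gc_timer_pull_forward(struct gc_timer *t, uint64_t new_run_ns)
{
	if (new_run_ns >= t->next_run_ns)
		return false;
	t->next_run_ns = new_run_ns;
	return true;
}

/*
 * Lookup path: an expired entry counts as a miss, and as a hint that
 * garbage has piled up, so GC is asked to run min_delay_ns from now.
 */
static bool ct_lookup_live(const struct ct_entry *e, uint64_t now_ns,
			   struct gc_timer *gc, uint64_t min_delay_ns)
{
	assert(e && gc);
	if (e->expires_ns > now_ns)
		return true;
	gc_timer_pull_forward(gc, now_ns + min_delay_ns);
	return false;
}
```

The point is that the decision is made on the lookup path itself, in
kernel-space, without a round trip to a user-space daemon.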

Ideally I would even prefer to create timers in kernel-space too, but as I
already explained, this seems impossible to me.

Thanks.