Re: [Patch bpf-next v5 1/3] bpf: introduce timeout hash map

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 27 Jan 2021 18:45:58 -0800

On Wed, Jan 27, 2021 at 2:48 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 1/27/21 7:00 PM, Alexei Starovoitov wrote:
> > On Tue, Jan 26, 2021 at 11:00 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >>>>                ret = PTR_ERR(l_new);
> >>>> +             if (ret == -EAGAIN) {
> >>>> +                     htab_unlock_bucket(htab, b, hash, flags);
> >>>> +                     htab_gc_elem(htab, l_old);
> >>>> +                     mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
> >>>> +                     goto again;
> >>>
> >>> Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting
> >>> in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress?
> >>
> >> In this case, the old one is scheduled for removal in GC, we just wait for GC
> >> to finally remove it. It won't stall unless GC itself or the worker scheduler is
> >> wrong, both of which should be kernel bugs.
> >>
> >> If we don't do this, users would get a -E2BIG when it is not too big. I don't
> >> know a better way to handle this sad situation, maybe returning -EBUSY
> >> to users and let them call again?
> >
> > I think using wq for timers is a non-starter.
> > Tying a hash/lru map with a timer is not a good idea either.
>
> Thinking some more, given we have jiffies64 helper and atomic ops for BPF by now,
> we would technically only need the ability to delete entries via bpf iter progs
> (d6c4503cc296 ("bpf: Implement bpf iterator for hash maps")) which could then be
> kicked off from user space at e.g. dynamic intervals which would be the equivalent
> for the wq in here. That patch could then be implemented this way. I presume
> the ability to delete map entries from bpf iter progs would be generic and useful
> enough anyway.

I think bpf_iter for maps doesn't hold bucket lock anymore, so delete of the
same element should be allowed already. Unless I've mistaken wip patches
with landed ones.
Soon there will be support for bpf_map_for_each_elem() iterator
running from the bpf program itself, so even more ways to do GC like work.

> > I think timers have to be done as independent objects similar to
> > how the kernel uses them.
> > Then there will be no question whether lru or hash map needs it.
> > The bpf prog author will be able to use timers with either.
> > The prog will be able to use timers without hash maps too.
> >
> > I'm proposing a timer map where each object will go through
> > bpf_timer_setup(timer, callback, flags);
> > where "callback" is a bpf subprogram.
> > Corresponding bpf_del_timer and bpf_mod_timer would work the same way
> > they are in the kernel.
> > The tricky part is kernel style of using from_timer() to access the
> > object with additional info.
>
> Would this mean N timer objs for N map elems? I presume not given this could be
> racy and would have huge extra mem overhead.

hmm. Not sure I got the point about overhead.
sizeof(struct timer_list) == 40 bytes.
Not a lot of overhead even if there is a timer object per flow.
When bpf_map_for_each_elem() lands the bpf prog could have one
timer that will kick one a second (or whatever GC period the prog author wants)
and does bpf_map_for_each_elem() once callback fires to delete elems
older than whatever interval the prog needs.
imo that would be a true generic way to combine
bpf_maps, bpf_iters and bpf_timers.

> Either way, timer obj could work, but
> then at the same time you could probably also solve it with the above; it's not
> like you need the timer to kick in at some /exact/ time, but rather at some point
> to clean up stale entries before the map gets full and worst case refuses updates
> for new entries. (In the ideal case though we wouldn't need the extra effort to
> search deeply for elements w/o penalizing the fast-path lookup costs too much when
> walking the bucket.)

Right. I wasn't suggesting to mess with the timer object at every lookup.
My understanding that mod_timer() isn't that fast to call a million
times a second.
bpf progs would have to be smart.

Instead of timer objects as a timer map (a collection of timers that
bpf progs can manipulate)
we can introduce them similar to "struct bpf_spin_lock".
Then "struct bpf_timer foo;" field can be embedded inside any map value
(both hash and array). The verifier work would be a bit trickier than
support for bpf_spin_lock,
but certainly doable.

My main point was that bpf should grow with small composable
building blocks instead of single purpose constructs.
I view hashmap+GC as an example of something that should be split
into smaller blocks.