On Tue, Dec 15, 2020 at 5:14 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Dec 15, 2020 at 04:22:21PM -0800, Cong Wang wrote:
> > On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> > >
> > > On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
> > > > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > >>
> > > >> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> > > >> <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >>>
> > > >>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > >>>>
> > > >>>> From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > > >>>>
> > > >>>> This borrows the idea from conntrack and will be used for conntrack in
> > > >>>> bpf too. Each element in a timeout map has a user-specified timeout
> > > >>>> in secs; after it expires, it will be automatically removed from the map.
> > > [...]
> > > >>>>         char key[] __aligned(8);
> > > >>>>  };
> > > >>>>
> > > >>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
> > > >>>>
> > > >>>>         for (i = 0; i < htab->n_buckets; i++) {
> > > >>>>                 INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> > > >>>> +               atomic_set(&htab->buckets[i].pending, 0);
> > > >>>>                 if (htab_use_raw_lock(htab)) {
> > > >>>>                         raw_spin_lock_init(&htab->buckets[i].raw_lock);
> > > >>>>                         lockdep_set_class(&htab->buckets[i].raw_lock,
> > > >>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> > > >>>>         return 0;
> > > >>>>  }
> > > >>>>
> > > >>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
> > > >>>> +{
> > > >>>> +       if (atomic_fetch_or(1, &b->pending))
> > > >>>> +               return;
> > > >>>> +       llist_add(&b->gc_node, &htab->gc_list);
> > > >>>> +       queue_work(system_unbound_wq, &htab->gc_work);
> > > >>>> +}
> > > >>>
> > > >>> I'm concerned about each bucket being scheduled individually... And
> > > >>> similarly concerned that each instance of TIMEOUT_HASH will do its own
> > > >>> scheduling independently. Can you think about a way to have a
> > > >>> "global" gc/purging logic, and just make sure that buckets that need
> > > >>> processing are internally chained together? The purging
> > > >>> routine would iterate all the scheduled hashmaps, and within each it
> > > >>> will have a linked list of buckets that need processing. And all that
> > > >>> is done just once each GC period. Not N times for N maps or N*M times
> > > >>> for N maps with M buckets in each.
> > > >>
> > > >> Our internal discussion actually went the opposite way; people here argued
> > > >> one work is not sufficient for a hashtable because there would be millions
> > > >> of entries (max_entries, which is also the number of buckets). ;)
> > > >
> > > > I was hoping that it's possible to expire elements without iterating
> > > > the entire hash table every single time, only items that need to be
> > > > processed. A hashed timing wheel is one way to do something like this;
> > > > the kernel has to solve similar problems with timeouts as well, so why not
> > > > take inspiration there?
> > >
> > > Couldn't this map be coupled with the LRU map, for example through a flag on map
> > > creation, so that the different LRU map flavors can be used with it? For the BPF
> > > CT use case we do rely on the LRU map to purge 'inactive' entries once full. I
> > > wonder if for that case you then still need to schedule a GC at all.. e.g.
> > > if you hit the condition time_after_eq64(now, entry->expires) you'd just
> > > re-link the expired element from the public htab to e.g. the LRU's local
> > > CPU's free/pending-list instead.
> >
> > I doubt we can use size as a limit to kick off GC or LRU, it must be
> > time-based. And in case of idle, there has to be an async GC, right?
>
> Why does it have to be time based?

Because that is how a session times out? For instance, CT uses
nf_conntrack_udp_timeout to time out UDP sessions. Or are we going to
redefine conntrack?

> Why is LRU alone not enough?
> People have implemented a conntracker in bpf using the LRU map.

Sure, and people also implement CT on the native hash map and handle
timeouts with user-space timers. ;)

> Anything extra can be added on top from user space
> which can easily cope with 1 sec granularity.

The problem is never about granularity, it is about how efficiently we
can GC. User space has to scan the whole table one entry at a time, while
the kernel can do this behind the scenes with much lower overhead (a rough
sketch of such a scan is appended below).

Let's say we arm a timer for each entry in user space: that requires a
syscall and bucket locking for each entry. The kernel could do it without
any additional syscalls and could batch the work. Like I said above, we
could have millions of entries, so the overhead would be big in this
scenario.

> Say the kernel does GC and deletes htab entries.
> How will user space know that it's gone? There would need to be

By a lookup.

> an event sent to user space when an entry is being deleted by the kernel.
> But then such an event will be racy. Instead, when timers and expirations
> are done by user space, everything is in sync.

Why does there have to be an event?

Thanks.
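P.S. For illustration, below is a rough sketch of the user-space expiry
scan described above, written against libbpf's bpf_map_get_next_key(),
bpf_map_lookup_elem() and bpf_map_delete_elem(). The flow_key/flow_val
layouts and the 'expires' field are made up for the example; the point is
only that every entry costs a couple of syscalls per scan, on top of the
bucket locking each map operation takes in the kernel.

#include <time.h>
#include <linux/types.h>
#include <bpf/bpf.h>

/* Hypothetical key/value layouts; a real CT map would carry more state. */
struct flow_key {
        __u32 saddr;
        __u32 daddr;
        __u16 sport;
        __u16 dport;
};

struct flow_val {
        __u64 expires;  /* absolute CLOCK_MONOTONIC deadline, in ns */
};

/* Walk the whole map and delete entries whose deadline has passed.
 * The next key is fetched before a potential delete so that removing
 * the current key does not restart the hash table iteration.
 */
static void scan_expired(int map_fd)
{
        struct flow_key key, next;
        struct flow_val val;
        struct timespec ts;
        __u64 now;
        int has_next;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        now = (__u64)ts.tv_sec * 1000000000ULL + ts.tv_nsec;

        if (bpf_map_get_next_key(map_fd, NULL, &key))  /* empty map */
                return;

        do {
                has_next = !bpf_map_get_next_key(map_fd, &key, &next);

                if (!bpf_map_lookup_elem(map_fd, &key, &val) &&
                    val.expires <= now)
                        bpf_map_delete_elem(map_fd, &key);

                key = next;
        } while (has_next);
}

Something like this has to run periodically (or arm one timer per entry),
which is the per-entry syscall and locking overhead mentioned above; a
kernel-side GC can batch the same work per bucket without any syscalls.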