Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 15 Dec 2020 17:14:22 -0800

On Tue, Dec 15, 2020 at 04:22:21PM -0800, Cong Wang wrote:
> On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >
> > On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
> > > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > >>
> > >> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> > >> <andrii.nakryiko@xxxxxxxxx> wrote:
> > >>>
> > >>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > >>>>
> > >>>> From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > >>>>
> > >>>> This borrows the idea from conntrack and will be used for conntrack in
> > >>>> bpf too. Each element in a timeout map has a user-specified timeout
> > >>>> in secs, after it expires it will be automatically removed from the map.
> > [...]
> > >>>>          char key[] __aligned(8);
> > >>>>   };
> > >>>>
> > >>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
> > >>>>
> > >>>>          for (i = 0; i < htab->n_buckets; i++) {
> > >>>>                  INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> > >>>> +               atomic_set(&htab->buckets[i].pending, 0);
> > >>>>                  if (htab_use_raw_lock(htab)) {
> > >>>>                          raw_spin_lock_init(&htab->buckets[i].raw_lock);
> > >>>>                          lockdep_set_class(&htab->buckets[i].raw_lock,
> > >>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> > >>>>          return 0;
> > >>>>   }
> > >>>>
> > >>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
> > >>>> +{
> > >>>> +       if (atomic_fetch_or(1, &b->pending))
> > >>>> +               return;
> > >>>> +       llist_add(&b->gc_node, &htab->gc_list);
> > >>>> +       queue_work(system_unbound_wq, &htab->gc_work);
> > >>>> +}
> > >>>
> > >>> I'm concerned about each bucket being scheduled individually... And
> > >>> similarly concerned that each instance of TIMEOUT_HASH will do its own
> > >>> scheduling independently. Can you think about the way to have a
> > >>> "global" gc/purging logic, and just make sure that buckets that need
> > >>> processing would be just internally chained together. So the purging
> > >>> routing would iterate all the scheduled hashmaps, and within each it
> > >>> will have a linked list of buckets that need processing? And all that
> > >>> is done just once each GC period. Not N times for N maps or N*M times
> > >>> for N maps with M buckets in each.
> > >>
> > >> Our internal discussion went to the opposite actually, people here argued
> > >> one work is not sufficient for a hashtable because there would be millions
> > >> of entries (max_entries, which is also number of buckets). ;)
> > >
> > > I was hoping that it's possible to expire elements without iterating
> > > the entire hash table every single time, only items that need to be
> > > processed. Hashed timing wheel is one way to do something like this,
> > > kernel has to solve similar problems with timeouts as well, why not
> > > taking inspiration there?
> >
> > Couldn't this map be coupled with LRU map for example through flag on map
> > creation so that the different LRU map flavors can be used with it? For BPF
> > CT use case we do rely on LRU map to purge 'inactive' entries once full. I
> > wonder if for that case you then still need to schedule a GC at all.. e.g.
> > if you hit the condition time_after_eq64(now, entry->expires) you'd just
> > re-link the expired element from the public htab to e.g. the LRU's local
> > CPU's free/pending-list instead.
> 
> I doubt we can use size as a limit to kick off GC or LRU, it must be
> time-based. And in case of idle, there has to be an async GC, right?

Why does it have to be time based?
Why LRU alone is not enough?
People implemented conntracker in bpf using LRU map.
Anything extra can be added on top from user space
which can easily copy with 1 sec granularity.
Say the kernel does GC and deletes htab entries.
How user space will know that it's gone? There would need to be
an event sent to user space when entry is being deleted by the kernel.
But then such event will be racy. Instead when timers and expirations
are done by user space everything is in sync.