Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

Cong Wang <xiyou.wangcong@xxxxxxxxx> · Tue, 15 Dec 2020 16:22:21 -0800

On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
> > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >>
> >> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> >> <andrii.nakryiko@xxxxxxxxx> wrote:
> >>>
> >>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >>>>
> >>>> From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> >>>>
> >>>> This borrows the idea from conntrack and will be used for conntrack in
> >>>> bpf too. Each element in a timeout map has a user-specified timeout
> >>>> in secs, after it expires it will be automatically removed from the map.
> [...]
> >>>>          char key[] __aligned(8);
> >>>>   };
> >>>>
> >>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
> >>>>
> >>>>          for (i = 0; i < htab->n_buckets; i++) {
> >>>>                  INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> >>>> +               atomic_set(&htab->buckets[i].pending, 0);
> >>>>                  if (htab_use_raw_lock(htab)) {
> >>>>                          raw_spin_lock_init(&htab->buckets[i].raw_lock);
> >>>>                          lockdep_set_class(&htab->buckets[i].raw_lock,
> >>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> >>>>          return 0;
> >>>>   }
> >>>>
> >>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
> >>>> +{
> >>>> +       if (atomic_fetch_or(1, &b->pending))
> >>>> +               return;
> >>>> +       llist_add(&b->gc_node, &htab->gc_list);
> >>>> +       queue_work(system_unbound_wq, &htab->gc_work);
> >>>> +}
> >>>
> >>> I'm concerned about each bucket being scheduled individually... And
> >>> similarly concerned that each instance of TIMEOUT_HASH will do its own
> >>> scheduling independently. Can you think about the way to have a
> >>> "global" gc/purging logic, and just make sure that buckets that need
> >>> processing would be just internally chained together. So the purging
> >>> routing would iterate all the scheduled hashmaps, and within each it
> >>> will have a linked list of buckets that need processing? And all that
> >>> is done just once each GC period. Not N times for N maps or N*M times
> >>> for N maps with M buckets in each.
> >>
> >> Our internal discussion went to the opposite actually, people here argued
> >> one work is not sufficient for a hashtable because there would be millions
> >> of entries (max_entries, which is also number of buckets). ;)
> >
> > I was hoping that it's possible to expire elements without iterating
> > the entire hash table every single time, only items that need to be
> > processed. Hashed timing wheel is one way to do something like this,
> > kernel has to solve similar problems with timeouts as well, why not
> > taking inspiration there?
>
> Couldn't this map be coupled with LRU map for example through flag on map
> creation so that the different LRU map flavors can be used with it? For BPF
> CT use case we do rely on LRU map to purge 'inactive' entries once full. I
> wonder if for that case you then still need to schedule a GC at all.. e.g.
> if you hit the condition time_after_eq64(now, entry->expires) you'd just
> re-link the expired element from the public htab to e.g. the LRU's local
> CPU's free/pending-list instead.

I doubt we can use size as a limit to kick off GC or LRU, it must be
time-based. And in case of idle, there has to be an async GC, right?

Thanks.