Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

On 12/16/20 1:22 AM, Cong Wang wrote:
On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:

On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:

On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:

On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:

From: Cong Wang <cong.wang@xxxxxxxxxxxxx>

This borrows the idea from conntrack and will be used for conntrack in
bpf too. Each element in a timeout map has a user-specified timeout
in seconds; after it expires, the element is automatically removed from the map.
[...]
          char key[] __aligned(8);
   };

@@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)

          for (i = 0; i < htab->n_buckets; i++) {
                  INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
+               atomic_set(&htab->buckets[i].pending, 0);
                  if (htab_use_raw_lock(htab)) {
                          raw_spin_lock_init(&htab->buckets[i].raw_lock);
                          lockdep_set_class(&htab->buckets[i].raw_lock,
@@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
          return 0;
   }

+static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
+{
+       if (atomic_fetch_or(1, &b->pending))
+               return;
+       llist_add(&b->gc_node, &htab->gc_list);
+       queue_work(system_unbound_wq, &htab->gc_work);
+}
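
For context, the worker side that drains this gc_list is not shown in the
quoted hunk. A rough sketch of what such a worker might look like follows;
the worker name, the per-element 'expires' field, and the direct use of the
raw bucket lock are assumptions for illustration, not the patch itself.

/* Sketch only: an assumed GC worker paired with htab_sched_gc() above. */
static void htab_gc_work(struct work_struct *work)
{
	struct bpf_htab *htab = container_of(work, struct bpf_htab, gc_work);
	struct llist_node *list = llist_del_all(&htab->gc_list);
	struct bucket *b, *btmp;

	llist_for_each_entry_safe(b, btmp, list, gc_node) {
		struct hlist_nulls_node *n;
		struct htab_elem *e;
		unsigned long flags;
		u64 now = get_jiffies_64();

		/* Allow this bucket to be scheduled for GC again. */
		atomic_set(&b->pending, 0);

		/* Real code would go through the htab bucket-lock helpers;
		 * the raw lock is used here only to keep the sketch short.
		 */
		raw_spin_lock_irqsave(&b->raw_lock, flags);
		hlist_nulls_for_each_entry_safe(e, n, &b->head, hash_node) {
			/* 'expires' is the assumed absolute expiry, in jiffies. */
			if (time_after_eq64(now, e->expires)) {
				hlist_nulls_del_rcu(&e->hash_node);
				free_htab_elem(htab, e);
			}
		}
		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
	}
}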

I'm concerned about each bucket being scheduled individually... And
similarly concerned that each instance of TIMEOUT_HASH will do its own
scheduling independently. Can you think about a way to have a
"global" gc/purging logic, and just make sure that buckets that need
processing are internally chained together? The purging routine would
then iterate over all the scheduled hashmaps, and within each one walk a
linked list of the buckets that need processing. And all of that would be
done just once each GC period, not N times for N maps or N*M times
for N maps with M buckets in each.
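
For illustration, that kind of "global" scheme could look roughly like the
sketch below: one periodic worker, a global list of maps with pending work,
and per-map lists of only the buckets that need attention. Every name here
(gc_maps, gc_map_node, gc_pending_buckets, htab_gc_bucket, GC_PERIOD) is
made up for the sketch and is not part of the patch.

/* Sketch only: a single global GC pass instead of per-map/per-bucket work. */
static void timeout_map_gc(struct work_struct *work);

static LLIST_HEAD(gc_maps);			/* maps that have expired entries */
static DECLARE_DELAYED_WORK(gc_dwork, timeout_map_gc);

#define GC_PERIOD	HZ			/* one purging pass per period */

static void timeout_map_gc(struct work_struct *work)
{
	struct llist_node *maps = llist_del_all(&gc_maps);
	struct bpf_htab *htab, *htmp;

	llist_for_each_entry_safe(htab, htmp, maps, gc_map_node) {
		struct llist_node *buckets;
		struct bucket *b, *btmp;

		/* Only the buckets that were flagged, not all n_buckets. */
		buckets = llist_del_all(&htab->gc_pending_buckets);
		llist_for_each_entry_safe(b, btmp, buckets, gc_node)
			htab_gc_bucket(htab, b);	/* hypothetical per-bucket expiry */
	}

	/* Re-arm: one pass per GC period, regardless of the number of maps. */
	schedule_delayed_work(&gc_dwork, GC_PERIOD);
}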

Our internal discussion actually went in the opposite direction: people here
argued that a single work item is not sufficient for one hashtable, because
there can be millions of entries (max_entries, which is also the number of
buckets). ;)

I was hoping that it's possible to expire elements without iterating
the entire hash table every single time, touching only the items that need
to be processed. A hashed timing wheel is one way to do something like this;
the kernel has to solve similar timeout problems elsewhere as well, so why
not take inspiration from there?
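
A hashed timing wheel, very roughly, buckets elements by expiry time so that
a periodic tick only touches the slots that just became due rather than
scanning the whole table. The toy sketch below is meant only to illustrate
that idea; the types, the slot count, and expire_elem() are all invented for
the example.

#define WHEEL_BITS	8
#define WHEEL_SLOTS	(1 << WHEEL_BITS)	/* one slot per jiffy of horizon */

struct wheel {
	struct llist_head slot[WHEEL_SLOTS];
	u64 last_tick;				/* last jiffy already expired */
};

struct wheel_elem {
	struct llist_node node;
	u64 expires;				/* absolute expiry, in jiffies */
};

static void wheel_add(struct wheel *w, struct wheel_elem *e)
{
	/* The slot index is derived from the expiry time itself. */
	llist_add(&e->node, &w->slot[e->expires & (WHEEL_SLOTS - 1)]);
}

static void wheel_tick(struct wheel *w, u64 now)
{
	/* Visit only the slots that became due since the last tick. */
	for (; time_after_eq64(now, w->last_tick); w->last_tick++) {
		struct llist_node *list;
		struct wheel_elem *e, *tmp;

		list = llist_del_all(&w->slot[w->last_tick & (WHEEL_SLOTS - 1)]);
		llist_for_each_entry_safe(e, tmp, list, node) {
			if (time_after_eq64(now, e->expires))
				expire_elem(e);	/* hypothetical: unlink from the htab */
			else
				wheel_add(w, e);	/* wrapped around, not due yet */
		}
	}
}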

Couldn't this map be coupled with the LRU map, for example through a flag on
map creation, so that the different LRU map flavors can be used with it? For
the BPF CT use case we do rely on the LRU map to purge 'inactive' entries once
the map is full. I wonder whether, for that case, you would then still need to
schedule a GC at all... e.g. when you hit the condition
time_after_eq64(now, entry->expires), you'd just re-link the expired element
from the public htab to, e.g., the LRU's local CPU free/pending list instead.

I doubt we can use size as the criterion to kick off GC or LRU eviction; it has
to be time-based. And for the idle case, there has to be an async GC, right?

I was thinking of no GC at all, meaning the above-mentioned re-linking of
expired elements would be done lazily, e.g. whenever we walk a given bucket
(on lookup/update/delete), under the assumption that we don't have deep lists
there, so the time comparison stays cheap, and that element migration has low
overhead (e.g. a move to the local CPU free-list).
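
For what it's worth, that lazy variant (re-linking expired elements to the
LRU's local free list while walking a bucket) might look roughly like the
sketch below. It assumes the caller already holds the bucket lock (i.e. the
update/delete paths), and the helper name plus the 'expires' field are
assumptions rather than the proposed patch.

/* Sketch only: expire-on-walk instead of async GC, for an LRU-backed htab. */
static struct htab_elem *lookup_elem_raw_timeout(struct bpf_htab *htab,
						 struct hlist_nulls_head *head,
						 u32 hash, void *key,
						 u32 key_size)
{
	struct hlist_nulls_node *n;
	struct htab_elem *l;
	u64 now = get_jiffies_64();

	/* Caller must hold the bucket lock; a lockless lookup would have to
	 * skip expired entries rather than unlink them.
	 */
	hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
		if (time_after_eq64(now, l->expires)) {
			/* Expired: hand the element back to the LRU's local
			 * free list instead of returning it.
			 */
			hlist_nulls_del_rcu(&l->hash_node);
			bpf_lru_push_free(&htab->lru, &l->lru_node);
			continue;
		}
		if (l->hash == hash && !memcmp(&l->key, key, key_size))
			return l;
	}
	return NULL;
}

The cost argument then rests on buckets staying short, which ties back to
relying on the LRU flavor to keep the map from growing unbounded once full.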


