On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > > On 12/15/20 11:03 PM, Andrii Nakryiko wrote: > > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote: > >> > >> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko > >> <andrii.nakryiko@xxxxxxxxx> wrote: > >>> > >>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote: > >>>> > >>>> From: Cong Wang <cong.wang@xxxxxxxxxxxxx> > >>>> > >>>> This borrows the idea from conntrack and will be used for conntrack in > >>>> bpf too. Each element in a timeout map has a user-specified timeout > >>>> in secs, after it expires it will be automatically removed from the map. > [...] > >>>> char key[] __aligned(8); > >>>> }; > >>>> > >>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab) > >>>> > >>>> for (i = 0; i < htab->n_buckets; i++) { > >>>> INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); > >>>> + atomic_set(&htab->buckets[i].pending, 0); > >>>> if (htab_use_raw_lock(htab)) { > >>>> raw_spin_lock_init(&htab->buckets[i].raw_lock); > >>>> lockdep_set_class(&htab->buckets[i].raw_lock, > >>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr) > >>>> return 0; > >>>> } > >>>> > >>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b) > >>>> +{ > >>>> + if (atomic_fetch_or(1, &b->pending)) > >>>> + return; > >>>> + llist_add(&b->gc_node, &htab->gc_list); > >>>> + queue_work(system_unbound_wq, &htab->gc_work); > >>>> +} > >>> > >>> I'm concerned about each bucket being scheduled individually... And > >>> similarly concerned that each instance of TIMEOUT_HASH will do its own > >>> scheduling independently. Can you think about the way to have a > >>> "global" gc/purging logic, and just make sure that buckets that need > >>> processing would be just internally chained together. So the purging > >>> routing would iterate all the scheduled hashmaps, and within each it > >>> will have a linked list of buckets that need processing? And all that > >>> is done just once each GC period. Not N times for N maps or N*M times > >>> for N maps with M buckets in each. > >> > >> Our internal discussion went to the opposite actually, people here argued > >> one work is not sufficient for a hashtable because there would be millions > >> of entries (max_entries, which is also number of buckets). ;) > > > > I was hoping that it's possible to expire elements without iterating > > the entire hash table every single time, only items that need to be > > processed. Hashed timing wheel is one way to do something like this, > > kernel has to solve similar problems with timeouts as well, why not > > taking inspiration there? > > Couldn't this map be coupled with LRU map for example through flag on map > creation so that the different LRU map flavors can be used with it? For BPF > CT use case we do rely on LRU map to purge 'inactive' entries once full. I > wonder if for that case you then still need to schedule a GC at all.. e.g. > if you hit the condition time_after_eq64(now, entry->expires) you'd just > re-link the expired element from the public htab to e.g. the LRU's local > CPU's free/pending-list instead. I doubt we can use size as a limit to kick off GC or LRU, it must be time-based. And in case of idle, there has to be an async GC, right? Thanks.