On Tue, Dec 15, 2020 at 5:14 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Dec 15, 2020 at 04:22:21PM -0800, Cong Wang wrote:
> > On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> > >
> > > On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
> > > > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > >>
> > > >> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> > > >> <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >>>
> > > >>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > >>>>
> > > >>>> From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > > >>>>
> > > >>>> This borrows the idea from conntrack and will be used for conntrack in
> > > >>>> bpf too. Each element in a timeout map has a user-specified timeout
> > > >>>> in secs; after it expires, it will be automatically removed from the map.
> > > [...]
> > > >>>>         char key[] __aligned(8);
> > > >>>>  };
> > > >>>>
> > > >>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
> > > >>>>
> > > >>>>         for (i = 0; i < htab->n_buckets; i++) {
> > > >>>>                 INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> > > >>>> +               atomic_set(&htab->buckets[i].pending, 0);
> > > >>>>                 if (htab_use_raw_lock(htab)) {
> > > >>>>                         raw_spin_lock_init(&htab->buckets[i].raw_lock);
> > > >>>>                         lockdep_set_class(&htab->buckets[i].raw_lock,
> > > >>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> > > >>>>         return 0;
> > > >>>>  }
> > > >>>>
> > > >>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
> > > >>>> +{
> > > >>>> +       if (atomic_fetch_or(1, &b->pending))
> > > >>>> +               return;
> > > >>>> +       llist_add(&b->gc_node, &htab->gc_list);
> > > >>>> +       queue_work(system_unbound_wq, &htab->gc_work);
> > > >>>> +}
> > > >>>
> > > >>> I'm concerned about each bucket being scheduled individually... And
> > > >>> similarly concerned that each instance of TIMEOUT_HASH will do its own
> > > >>> scheduling independently. Can you think about a way to have a
> > > >>> "global" gc/purging logic, and just make sure that buckets that need
> > > >>> processing are internally chained together? The purging
> > > >>> routine would iterate all the scheduled hashmaps, and within each it
> > > >>> will have a linked list of buckets that need processing. And all that
> > > >>> is done just once each GC period. Not N times for N maps or N*M times
> > > >>> for N maps with M buckets in each.
> > > >>
> > > >> Our internal discussion actually went the opposite way; people here argued
> > > >> one work is not sufficient for a hashtable because there would be millions
> > > >> of entries (max_entries, which is also the number of buckets). ;)
> > > >
> > > > I was hoping that it's possible to expire elements without iterating
> > > > the entire hash table every single time, only items that need to be
> > > > processed. A hashed timing wheel is one way to do something like this;
> > > > the kernel has to solve similar problems with timeouts as well, so why not
> > > > take inspiration there?
> > >
> > > Couldn't this map be coupled with the LRU map, for example through a flag on map
> > > creation, so that the different LRU map flavors can be used with it? For the BPF
> > > CT use case we do rely on the LRU map to purge 'inactive' entries once full. I
> > > wonder if for that case you then still need to schedule a GC at all.. e.g.
> > > if you hit the condition time_after_eq64(now, entry->expires) you'd just
> > > re-link the expired element from the public htab to e.g. the LRU's local
> > > CPU's free/pending-list instead.
> >
> > I doubt we can use size as a limit to kick off GC or LRU, it must be
> > time-based. And in case of idle, there has to be an async GC, right?
>
> Why does it have to be time based?

Because that is how a session times out? For instance, CT uses
nf_conntrack_udp_timeout to time out UDP sessions. Or are we going to
redefine conntrack?

> Why is LRU alone not enough?
> People have implemented a conntracker in bpf using the LRU map.

Sure, and people also implement CT on the native hash map and handle
timeouts with user-space timers. ;)

> Anything extra can be added on top from user space
> which can easily cope with 1 sec granularity.

The problem is never about granularity, it is about how efficiently we
can GC. User space has to scan the whole table one entry at a time, while
the kernel can do this behind the scenes with much lower overhead (a rough
sketch of such a scan is appended below).

Let's say we arm a timer for each entry in user space: that requires a
syscall and bucket locking for each entry. The kernel could do it without
any additional syscalls and could batch the work. Like I said above, we
could have millions of entries, so the overhead would be big in this
scenario.

> Say the kernel does GC and deletes htab entries.
> How will user space know that it's gone? There would need to be

By a lookup.

> an event sent to user space when an entry is being deleted by the kernel.
> But then such an event will be racy. Instead, when timers and expirations
> are done by user space, everything is in sync.

Why does there have to be an event?

Thanks.
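P.S. For illustration, below is a rough sketch of the user-space expiry
scan described above, written against libbpf's bpf_map_get_next_key(),
bpf_map_lookup_elem() and bpf_map_delete_elem(). The flow_key/flow_val
layouts and the 'expires' field are made up for the example; the point is
only that every entry costs a couple of syscalls per scan, on top of the
bucket locking each map operation takes in the kernel.

#include <time.h>
#include <linux/types.h>
#include <bpf/bpf.h>

/* Hypothetical key/value layouts; a real CT map would carry more state. */
struct flow_key {
        __u32 saddr;
        __u32 daddr;
        __u16 sport;
        __u16 dport;
};

struct flow_val {
        __u64 expires;  /* absolute CLOCK_MONOTONIC deadline, in ns */
};

/* Walk the whole map and delete entries whose deadline has passed.
 * The next key is fetched before a potential delete so that removing
 * the current key does not restart the hash table iteration.
 */
static void scan_expired(int map_fd)
{
        struct flow_key key, next;
        struct flow_val val;
        struct timespec ts;
        __u64 now;
        int has_next;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        now = (__u64)ts.tv_sec * 1000000000ULL + ts.tv_nsec;

        if (bpf_map_get_next_key(map_fd, NULL, &key))  /* empty map */
                return;

        do {
                has_next = !bpf_map_get_next_key(map_fd, &key, &next);

                if (!bpf_map_lookup_elem(map_fd, &key, &val) &&
                    val.expires <= now)
                        bpf_map_delete_elem(map_fd, &key);

                key = next;
        } while (has_next);
}

Something like this has to run periodically (or arm one timer per entry),
which is the per-entry syscall and locking overhead mentioned above; a
kernel-side GC can batch the same work per bucket without any syscalls.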