On Wed, Jan 27, 2021 at 10:00 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Tue, Jan 26, 2021 at 11:00 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote: > > > > ret = PTR_ERR(l_new); > > > > + if (ret == -EAGAIN) { > > > > + htab_unlock_bucket(htab, b, hash, flags); > > > > + htab_gc_elem(htab, l_old); > > > > + mod_delayed_work(system_unbound_wq, &htab->gc_work, 0); > > > > + goto again; > > > > > > Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting > > > in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress? > > > > In this case, the old one is scheduled for removal in GC, we just wait for GC > > to finally remove it. It won't stall unless GC itself or the worker scheduler is > > wrong, both of which should be kernel bugs. > > > > If we don't do this, users would get a -E2BIG when it is not too big. I don't > > know a better way to handle this sad situation, maybe returning -EBUSY > > to users and let them call again? > > I think using wq for timers is a non-starter. > Tying a hash/lru map with a timer is not a good idea either. Both xt_hashlimit and nf_conntrack_core use delayed/deferrable works, probably since their beginnings. They seem to have started well. ;) > > I think timers have to be done as independent objects similar to > how the kernel uses them. > Then there will be no question whether lru or hash map needs it. Yeah, this probably could make the code easier, but when we have millions of entries in a map, millions of timers would certainly bring a lot of CPU overhead (timer interrupt storm?). > The bpf prog author will be able to use timers with either. > The prog will be able to use timers without hash maps too. > > I'm proposing a timer map where each object will go through > bpf_timer_setup(timer, callback, flags); > where "callback" is a bpf subprogram. > Corresponding bpf_del_timer and bpf_mod_timer would work the same way > they are in the kernel. > The tricky part is kernel style of using from_timer() to access the > object with additional info. > I think bpf timer map can model it the same way. > At map creation time the value_size will specify the amount of extra > bytes necessary. > Another alternative is to pass an extra data argument to a callback. Hmm, this idea is very interesting. I still think arming a timer, whether a kernel timer or a bpf timer, for each entry is overkill, but we can arm one for each map, something like: bpf_timer_run(interval, bpf_prog, &any_map); so we run 'bpf_prog' on any map every 'interval', but the 'bpf_prog' would have to iterate the whole map during each interval to delete the expired ones. This is probably doable: the timestamps can be stored either as a part of key or value, and bpf_jiffies64() is already available, users would have to discard expired ones after lookup when they are faster than the timer GC. Let me take a deeper look tomorrow. Thanks.