On Wed, Dec 16, 2020 at 10:35 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Tue, Dec 15, 2020 at 4:15 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > On Tue, Dec 15, 2020 at 2:08 PM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > > > >
> > > > > > From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > > > > >
> > > > > > This borrows the idea from conntrack and will be used for conntrack in
> > > > > > bpf too. Each element in a timeout map has a user-specified timeout
> > > > > > in secs, after it expires it will be automatically removed from the map.
> > > > > >
> > > > > > There are two cases here:
> > > > > >
> > > > > > 1. When the timeout map is idle, that is, no one updates or accesses it,
> > > > > > we rely on the idle work to scan the whole hash table and remove
> > > > > > these expired. The idle work is scheduled every 1 sec.
> > > > >
> > > > > Would 1 second be a good period for a lot of cases? Probably would be
> > > > > good to expand on what went into this decision.
> > > >
> > > > Sure, because our granularity is 1 sec, I will add it into changelog.
> > > >
> > >
> > > Granularity of a timeout is not that coupled with the period of
> > > garbage collection. In this case, with 1 second period, you can have
> > > some items not garbage collected for up to 2 seconds due to timing and
> > > races. Just keep that in mind.
> >
> > Well, it is. Let's say we add entries every ms and kick gc every sec, we
> > could end up with thousands of expired entries in hash map, which could
> > be a problem under memory pressure.
>
> You can have the same thousands of entries expired with 1 second
> timeout granularity, so not sure what point you are making. Think

It is impossible for entries to stay expired for more than 1 sec when
the granularity is 1 sec and the GC interval is 1 sec (which is what my
current code does).

> about entries being added 1 every millisecond with 1 second timeout.
> So at time +1ms you have 1 message with timeout at +1001ms, at +2ms,
> you have 2 messages, one expiring at +1001ms and another at +1002ms.
> So when your 1 second period GC kicks in at, say, +1000ms, it discards
> nothing. By the time it kicks in second time at +2000ms, you are going
> to expire 1000 items, but you could have expired 500 at +1500ms, if
> your period was 500ms, for example. With a 200ms period, you'd have at
> most 200 not expired entries.
>
> This is why I'm saying granularity of timeout units is not coupled
> with the GC period. I hope this makes it clearer. More granular
> timeout units give more flexibility and power to users without
> changing anything else.

The point is that the smaller the granularity, the more entries can
expire within a single GC interval. This becomes an issue when we get a
burst within one interval, which would cause memory pressure for the
rest of that interval.

> But relying purely on GC period is bad, because with more frequent
> updates you can accumulate almost arbitrary number of expired entries
> between two GC passes. No matter the timeout granularity.

True, which is why xt_hashlimit simply lets users pick the GC interval.
In fact, my initial implementation of the timeout map exposed the GC
interval to users too; I removed it when I learned that 1 sec
granularity is enough for the conntrack use case (see
Documentation/networking/nf_conntrack-sysctl.txt). Anyway, it is not as
simple as just converting sec to ms here; the GC interval matters more
as the granularity becomes smaller.
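Just to illustrate the shape of that, an idle GC with a user-selectable
interval could look roughly like the sketch below. This is illustrative
only: the struct and the interval_ms knob are hypothetical and not what
the current patch does (the patch hard-codes 1 sec and keeps its state
in struct bpf_htab), and a real scan would of course have to follow the
bucket locking/RCU rules in kernel/bpf/hashtab.c.

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/types.h>

/* Hypothetical per-map GC state, for illustration only. */
struct timeout_map_gc {
	struct delayed_work work;
	u32 interval_ms;	/* user-picked GC period; 1000ms in the patch */
};

static void timeout_map_scan(struct timeout_map_gc *gc)
{
	/* Walk the buckets and drop elements whose expiration (stored in
	 * jiffies) is in the past, i.e. time_after(jiffies, elem->expires).
	 * The locking/RCU details of hashtab.c are omitted here.
	 */
}

static void timeout_map_gc_work(struct work_struct *work)
{
	struct timeout_map_gc *gc =
		container_of(to_delayed_work(work), struct timeout_map_gc, work);

	timeout_map_scan(gc);
	/* Re-arm: the scan period is independent of the timeout granularity. */
	queue_delayed_work(system_unbound_wq, &gc->work,
			   msecs_to_jiffies(gc->interval_ms));
}

static void timeout_map_gc_start(struct timeout_map_gc *gc, u32 interval_ms)
{
	gc->interval_ms = interval_ms;
	INIT_DELAYED_WORK(&gc->work, timeout_map_gc_work);
	queue_delayed_work(system_unbound_wq, &gc->work,
			   msecs_to_jiffies(interval_ms));
}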
> > > > > >
> > > > > >  enum {
> > > > > >  	BPF_ANY		= 0, /* create new element or update existing */
> > > > > >  	BPF_NOEXIST	= 1, /* create new element if it didn't exist */
> > > > > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > > > > > index f0b7b54fa3a8..178cb376c397 100644
> > > > > > --- a/kernel/bpf/hashtab.c
> > > > > > +++ b/kernel/bpf/hashtab.c
> > > > > > @@ -8,6 +8,8 @@
> > > > > >  #include <linux/filter.h>
> > > > > >  #include <linux/rculist_nulls.h>
> > > > > >  #include <linux/random.h>
> > > > > > +#include <linux/llist.h>
> > > > > > +#include <linux/workqueue.h>
> > > > > >  #include <uapi/linux/btf.h>
> > > > > >  #include <linux/rcupdate_trace.h>
> > > > > >  #include "percpu_freelist.h"
> > > > > > @@ -84,6 +86,8 @@ struct bucket {
> > > > > >  		raw_spinlock_t raw_lock;
> > > > > >  		spinlock_t lock;
> > > > > >  	};
> > > > > > +	struct llist_node gc_node;
> > > > > > +	atomic_t pending;
> > > > >
> > > > > HASH is an extremely frequently used type of map, and oftentimes with
> > > > > a lot of entries/buckets. I don't think users of normal
> > > > > BPF_MAP_TYPE_HASH should pay the price of way more niche hashmap with
> > > > > timeouts. So I think it's not appropriate to increase the size of the
> > > > > struct bucket here.
> > > >
> > > > I understand that, but what's a better way to do this? I can wrap it up
> > > > on top of struct bucket for sure, but it would need to change a lot of code.
> > > > So, basically code reuse vs. struct bucket size increase. ;)
> > >
> > > I think not paying potentially lots of memory for unused features
> > > wins. Some struct embedding might work. Or just better code reuse.
> > > Please think this through, don't wait for me to write the code for
> > > you.
> >
> > I perfectly understand this point, but other reviewers could easily argue
> > why not just reuse the existing hashmap code given they are pretty much
> > similar.
> >
> > I personally have no problem duplicating the code, but I need to justify it,
> > right? :-/
>
> Minimize duplication of the code, no one said copy/paste all the code.
> But memory bloat is a real problem and should be justification enough
> to at least consider other options.

Sure, I have no problem with this. The question is how do we balance?
Is rewriting 200 lines of code to save 8 bytes per entry acceptable?
What about rewriting 2000 lines of code? Do people prefer to review 200
or 2000 (or whatever the number is) lines of code? Or do people just
want a minimal change for easier review?

> > [...]
> > > >
> > > > Similarly, please suggest how to expand struct htab_elem without changing
> > > > a lot of code. I also tried to find some hole in the struct, but I
> > > > couldn't, so I
> > > > ran out of ideas here.
> > >
> > > I mentioned above, you can have your own struct and embed htab_elem
> > > inside. It might need some refactoring, of course.
> >
> > So increasing 8 bytes of struct htab_elem is a solid reason to change
> > _potentially_ all of the hash map code? It does not sound solid to me,
> > at least it is arguable.
>
> 8 bytes for htab_elem and 16 bytes for bucket (which equals
> max_entries). Solid enough for me. But I certainly hope that not all
> of the hashmap code would need to be changed.

I can try, but I have to say that in the worst case I would have to
duplicate most of the hashmap lookup/update code. I'd estimate a few
hundred more lines of code. And I want to make sure this is also
acceptable to reviewers, so that I don't waste my time.
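Not to preempt the htab_elem discussion below, but for the bucket side
the "struct embedding" direction could look something like this rough
sketch. The names are hypothetical; it assumes the timeout map
allocates its own bucket array, and the existing bucket helpers in
hashtab.c would still need to be factored out to take a struct bucket *:

#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/atomic.h>

/* Illustrative only: plain hash maps keep using struct bucket unchanged,
 * and only the timeout map pays for the extra GC bookkeeping.
 */
struct htab_timeout_bucket {
	struct bucket b;		/* reused by the existing helpers */
	struct llist_node gc_node;
	atomic_t pending;
};

/* The GC path gets back to the wrapper from the embedded bucket. */
static inline struct htab_timeout_bucket *to_timeout_bucket(struct bucket *b)
{
	return container_of(b, struct htab_timeout_bucket, b);
}

The cost is exactly the refactoring being discussed here: every place
that indexes htab->buckets[] would have to go through a helper that
knows which bucket layout the map uses.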
>
> >
> > I also doubt I could really wrap up on top of htab_elem, as it assumes
> > key and value are stored at the end. And these structs are internal,
> > it is really hard to factor out.
>
> I didn't do the exercise of trying to implement this, so discussing
> this is a bit meaningless at this time. But
>
> struct htab_elem_timeout {
>   ... my timeout related stuff ...
>   struct htab_elem elem;
> };
>
> would preserve that property.

Sure, but once the type changes, literally all of the code that touches
it has to be changed. We cannot just pass timeout_elem->elem to
htab_map_update_elem(), as it is internal.

> > > > > >
> > > > > >  	char key[] __aligned(8);
> > > > > >  };
> > > > > >
> > > > > > @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
> > > > > >
> > > > > >  	for (i = 0; i < htab->n_buckets; i++) {
> > > > > >  		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> > > > > > +		atomic_set(&htab->buckets[i].pending, 0);
> > > > > >  		if (htab_use_raw_lock(htab)) {
> > > > > >  			raw_spin_lock_init(&htab->buckets[i].raw_lock);
> > > > > >  			lockdep_set_class(&htab->buckets[i].raw_lock,
> > > > > > @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> > > > > >  	return 0;
> > > > > >  }
> > > > > >
> > > > > > +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
> > > > > > +{
> > > > > > +	if (atomic_fetch_or(1, &b->pending))
> > > > > > +		return;
> > > > > > +	llist_add(&b->gc_node, &htab->gc_list);
> > > > > > +	queue_work(system_unbound_wq, &htab->gc_work);
> > > > > > +}
> > > > >
> > > > > I'm concerned about each bucket being scheduled individually... And
> > > > > similarly concerned that each instance of TIMEOUT_HASH will do its own
> > > > > scheduling independently. Can you think about the way to have a
> > > > > "global" gc/purging logic, and just make sure that buckets that need
> > > > > processing would be just internally chained together. So the purging
> > > > > routing would iterate all the scheduled hashmaps, and within each it
> > > > > will have a linked list of buckets that need processing? And all that
> > > > > is done just once each GC period. Not N times for N maps or N*M times
> > > > > for N maps with M buckets in each.
> > > >
> > > > Our internal discussion went to the opposite actually, people here argued
> > > > one work is not sufficient for a hashtable because there would be millions
> > > > of entries (max_entries, which is also number of buckets). ;)
> > >
> > > I was hoping that it's possible to expire elements without iterating
> > > the entire hash table every single time, only items that need to be
> > > processed. Hashed timing wheel is one way to do something like this,
> >
> > How could we know which ones are expired without scanning the
> > whole table? They are clearly not sorted even within a bucket. Sorting
> > them with expiration? Slightly better, as we can just stop at the first
> > non-expired but with an expense of slowing down the update path.
>
> Have you looked up "hashed timing wheel"?

I thought you meant the timer wheel in the Linux kernel; I cannot
immediately see how that could be used for our GC here. I guess you are
suggesting that we link each entry based on its expiration second?
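If that is the idea, a 1-sec-granularity hashed timing wheel might look
roughly like the sketch below: a small array of lists indexed by the
expiration second modulo the wheel size, so a GC pass only touches the
slots whose second has already passed instead of scanning every bucket.
This is only a sketch of the concept (hypothetical names, no locking,
not wired into htab_elem):

#include <linux/list.h>
#include <linux/jiffies.h>

#define WHEEL_SLOTS	64	/* must be >= the maximum timeout in seconds */

struct timeout_wheel {
	struct hlist_head slots[WHEEL_SLOTS];
	unsigned long next_sec;	/* next second to process; init to jiffies / HZ */
};

struct wheel_entry {
	struct hlist_node node;
	unsigned long expires;	/* absolute expiration time, in seconds */
};

/* Update path: hash the entry into the slot of its expiration second. */
static void wheel_add(struct timeout_wheel *w, struct wheel_entry *e,
		      unsigned long timeout_secs)
{
	e->expires = jiffies / HZ + timeout_secs;
	hlist_add_head(&e->node, &w->slots[e->expires % WHEEL_SLOTS]);
}

/* GC path: only walk the slots whose second has already elapsed. */
static void wheel_expire(struct timeout_wheel *w)
{
	unsigned long now = jiffies / HZ;
	struct hlist_node *tmp;
	struct wheel_entry *e;

	for (; w->next_sec <= now; w->next_sec++) {
		struct hlist_head *slot = &w->slots[w->next_sec % WHEEL_SLOTS];

		hlist_for_each_entry_safe(e, tmp, slot, node) {
			if (e->expires > now)
				continue;	/* belongs to a later lap of the wheel */
			hlist_del(&e->node);
			/* ... unlink and free the corresponding map element ... */
		}
	}
}

The update path stays O(1) and a GC pass does work roughly proportional
to what actually expired, at the price of WHEEL_SLOTS list heads per map
and one more list linkage per element.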
> > >
> > > kernel has to solve similar problems with timeouts as well, why not
> > > taking inspiration there?
> >
> > Mind to point out which similar problems in the kernel?
> >
> > If you mean inspiration from conntrack, it is even worse, it uses multiple
> > locking and locks on fast path too. I also looked at xt_hashlimit, it is not
> > any better either.
>
> I was thinking about epoll timeouts, but I don't know all the
> implementation details, of course. My point was that kernel solves the
> problem of maintaining a lot of uncorrelated timeouts already (epoll,
> timeout signals, etc).

They possibly just use a timer or delayed work, etc. This is why I only
looked at similar timeout hash maps.

Thanks!