On Wed, Dec 16, 2020 at 10:35 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Tue, Dec 15, 2020 at 4:15 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > On Tue, Dec 15, 2020 at 2:08 PM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > > > >
> > > > > > From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > > > > >
> > > > > > This borrows the idea from conntrack and will be used for conntrack in
> > > > > > bpf too. Each element in a timeout map has a user-specified timeout
> > > > > > in secs, after it expires it will be automatically removed from the map.
> > > > > >
> > > > > > There are two cases here:
> > > > > >
> > > > > > 1. When the timeout map is idle, that is, no one updates or accesses it,
> > > > > > we rely on the idle work to scan the whole hash table and remove
> > > > > > these expired. The idle work is scheduled every 1 sec.
> > > > >
> > > > > Would 1 second be a good period for a lot of cases? Probably would be
> > > > > good to expand on what went into this decision.
> > > >
> > > > Sure, because our granularity is 1 sec, I will add it into changelog.
> > > >
> > >
> > > Granularity of a timeout is not that coupled with the period of
> > > garbage collection. In this case, with 1 second period, you can have
> > > some items not garbage collected for up to 2 seconds due to timing and
> > > races. Just keep that in mind.
> >
> > Well, it is. Let's say we add entries every ms and kick gc every sec, we
> > could end up with thousands of expired entries in hash map, which could
> > be a problem under memory pressure.
>
> You can have the same thousands of entries expired with 1 second
> timeout granularity, so not sure what point you are making. Think

It is impossible for entries to stay expired for more than 1 sec when
the granularity is 1 sec and the GC interval is 1 sec (which is what my
current code does).

> about entries being added 1 every millisecond with 1 second timeout.
> So at time +1ms you have 1 message with timeout at +1001ms, at +2ms,
> you have 2 messages, one expiring at +1001ms and another at +1002ms.
> So when your 1 second period GC kicks in at, say, +1000ms, it discards
> nothing. By the time it kicks in second time at +2000ms, you are going
> to expire 1000 items, but you could have expired 500 at +1500ms, if
> your period was 500ms, for example. With a 200ms period, you'd have at
> most 200 not expired entries.
>
> This is why I'm saying granularity of timeout units is not coupled
> with the GC period. I hope this makes it clearer. More granular
> timeout units give more flexibility and power to users without
> changing anything else.

The point is that the smaller the granularity, the more entries can
expire within a single GC interval. This becomes an issue when we get a
burst within one interval, which would cause memory pressure for the
rest of that interval.

> But relying purely on GC period is bad, because with more frequent
> updates you can accumulate almost arbitrary number of expired entries
> between two GC passes. No matter the timeout granularity.

True, which is why xt_hashlimit simply lets users pick the GC interval.
In fact, my initial implementation of the timeout map exposed the GC
interval to users too; I removed it when I learned that 1 sec
granularity is enough for the conntrack use case (see
Documentation/networking/nf_conntrack-sysctl.txt). Anyway, it is not as
simple as just converting sec to ms here; the GC interval matters more
as the granularity becomes smaller.
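Just to illustrate the shape of that, an idle GC with a user-selectable
interval could look roughly like the sketch below. This is illustrative
only: the struct and the interval_ms knob are hypothetical and not what
the current patch does (the patch hard-codes 1 sec and keeps its state
in struct bpf_htab), and a real scan would of course have to follow the
bucket locking/RCU rules in kernel/bpf/hashtab.c.

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/types.h>

/* Hypothetical per-map GC state, for illustration only. */
struct timeout_map_gc {
	struct delayed_work work;
	u32 interval_ms;	/* user-picked GC period; 1000ms in the patch */
};

static void timeout_map_scan(struct timeout_map_gc *gc)
{
	/* Walk the buckets and drop elements whose expiration (stored in
	 * jiffies) is in the past, i.e. time_after(jiffies, elem->expires).
	 * The locking/RCU details of hashtab.c are omitted here.
	 */
}

static void timeout_map_gc_work(struct work_struct *work)
{
	struct timeout_map_gc *gc =
		container_of(to_delayed_work(work), struct timeout_map_gc, work);

	timeout_map_scan(gc);
	/* Re-arm: the scan period is independent of the timeout granularity. */
	queue_delayed_work(system_unbound_wq, &gc->work,
			   msecs_to_jiffies(gc->interval_ms));
}

static void timeout_map_gc_start(struct timeout_map_gc *gc, u32 interval_ms)
{
	gc->interval_ms = interval_ms;
	INIT_DELAYED_WORK(&gc->work, timeout_map_gc_work);
	queue_delayed_work(system_unbound_wq, &gc->work,
			   msecs_to_jiffies(interval_ms));
}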
> > > > > >
> > > > > >  enum {
> > > > > >  	BPF_ANY		= 0, /* create new element or update existing */
> > > > > >  	BPF_NOEXIST	= 1, /* create new element if it didn't exist */
> > > > > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > > > > > index f0b7b54fa3a8..178cb376c397 100644
> > > > > > --- a/kernel/bpf/hashtab.c
> > > > > > +++ b/kernel/bpf/hashtab.c
> > > > > > @@ -8,6 +8,8 @@
> > > > > >  #include <linux/filter.h>
> > > > > >  #include <linux/rculist_nulls.h>
> > > > > >  #include <linux/random.h>
> > > > > > +#include <linux/llist.h>
> > > > > > +#include <linux/workqueue.h>
> > > > > >  #include <uapi/linux/btf.h>
> > > > > >  #include <linux/rcupdate_trace.h>
> > > > > >  #include "percpu_freelist.h"
> > > > > > @@ -84,6 +86,8 @@ struct bucket {
> > > > > >  		raw_spinlock_t raw_lock;
> > > > > >  		spinlock_t lock;
> > > > > >  	};
> > > > > > +	struct llist_node gc_node;
> > > > > > +	atomic_t pending;
> > > > >
> > > > > HASH is an extremely frequently used type of map, and oftentimes with
> > > > > a lot of entries/buckets. I don't think users of normal
> > > > > BPF_MAP_TYPE_HASH should pay the price of way more niche hashmap with
> > > > > timeouts. So I think it's not appropriate to increase the size of the
> > > > > struct bucket here.
> > > >
> > > > I understand that, but what's a better way to do this? I can wrap it up
> > > > on top of struct bucket for sure, but it would need to change a lot of code.
> > > > So, basically code reuse vs. struct bucket size increase. ;)
> > >
> > > I think not paying potentially lots of memory for unused features
> > > wins. Some struct embedding might work. Or just better code reuse.
> > > Please think this through, don't wait for me to write the code for
> > > you.
> >
> > I perfectly understand this point, but other reviewers could easily argue
> > why not just reuse the existing hashmap code given they are pretty much
> > similar.
> >
> > I personally have no problem duplicating the code, but I need to justify it,
> > right? :-/
>
> Minimize duplication of the code, no one said copy/paste all the code.
> But memory bloat is a real problem and should be justification enough
> to at least consider other options.

Sure, I have no problem with this. The question is how do we balance?
Is rewriting 200 lines of code to save 8 bytes per entry acceptable?
What about rewriting 2000 lines of code? Do people prefer to review 200
or 2000 (or whatever the number is) lines of code? Or do people just
want a minimal change for easier review?

> > [...]
> > > >
> > > > Similarly, please suggest how to expand struct htab_elem without changing
> > > > a lot of code. I also tried to find some hole in the struct, but I
> > > > couldn't, so I
> > > > ran out of ideas here.
> > >
> > > I mentioned above, you can have your own struct and embed htab_elem
> > > inside. It might need some refactoring, of course.
> >
> > So increasing 8 bytes of struct htab_elem is a solid reason to change
> > _potentially_ all of the hash map code? It does not sound solid to me,
> > at least it is arguable.
>
> 8 bytes for htab_elem and 16 bytes for bucket (which equals
> max_entries). Solid enough for me. But I certainly hope that not all
> of the hashmap code would need to be changed.

I can try, but I have to say that in the worst case I would have to
duplicate most of the hashmap lookup/update code. I'd estimate a few
hundred more lines of code. And I want to make sure this is also
acceptable to reviewers, so that I don't waste my time.
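Not to preempt the htab_elem discussion below, but for the bucket side
the "struct embedding" direction could look something like this rough
sketch. The names are hypothetical; it assumes the timeout map
allocates its own bucket array, and the existing bucket helpers in
hashtab.c would still need to be factored out to take a struct bucket *:

#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/atomic.h>

/* Illustrative only: plain hash maps keep using struct bucket unchanged,
 * and only the timeout map pays for the extra GC bookkeeping.
 */
struct htab_timeout_bucket {
	struct bucket b;		/* reused by the existing helpers */
	struct llist_node gc_node;
	atomic_t pending;
};

/* The GC path gets back to the wrapper from the embedded bucket. */
static inline struct htab_timeout_bucket *to_timeout_bucket(struct bucket *b)
{
	return container_of(b, struct htab_timeout_bucket, b);
}

The cost is exactly the refactoring being discussed here: every place
that indexes htab->buckets[] would have to go through a helper that
knows which bucket layout the map uses.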
>
> >
> > I also doubt I could really wrap up on top of htab_elem, as it assumes
> > key and value are stored at the end. And these structs are internal,
> > it is really hard to factor out.
>
> I didn't do the exercise of trying to implement this, so discussing
> this is a bit meaningless at this time. But
>
> struct htab_elem_timeout {
>   ... my timeout related stuff ...
>   struct htab_elem elem;
> };
>
> would preserve that property.

Sure, but once the type changes, literally all of the code that touches
it has to be changed. We cannot just pass timeout_elem->elem to
htab_map_update_elem(), as it is internal.

> > > > > >
> > > > > >  	char key[] __aligned(8);
> > > > > >  };
> > > > > >
> > > > > > @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
> > > > > >
> > > > > >  	for (i = 0; i < htab->n_buckets; i++) {
> > > > > >  		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> > > > > > +		atomic_set(&htab->buckets[i].pending, 0);
> > > > > >  		if (htab_use_raw_lock(htab)) {
> > > > > >  			raw_spin_lock_init(&htab->buckets[i].raw_lock);
> > > > > >  			lockdep_set_class(&htab->buckets[i].raw_lock,
> > > > > > @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> > > > > >  	return 0;
> > > > > >  }
> > > > > >
> > > > > > +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
> > > > > > +{
> > > > > > +	if (atomic_fetch_or(1, &b->pending))
> > > > > > +		return;
> > > > > > +	llist_add(&b->gc_node, &htab->gc_list);
> > > > > > +	queue_work(system_unbound_wq, &htab->gc_work);
> > > > > > +}
> > > > >
> > > > > I'm concerned about each bucket being scheduled individually... And
> > > > > similarly concerned that each instance of TIMEOUT_HASH will do its own
> > > > > scheduling independently. Can you think about the way to have a
> > > > > "global" gc/purging logic, and just make sure that buckets that need
> > > > > processing would be just internally chained together. So the purging
> > > > > routing would iterate all the scheduled hashmaps, and within each it
> > > > > will have a linked list of buckets that need processing? And all that
> > > > > is done just once each GC period. Not N times for N maps or N*M times
> > > > > for N maps with M buckets in each.
> > > >
> > > > Our internal discussion went to the opposite actually, people here argued
> > > > one work is not sufficient for a hashtable because there would be millions
> > > > of entries (max_entries, which is also number of buckets). ;)
> > >
> > > I was hoping that it's possible to expire elements without iterating
> > > the entire hash table every single time, only items that need to be
> > > processed. Hashed timing wheel is one way to do something like this,
> >
> > How could we know which ones are expired without scanning the
> > whole table? They are clearly not sorted even within a bucket. Sorting
> > them with expiration? Slightly better, as we can just stop at the first
> > non-expired but with an expense of slowing down the update path.
>
> Have you looked up "hashed timing wheel"?

I thought you meant the timer wheel in the Linux kernel; I cannot
immediately see how that could be used for our GC here. I guess you are
suggesting that we link each entry based on its expiration second?
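If that is the idea, a 1-sec-granularity hashed timing wheel might look
roughly like the sketch below: a small array of lists indexed by the
expiration second modulo the wheel size, so a GC pass only touches the
slots whose second has already passed instead of scanning every bucket.
This is only a sketch of the concept (hypothetical names, no locking,
not wired into htab_elem):

#include <linux/list.h>
#include <linux/jiffies.h>

#define WHEEL_SLOTS	64	/* must be >= the maximum timeout in seconds */

struct timeout_wheel {
	struct hlist_head slots[WHEEL_SLOTS];
	unsigned long next_sec;	/* next second to process; init to jiffies / HZ */
};

struct wheel_entry {
	struct hlist_node node;
	unsigned long expires;	/* absolute expiration time, in seconds */
};

/* Update path: hash the entry into the slot of its expiration second. */
static void wheel_add(struct timeout_wheel *w, struct wheel_entry *e,
		      unsigned long timeout_secs)
{
	e->expires = jiffies / HZ + timeout_secs;
	hlist_add_head(&e->node, &w->slots[e->expires % WHEEL_SLOTS]);
}

/* GC path: only walk the slots whose second has already elapsed. */
static void wheel_expire(struct timeout_wheel *w)
{
	unsigned long now = jiffies / HZ;
	struct hlist_node *tmp;
	struct wheel_entry *e;

	for (; w->next_sec <= now; w->next_sec++) {
		struct hlist_head *slot = &w->slots[w->next_sec % WHEEL_SLOTS];

		hlist_for_each_entry_safe(e, tmp, slot, node) {
			if (e->expires > now)
				continue;	/* belongs to a later lap of the wheel */
			hlist_del(&e->node);
			/* ... unlink and free the corresponding map element ... */
		}
	}
}

The update path stays O(1) and a GC pass does work roughly proportional
to what actually expired, at the price of WHEEL_SLOTS list heads per map
and one more list linkage per element.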
> > >
> > > kernel has to solve similar problems with timeouts as well, why not
> > > taking inspiration there?
> >
> > Mind to point out which similar problems in the kernel?
> >
> > If you mean inspiration from conntrack, it is even worse, it uses multiple
> > locking and locks on fast path too. I also looked at xt_hashlimit, it is not
> > any better either.
>
> I was thinking about epoll timeouts, but I don't know all the
> implementation details, of course. My point was that kernel solves the
> problem of maintaining a lot of uncorrelated timeouts already (epoll,
> timeout signals, etc).

They possibly just use a timer or delayed work, etc. This is why I only
looked at similar timeout hash maps.

Thanks!