On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@xxxxxxxxxxxxx> wrote:
> > >
> > > > LRU logic doesn't kick in until the map is full.
> > >
> > > In fact, it can: a reproducible example is in the self-test from this patch
> > > series. In the test N threads try to insert random values for keys 1..3000
> > > simultaneously. As a result, the map may contain any number of elements,
> > > typically 100 to 1000 (never the full 3000, which is also less than the map
> > > size). So a user can't even roughly estimate the number of elements in the
> > > LRU map based on the number of updates (with unique keys). A per-cpu counter
> > > inc/dec'ed from the kernel side would solve this.
> >
> > That's odd and unexpected.
> > Definitely something to investigate and fix in the LRU map.
> >
> > Pls cc Martin in the future.
> >
> > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > >
> > > This makes sense, yes, especially since LRU evictions may happen randomly,
> > > without the map being full. I will step back with this patch until we
> > > investigate whether we can replace LRUs with hashes.
> > >
> > > Thanks for the comments!
>
> Thinking about it more...
> since you're proposing to use a percpu counter unconditionally for prealloc,
> and the percpu_counter_add_batch() logic is batched,
> it could actually be acceptable if it's paired with non-API access.
> Like, another patch can add a generic kfunc to do __percpu_counter_sum(),
> and in a 3rd patch the map iterator in kernel/bpf/preload/iterators/iterators.bpf.c
> can be extended to print the element count, so the user has a
> convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
>
> But the additional logic of percpu_counter_add_batch() might get in the way
> of debugging eventually.
> If we want stats, then we can have a plain per-cpu u32 in the base
> struct bpf_map that most maps, except arrays, inc/dec on update/delete.
> A kfunc to iterate over the per-cpu values is still necessary.
> That way we will be able to see not only the number of elements, but also
> detect bad usage where one cpu only adds and another cpu only deletes
> elements, and other cpu imbalances.

This looks to me like two different things: one is a kfunc to get the
current counter (e.g., bpf_map_elements_count), the other is a kfunc to
dump some more detailed stats (e.g., per-cpu values or more).

My patch, slightly modified, addresses the first goal: most maps of
interest already keep a counter in some form (sometimes just an atomic_t,
or a u64 plus a lock). If we add a percpu (non-batch) counter for
pre-allocated hashmaps, then we are done: the new kfunc can read the
counter based on the map type (see the sketches at the end of this mail).

If/when there is a need to provide per-cpu element statistics, or
something more sophisticated, that can be done without changing the API
of the bpf_map_elements_count() kfunc.

Would this work?

> but debugging and stats is a slippery slope. These simple stats won't be
> enough and people will be tempted to add more and more.
> So I agree that there is a need for bpf map observability,
> but it is not clear whether hard-coded stats are the solution.
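
P.S. To make the counting half concrete, here is a minimal sketch of what
I have in mind. It assumes struct bpf_map grows a per-cpu field (spelled
elem_count below) allocated at map creation; the helper and kfunc names
are made up for illustration, not existing kernel API:

/* Sketch only: assumes struct bpf_map gains a per-cpu counter,
 * e.g. "s64 __percpu *elem_count", allocated at map creation.
 * All names below are illustrative.
 */

/* Update/delete paths inc/dec the current CPU's slot, so the
 * fast path touches only local memory, no atomics.
 */
static void bpf_map_inc_elem_count(struct bpf_map *map)
{
	this_cpu_inc(*map->elem_count);
}

static void bpf_map_dec_elem_count(struct bpf_map *map)
{
	this_cpu_dec(*map->elem_count);
}

/* The kfunc folds the per-cpu slots into one total; the result is
 * approximate under concurrent updates, which is fine for debugging.
 */
__bpf_kfunc s64 bpf_map_sum_elem_count(const struct bpf_map *map)
{
	s64 ret = 0;
	int cpu;

	if (!map || !map->elem_count)
		return -EINVAL;

	for_each_possible_cpu(cpu)
		ret += *per_cpu_ptr(map->elem_count, cpu);

	return ret;
}

Keeping the raw per-cpu slots around (instead of a batched
percpu_counter) also leaves the door open for the dump-style kfunc
later, without changing this one's API.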
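
And a sketch of the iterators.bpf.c extension you mention, so that
'cat /sys/fs/bpf/maps.debug' would show the count per map. This assumes
the kfunc above exists and is callable from iterator programs:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Hypothetical kfunc from the sketch above, declared as a ksym. */
extern s64 bpf_map_sum_elem_count(const struct bpf_map *map) __ksym;

SEC("iter/bpf_map")
int dump_bpf_map(struct bpf_iter__bpf_map *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct bpf_map *map = ctx->map;

	if (!map)
		return 0;

	/* One line per map: id, name, current element count. */
	BPF_SEQ_PRINTF(seq, "%4u %-16s %10lld\n",
		       map->id, map->name, bpf_map_sum_elem_count(map));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";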