On 6/24/23 2:17 AM, Alexei Starovoitov wrote:
On Fri, Jun 23, 2023 at 2:53 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
On 6/22/23 11:53 AM, Anton Protopopov wrote:
This series adds a mechanism for maps to maintain per-cpu counters of elements,
updated on insertions/deletions. The sum of these counters can be accessed via a
new kfunc from a map iterator program.
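For illustration, a map iterator program reading the summed counter via the new
kfunc might look roughly like the sketch below. The kfunc name is taken from
patch 1, but its exact signature and the output format here are assumptions, not
code from the series.

// SPDX-License-Identifier: GPL-2.0
/* Hedged sketch: bpf_map iterator program printing the summed element
 * counter of every map. The kfunc signature is assumed. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

extern s64 bpf_map_sum_elements_counter(struct bpf_map *map) __ksym;

SEC("iter/bpf_map")
int dump_elem_count(struct bpf_iter__bpf_map *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct bpf_map *map = ctx->map;

	if (!map)
		return 0;

	BPF_SEQ_PRINTF(seq, "map id %u: %lld elements\n",
		       map->id, bpf_map_sum_elements_counter(map));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";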
The following patches are present in the series:
* Patch 1 adds a generic per-cpu counter to struct bpf_map (see the sketch after this list)
* Patch 2 utilizes this mechanism for hash-based maps
* Patch 3 extends the preloaded map iterator to dump the sum
* Patch 4 adds a self-test for the change
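To make the mechanism concrete, a rough sketch of what patches 1 and 2 describe
is shown below; the field and helper names are illustrative and may differ from
the actual patches.

/* Illustrative only: a per-cpu element counter hung off struct bpf_map,
 * bumped from the update/delete paths of hash-based maps. */
struct bpf_map {
	/* ... existing fields ... */
	s64 __percpu *elem_count;
};

static inline void bpf_map_inc_elem_count(struct bpf_map *map)
{
	this_cpu_inc(*map->elem_count);
}

static inline void bpf_map_dec_elem_count(struct bpf_map *map)
{
	this_cpu_dec(*map->elem_count);
}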
The reason for adding this functionality in our case (Cilium) is to get
signals about how full some heavily used maps are and what the actual dynamic
profile of map capacity is. For LRU maps it is impossible to get this
information any other way. See also [1].
This is a v2 of https://lore.kernel.org/bpf/20230531110511.64612-1-aspsk@xxxxxxxxxxxxx/T/#t
It was rewritten according to the previous comments. I've turned this series into
an RFC for two reasons:
1) This patch only works on systems where this_cpu_{inc,dec} is atomic for s64.
Systems which may write an s64 non-atomically would require some locking
mechanism to prevent readers from observing torn values via the
bpf_map_sum_elements_counter() kfunc (see patch 1 and the sketch further below).
2) In comparison with v1, we're adding extra instructions per map operation
(for preallocated as well as non-preallocated maps). The only functionality
we're interested in at the moment is the number of elements present in a map,
not per-cpu statistics. This could be better achieved with the v1 version,
which only adds computations for preallocated maps.
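To illustrate point 1: the summing kfunc presumably looks something like the
sketch below (reusing the illustrative elem_count field from above; the real
implementation may differ). The reader does plain per-cpu loads, so if the write
side cannot update an s64 atomically, a concurrent reader could observe a torn,
half-written value.

/* Hedged sketch of the summing kfunc: walk all possible CPUs and add up
 * the per-cpu counters with plain loads. */
__bpf_kfunc s64 bpf_map_sum_elements_counter(struct bpf_map *map)
{
	s64 sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(map->elem_count, cpu);

	return sum;
}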
So, the question is: would it be fine to do the changes in the following way:
* extend the preallocated hash maps to populate percpu batch counters as in v1 (see the sketch further below)
* add a kfunc as in v2 to get the current sum
This works because
* nobody at the moment actually requires the per-cpu statistics
* this implementation can be transparently turned into per-cpu statistics if
such a need arises in practice (the only thing to change would be to
re-implement the kfunc and, maybe, add more kfuncs to get per-cpu stats)
* the "v1 way" is the least intrusive: it only affects preallocated maps, as
other maps already provide the required functionality
[1] https://lpc.events/event/16/contributions/1368/
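"Percpu batch counters as in v1" presumably refers to something along the lines
of the kernel's struct percpu_counter, where cheap per-cpu deltas are folded into
a shared counter once they exceed a batch size. The snippet below is a generic
illustration of that API, not code from the v1 series.

#include <linux/percpu_counter.h>

static struct percpu_counter elem_count;	/* e.g. embedded in the map struct */

static int counter_init(void)
{
	/* allocates the per-cpu counters backing the shared count */
	return percpu_counter_init(&elem_count, 0, GFP_KERNEL);
}

static void on_insert(void)
{
	/* cheap per-cpu add, folded into the shared count every 'batch' updates */
	percpu_counter_add_batch(&elem_count, 1, 256);
}

static void on_delete(void)
{
	percpu_counter_add_batch(&elem_count, -1, 256);
}

static s64 current_elements(void)
{
	/* exact sum across CPUs; percpu_counter_read() would be approximate */
	return percpu_counter_sum(&elem_count);
}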
v1 -> v2:
- make the counters generic part of struct bpf_map
- don't use map_info and /proc/self/fdinfo in favor of a kfunc
Tbh, I did like the v1 approach a bit better. We are trying to bend over backwards just
so that we don't add things to uapi, but in the end we are also adding it to the
maps.debug, etc (yes, it has .debug in the name and all) ...
I think we should keep bending even more backwards to avoid uapi burden.
All new features should be non-uapi no matter how many people
say "I'll definitely use it".
or as an extensible bpf_map_info stats extension in case there is some agreement?
I'd rather not.
bpf_map_info returns what user space sent to the kernel earlier.
stats or anything that user space didn't explicitly give earlier
is quite different.
Same goes for bpf_prog_info.
The only thing we added there that doesn't fit the above definition is
verified_insns, and it was a mistake. After almost 2 years it is still unused
and cannot be removed.
Right, hence also my question on how others observe maps in production, to see
whether there are use cases and approaches similar to ours.
veristat is parsing the non-uapi verifier log.
Tooling can live with non-uapi map stats just fine.
Sure, that's okay.
Thanks,
Daniel