On 6/22/23 11:53 AM, Anton Protopopov wrote:
> This series adds a mechanism for maps to populate per-cpu counters of
> elements on insertions/deletions. The sum of these counters can be
> accessed by a new kfunc from a map iterator program.
>
> The following patches are present in the series:
>
>   * Patch 1 adds a generic per-cpu counter to struct bpf_map
>   * Patch 2 utilizes this mechanism for hash-based maps
>   * Patch 3 extends the preloaded map iterator to dump the sum
>   * Patch 4 adds a self-test for the change
>
> The reason for adding this functionality in our case (Cilium) is to get
> signals about how full some heavily used maps are and what the actual
> dynamic profile of map capacity is. In the case of LRU maps it is
> impossible to get this information any other way. See also [1].
>
> This is a v2 for
> https://lore.kernel.org/bpf/20230531110511.64612-1-aspsk@xxxxxxxxxxxxx/T/#t
> It was rewritten according to previous comments.
>
> I've turned this series into an RFC for two reasons:
>
>   1) This patch only works on systems where this_cpu_{inc,dec} is atomic
>      for s64. Systems which might write s64 non-atomically would require
>      some locking mechanism to prevent readers from reading trash via the
>      bpf_map_sum_elements_counter() kfunc (see patch 1).
>
>   2) In comparison with v1, we're adding extra instructions per map
>      operation (for preallocated maps as well as for non-preallocated
>      maps). The only functionality we're interested in at the moment is
>      the number of elements present in a map, not per-cpu statistics.
>      This could be better achieved by using the v1 version, which only
>      adds computations for preallocated maps.
>
> So, the question is: wouldn't it be fine to do the changes in the
> following way:
>
>   * extend the preallocated hash maps to populate per-cpu batch counters
>     as in v1
>   * add a kfunc as in v2 to get the current sum
>
> This works because:
>
>   * nobody at the moment actually requires per-cpu statistics
>   * this implementation can be transparently turned into per-cpu
>     statistics if such a need occurs in practice (the only thing to
>     change would be to re-implement the kfunc and, maybe, add more
>     kfuncs to get per-cpu stats)
>   * the "v1 way" is the least intrusive: it only affects preallocated
>     maps, as other maps already provide the required functionality
>
> [1] https://lpc.events/event/16/contributions/1368/
>
> v1 -> v2:
>   - make the counters a generic part of struct bpf_map
>   - don't use map_info and /proc/self/fdinfo in favor of a kfunc
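For illustration, a minimal sketch of what a bpf_map iterator program using the proposed bpf_map_sum_elements_counter() kfunc could look like; the extern prototype and the printed format are assumptions for this sketch, not taken from the series itself:

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch only: dump the summed element counter for every map via a
 * bpf_map iterator.  The extern declaration below is an assumed
 * prototype for the kfunc added in patch 1 of the series. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* assumed kfunc prototype (patch 1) */
extern s64 bpf_map_sum_elements_counter(struct bpf_map *map) __ksym;

SEC("iter/bpf_map")
int dump_elem_count(struct bpf_iter__bpf_map *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct bpf_map *map = ctx->map;

	if (!map)
		return 0;

	/* sum of the per-cpu insert/delete counters for this map */
	BPF_SEQ_PRINTF(seq, "map id %u: %lld elements\n",
		       map->id, bpf_map_sum_elements_counter(map));
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Reading the pinned iterator (as the preloaded maps.debug iterator extended in patch 3 would do) then prints one such line per map.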
Tbh, I did like the v1 approach a bit better. We are trying to bend over backwards just so that we don't add things to uapi, but in the end we are also adding it to maps.debug, etc (yes, it has .debug in the name and all) ...

> for maps can be extended to print the element count, so the user can have
> convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.

... I feel that with such convenient access, some folks in the wild might end up building apps that parse it instead of writing iterators.

I would be curious to hear from Google and Meta folks how you observe map fill-up or other useful map-related stats in production (unless you don't watch out for this?), and whether there is a common denominator we could come up with that would be really useful for all parties and thus potentially make sense for fdinfo or as an extensible bpf_map_info stats extension, in case there is some agreement?

In our case we don't really care about exact numbers, more about whether a user is scaling up their k8s cluster to the point where the previous sizing configuration, e.g. around LB mappings, would hit the limits in the foreseeable future if they dial up further. This is exported to dashboards (e.g. grafana) for visibility so operators can know ahead of time.

The iterator approach is okay, just less convenient, but if there is indeed nothing in common from other production users that has helped map visibility over the years, then so be it. I thought to at least ask given we have this thread. :)

Thanks,
Daniel