[RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Wed, 9 Mar 2022 12:27:15 -0800

Hey everyone,

I would like to discuss an idea to facilitate collection of
hierarchical cgroup stats using BPF programs. We want to provide a
simple interface for BPF programs to collect hierarchical cgroup stats
and integrate with the existing rstat aggregation mechanism in the
kernel. The most prominent use case is the ability to extend memcg
stats (and histograms) by BPF programs.

This also integrates nicely with Hao's work [1] that enables reading
those stats through files, similar to cgroupfs. This proposal is more
concerned with the stats collection path.

The main idea is to introduce a new map type (let's call it BPF cgroup
stats map for now). This map will be keyed by cgroup_id (similar to
cgroup storage). The value is an array (or struct, more on this later)
whose size and element type are chosen by the user, and which holds
the stats. The main properties of the map are as follows:
1. Map entry creation and deletion are handled automatically by the kernel.
2. Internally, the map entries contain per-cpu arrays, a total array,
and a pending array.
3. BPF programs & user space see the entry as a single array; updates
are transparently made to the per-cpu arrays, and lookups invoke stats
flushing.

The main difference between this and cgroup storage is that it
naturally integrates with rstat hierarchical aggregation (more on that
later). The reasons why we do not want to do aggregation in BPF
programs or in user space are:
1. Each program would loop through the cgroup descendants to do its
own stats aggregation, which is a lot of repeated work.
2. We would loop through all the descendants, even those that do not
have updates.

These problems are already addressed by the rstat aggregation
mechanism in the kernel, which is primarily used for memcg stats. We
want to provide a way for BPF programs to be able to make use of this
as well.

The lifetime of map entries can be handled as follows:
- When the map is created, it gets as a parameter an initial
cgroup_id, maybe through the map_extra field of union bpf_attr. The
map is created and entries for the initial cgroup and all its
descendants are created.
- The update and delete interfaces are disabled. The kernel creates
entries for new cgroups and removes entries for destroyed cgroups (we
can use cgroup_bpf_inherit() and cgroup_bpf_release()).
- When all the entries in the map are deleted (initial cgroup
destroyed), the map is destroyed.
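
A rough userspace model of the lifetime rules above, only to make the
intended behavior concrete. Everything here (struct model_map and the
map_on_*/map_user_* functions) is a hypothetical name invented for
illustration, not an existing kernel interface, and the -1 return for
disabled operations is an assumption since no error code has been
chosen:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ENTRIES 16  /* arbitrary bound for this model */

/* Hypothetical model of the map: just the set of cgroup ids with entries. */
struct model_map {
    uint64_t ids[MODEL_MAX_ENTRIES];
    size_t nr;
    bool destroyed;
};

static bool map_has(const struct model_map *m, uint64_t id)
{
    for (size_t i = 0; i < m->nr; i++)
        if (m->ids[i] == id)
            return true;
    return false;
}

/* Kernel-side hook when a cgroup appears (e.g. from cgroup_bpf_inherit()). */
static void map_on_cgroup_create(struct model_map *m, uint64_t id)
{
    assert(!m->destroyed && m->nr < MODEL_MAX_ENTRIES);
    if (!map_has(m, id))
        m->ids[m->nr++] = id;
}

/* Kernel-side hook when a cgroup goes away (e.g. from cgroup_bpf_release()). */
static void map_on_cgroup_release(struct model_map *m, uint64_t id)
{
    for (size_t i = 0; i < m->nr; i++) {
        if (m->ids[i] == id) {
            m->nr--;
            m->ids[i] = m->ids[m->nr];  /* swap-remove */
            break;
        }
    }
    if (m->nr == 0)
        m->destroyed = true;  /* last entry gone: the map goes with it */
}

/* User-facing update and delete are disabled (error value is an assumption). */
static int map_user_update(struct model_map *m, uint64_t id)
{
    (void)m; (void)id;
    return -1;
}

static int map_user_delete(struct model_map *m, uint64_t id)
{
    (void)m; (void)id;
    return -1;
}
```

The model ignores the hierarchy itself; it only shows that entries
follow cgroup creation and destruction and that the map disappears
with its last entry.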

The map usage by BPF programs and integration with rstat can be as follows:
- Internally, each map entry has per-cpu arrays, a total array, and a
pending array. BPF programs and user space only see one array.
- The update interface is disabled. BPF programs use helpers to modify
elements. Internally, the modifications are made to the per-cpu arrays,
and the helpers invoke cgroup_bpf_updated() or an equivalent.
- Lookups (from BPF programs or user space) invoke an rstat flush and
read from the total array.
- cgroup_rstat_flush_locked() flushes BPF stats as well.
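
To illustrate the split between the internal layout and the single
array that callers see, here is a minimal userspace sketch of one map
entry. The names (struct stats_entry, entry_add(), entry_lookup()) and
the fixed CPU count are assumptions made for the example; real per-cpu
data and locking are left out entirely:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NR_CPUS 4   /* assumption: fixed CPU count for the model */
#define MODEL_NR_STATS 3  /* user-chosen number of array elements */

/* Hypothetical internal layout of one entry (one per cgroup). */
struct stats_entry {
    uint64_t cgroup_id;                              /* map key */
    int64_t percpu[MODEL_NR_CPUS][MODEL_NR_STATS];   /* written by updates */
    int64_t total[MODEL_NR_STATS];                   /* read by lookups */
    int64_t pending[MODEL_NR_STATS];                 /* pushed up by children */
};

static void entry_init(struct stats_entry *e, uint64_t cgroup_id)
{
    memset(e, 0, sizeof(*e));
    e->cgroup_id = cgroup_id;
}

/* Update helper: only the calling CPU's slot is touched. */
static void entry_add(struct stats_entry *e, int cpu, int idx, int64_t delta)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    assert(idx >= 0 && idx < MODEL_NR_STATS);
    e->percpu[cpu][idx] += delta;
}

/* Fold pending and per-cpu values into the total array. */
static void entry_flush(struct stats_entry *e)
{
    for (int i = 0; i < MODEL_NR_STATS; i++) {
        int64_t delta = e->pending[i];

        e->pending[i] = 0;
        for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
            delta += e->percpu[cpu][i];
            e->percpu[cpu][i] = 0;  /* simplification for the model */
        }
        e->total[i] += delta;
    }
}

/* Lookup: flush first, then read from the total array. */
static int64_t entry_lookup(struct stats_entry *e, int idx)
{
    assert(idx >= 0 && idx < MODEL_NR_STATS);
    entry_flush(e);
    return e->total[idx];
}
```

For example, after entry_add(&e, 0, 1, 5) and entry_add(&e, 3, 1, 7),
the total array is still untouched until entry_lookup(&e, 1) runs the
flush and returns 12.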

Flushing of BPF stats can be as follows:
- For every cgroup, we will either use flags to distinguish BPF stats
updates from normal stats updates, or flush both anyway (memcg stats
are periodically flushed anyway).
- We will need to link cgroups to the maps that have entries for them.
One possible implementation here is to store the map entries in struct
cgroup_bpf in a hash table indexed by map fd. The update helpers will
also use this to avoid lookups.
- For each updated cgroup, we go through all of its maps, accumulate
per-cpu arrays to the total array, then propagate total to the
parent’s pending array (same mechanism as memcg stats flushing).
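
The flushing steps above can be sketched as a small userspace model.
This is only an approximation under stated assumptions: struct cg_node
and the cg_* functions are made-up names, a single "updated" flag
stands in for rstat's per-cpu updated tree, and the value pushed to
the parent is the newly accumulated delta, which is how memcg flushing
avoids double counting:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CPUS 2
#define MODEL_STATS 2

/* Hypothetical per-cgroup state for one stats map. */
struct cg_node {
    struct cg_node *parent, *first_child, *next_sibling;
    bool updated;  /* has unflushed updates in its subtree */
    int64_t percpu[MODEL_CPUS][MODEL_STATS];
    int64_t total[MODEL_STATS];
    int64_t pending[MODEL_STATS];
};

static void cg_link(struct cg_node *child, struct cg_node *parent)
{
    child->parent = parent;
    child->next_sibling = parent->first_child;
    parent->first_child = child;
}

/* Update path: bump the per-cpu slot, mark the node and its ancestors. */
static void cg_stat_add(struct cg_node *cg, int cpu, int idx, int64_t delta)
{
    assert(cpu >= 0 && cpu < MODEL_CPUS && idx >= 0 && idx < MODEL_STATS);
    cg->percpu[cpu][idx] += delta;
    for (struct cg_node *n = cg; n && !n->updated; n = n->parent)
        n->updated = true;
}

/* Flush path: children first, then fold deltas and push them to the parent. */
static void cg_flush(struct cg_node *cg)
{
    if (!cg->updated)
        return;  /* subtrees without updates are skipped entirely */

    for (struct cg_node *c = cg->first_child; c; c = c->next_sibling)
        cg_flush(c);

    for (int i = 0; i < MODEL_STATS; i++) {
        int64_t delta = cg->pending[i];

        cg->pending[i] = 0;
        for (int cpu = 0; cpu < MODEL_CPUS; cpu++) {
            delta += cg->percpu[cpu][i];
            cg->percpu[cpu][i] = 0;  /* simplification for the model */
        }
        cg->total[i] += delta;
        if (cg->parent)
            cg->parent->pending[i] += delta;
    }
    cg->updated = false;
}
```

Flushing a subtree leaves the delta parked in the parent's pending
array, and a later flush higher up picks it up without walking the
already-flushed descendants again.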

There is room for extensions or generalizations here:
- Provide flags to enable/disable using per-cpu arrays (for stats that
are not updated frequently), and enable/disable hierarchical
aggregation (for non-hierarchical stats, which can still benefit from
the automatic entry creation & deletion).
- Provide different hierarchical aggregation operations: SUM, MAX, MIN, etc.
- Instead of an array as the map value, use a struct, and let the user
provide an aggregator function in the form of a BPF program.
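
As a sketch of what selectable aggregation operations could look
like, the snippet below folds a cgroup's own value with the already
aggregated values of its children. The enum and function names are
hypothetical and nothing here is a proposed UAPI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-map aggregation mode. */
enum agg_op { AGG_SUM, AGG_MAX, AGG_MIN };

/* Combine an accumulator with one more value according to the mode. */
static int64_t agg_combine(enum agg_op op, int64_t acc, int64_t val)
{
    switch (op) {
    case AGG_SUM:
        return acc + val;
    case AGG_MAX:
        return val > acc ? val : acc;
    case AGG_MIN:
        return val < acc ? val : acc;
    }
    assert(0 && "unknown aggregation op");
    return acc;
}

/* Fold a cgroup's own value with its children's aggregated values. */
static int64_t agg_children(enum agg_op op, int64_t own,
                            const int64_t *children, size_t n)
{
    int64_t acc = own;

    for (size_t i = 0; i < n; i++)
        acc = agg_combine(op, acc, children[i]);
    return acc;
}
```

One thing to keep in mind is that SUM can be propagated as deltas,
while MAX and MIN cannot be undone by a delta, so they would likely
need the children's current values at flush time.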

I am happy to hear your thoughts about the idea in general and any
comments or concerns.

[1] https://lwn.net/Articles/886292/