On 8/1/23 7:29 AM, Yafang Shao wrote:
Some statistic data is stored in a percpu pointer, but the kernel doesn't aggregate it into a single value; the data in struct psi_group_cpu is one example.

Currently, we can traverse percpu data using for_loop and bpf_per_cpu_ptr:

	for_loop(nr_cpus, callback_fn, callback_ctx, 0)

In the callback_fn, we retrieve the percpu pointer with bpf_per_cpu_ptr(). The drawback is that 'nr_cpus' cannot be a variable; otherwise, it is rejected by the verifier. That hinders deployment, since different servers may have a different 'nr_cpus', and using CONFIG_NR_CPUS instead is not ideal.

Alternatively, with the bpf_cpumask family we can obtain a task's cpumask. However, that requires creating a bpf_cpumask, copying the cpumask from the task, and then parsing the CPU IDs from it, which is inefficient, and it might require introducing other kfuncs such as bpf_cpumask_next.

A new bpf helper, bpf_for_each_cpu, is introduced to conveniently traverse percpu data, covering all scenarios. It includes for_each_{possible, present, online}_cpu. The user can also traverse CPUs from a specific task, such as walking the CPUs of a cpuset cgroup when the task is in that cgroup.
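For illustration, a minimal sketch of this callback-based approach, assuming a hypothetical percpu counter 'hits' exposed as a ksym (the names, section, and constant below are illustrative, not from the series):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	#define NR_CPUS_CONST 128 /* must be a build-time constant for the verifier */

	extern u64 hits __ksym; /* hypothetical percpu counter */

	struct sum_ctx {
		u64 sum;
	};

	static long sum_cb(u32 cpu, void *data)
	{
		struct sum_ctx *c = data;
		u64 *p = bpf_per_cpu_ptr(&hits, cpu);

		if (p)
			c->sum += *p;
		return 0; /* 0 means: keep looping */
	}

	SEC("tracepoint/syscalls/sys_enter_getpid")
	int sum_percpu(void *ctx)
	{
		struct sum_ctx c = {};

		/* The loop count must be constant; passing a runtime
		 * nr_cpus here is what the verifier rejects.
		 */
		bpf_loop(NR_CPUS_CONST, sum_cb, &c, 0);
		bpf_printk("sum=%llu", c.sum);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";

The fixed NR_CPUS_CONST is exactly the pain point described above: it over- or under-shoots the real CPU count depending on the server.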
The bpf subsystem has adopted the kfunc approach, so there are no new bpf helpers any more. You need to have a bpf_for_each_cpu kfunc instead.

But I am wondering whether we should use open-coded iterator loops instead, see commit
  06accc8779c1 ("bpf: add support for open-coded iterator loops")

In the kernel, we have a global variable nr_cpu_ids (also in kernel/bpf/helpers.c) which is used in numerous places for per cpu data struct access. I am wondering whether we could have bpf code like:

	extern int nr_cpu_ids __ksym;

	struct bpf_iter_num it;
	int *v, i = 0;

	/* nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS]. */
	bpf_iter_num_new(&it, 1, nr_cpu_ids);
	while ((v = bpf_iter_num_next(&it))) {
		/* access cpu i data */
		i++;
	}
	bpf_iter_num_destroy(&it);

From all existing open-coded iterator loops, it looks like the upper bound has to be a constant. We might need to extend support to a bounded scalar upper bound if it is not already there.
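A fuller sketch of that suggestion, assuming the bounded-scalar upper bound extension lands and reusing the hypothetical percpu counter 'hits' from the sketch above:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	extern int nr_cpu_ids __ksym;	/* existing kernel global */
	extern u64 hits __ksym;		/* hypothetical percpu counter */

	/* kfunc declarations, as in tools/testing/selftests/bpf/bpf_experimental.h */
	extern int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) __ksym;
	extern int *bpf_iter_num_next(struct bpf_iter_num *it) __ksym;
	extern void bpf_iter_num_destroy(struct bpf_iter_num *it) __ksym;

	SEC("tracepoint/syscalls/sys_enter_getpid")
	int sum_percpu_open_coded(void *ctx)
	{
		struct bpf_iter_num it;
		u64 sum = 0;
		int *cpu;

		/* CPU ids run from 0 to nr_cpu_ids - 1. This relies on the
		 * verifier accepting a bounded scalar (nr_cpu_ids) as the
		 * upper bound, the extension discussed above; today the
		 * bound must be a constant.
		 */
		bpf_iter_num_new(&it, 0, nr_cpu_ids);
		while ((cpu = bpf_iter_num_next(&it))) {
			u64 *p = bpf_per_cpu_ptr(&hits, *cpu);

			if (p)
				sum += *p;
		}
		bpf_iter_num_destroy(&it);

		bpf_printk("sum=%llu", sum);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";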
In our use case, we utilize this new helper to traverse percpu psi data. This aids in understanding why the CPU, memory, and IO pressure of a server or a container is high. Due to the __percpu annotation, clang-14+ and pahole-1.23+ are required. (A hedged usage sketch follows the diffstat below.)

Yafang Shao (3):
  bpf: Add bpf_for_each_cpu helper
  cgroup, psi: Init root cgroup psi to psi_system
  selftests/bpf: Add selftest for for_each_cpu

 include/linux/bpf.h                            |   1 +
 include/linux/psi.h                            |   2 +-
 include/uapi/linux/bpf.h                       |  32 +++++
 kernel/bpf/bpf_iter.c                          |  72 +++++++++++
 kernel/bpf/helpers.c                           |   2 +
 kernel/bpf/verifier.c                          |  29 ++++-
 kernel/cgroup/cgroup.c                         |   5 +-
 tools/include/uapi/linux/bpf.h                 |  32 +++++
 .../selftests/bpf/prog_tests/for_each_cpu.c    | 137 +++++++++++++++++++++
 .../selftests/bpf/progs/test_for_each_cpu.c    |  63 ++++++++++
 10 files changed, 372 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each_cpu.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_for_each_cpu.c
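A rough sketch of how the psi use case could look with such a kfunc. The bpf_for_each_cpu declaration, its flags, and the section name below are hypothetical (the real signature is whatever patch 1 defines); psi_system, psi_group->pcpu, and PSI_MEM_SOME are existing kernel symbols:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	extern struct psi_group psi_system __ksym; /* global from kernel/sched/psi.c */

	/* Hypothetical kfunc-style declaration, per the review comment above. */
	extern long bpf_for_each_cpu(void *callback_fn, void *callback_ctx,
				     struct task_struct *task, u64 flags) __ksym;

	struct psi_sum {
		u64 mem_some;
	};

	static long psi_cb(u32 cpu, void *data)
	{
		struct psi_sum *s = data;
		struct psi_group_cpu *gc;

		/* Relies on the __percpu BTF tag on psi_group->pcpu being
		 * visible to the verifier, which is why clang-14+ and
		 * pahole-1.23+ are required.
		 */
		gc = bpf_per_cpu_ptr(psi_system.pcpu, cpu);
		if (gc)
			s->mem_some += gc->times[PSI_MEM_SOME];
		return 0;
	}

	SEC("tracepoint/syscalls/sys_enter_getpid")
	int dump_psi(void *ctx)
	{
		struct psi_sum s = {};

		/* task == NULL, flags == 0: assume "all possible CPUs" semantics */
		bpf_for_each_cpu(psi_cb, &s, NULL, 0);
		bpf_printk("mem some total: %llu", s.mem_some);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";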