On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > >
> > > In kernel, we have a global variable
> > > nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > which is used in numerous places for per cpu data struct access.
> > >
> > > I am wondering whether we could have bpf code like
> > > int nr_cpu_ids __ksym;

I think this would be useful in general, though any __ksym variable like
this would have to be const and mapped in .rodata, right? But yeah, being
able to R/O map global variables like this which have static lifetimes
would be nice.

It's not quite the same thing as nr_cpu_ids, but FWIW, you could
accomplish something close to it by doing something like this in your
BPF prog:

/* Set in user space to libbpf_num_possible_cpus() */
const volatile __u32 nr_cpus;

...

__u32 i;
bpf_for(i, 0, nr_cpus)
	bpf_printk("Iterating over cpu %u", i);

...

> > > struct bpf_iter_num it;
> > > int i = 0;
> > >
> > > // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> > > bpf_iter_num_new(&it, 1, nr_cpu_ids);
> > > while ((v = bpf_iter_num_next(&it))) {
> > >         /* access cpu i data */
> > >         i++;
> > > }
> > > bpf_iter_num_destroy(&it);
> > >
> > > From all existing open coded iterator loops, looks like
> > > upper bound has to be a constant. We might need to extend support
> > > to bounded scalar upper bound if not there.
> >
> > Currently the upper bound is required by both the open-coded for-loop
> > and the bpf_loop. I think we can extend it.
> >
> > It can't handle the cpumask case either.
> >
> > for_each_cpu(cpu, mask)
> >
> > In the 'mask', the CPU IDs might not be continuous. In our container
> > environment, we always use the cpuset cgroup for some critical tasks,
> > but it is not so convenient to traverse the percpu data of this cpuset
> > cgroup. We have to do it as follows for this case :
> >
> > That's why we prefer to introduce a bpf_for_each_cpu helper.
> > It is fine if it can be implemented as a kfunc.
>
> I think open-coded-iterators is the only acceptable path forward here.
> Since existing bpf_iter_num doesn't fit due to sparse cpumask,
> let's introduce bpf_iter_cpumask and few additional kfuncs
> that return cpu_possible_mask and others.

I agree that this is the correct way to generalize this. The only thing
we'll have to figure out is how to generalize treating
const struct cpumask * objects as kptrs. In sched_ext [0] we export
scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
return trusted global cpumask kptrs that can then be "released" in
scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty, and
exists only to appease the verifier that the trusted cpumask kptrs
aren't being leaked and are having their references "dropped".

[0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@xxxxxxxxxx/

I'd imagine that we have 2 ways forward if we want to enable progs to
fetch other global cpumasks with static lifetimes (e.g.
__cpu_possible_mask or nohz.idle_cpus_mask):

1. The most straightforward thing to do would be to add a new kfunc in
   kernel/bpf/cpumask.c that's a drop-in replacement for
   scx_bpf_put_idle_cpumask():

   void bpf_global_cpumask_drop(const struct cpumask *cpumask) {}

2. Another would be to implement something resembling what Yonghong
   suggested in [1], where progs can link against globally allocated
   kptrs like:

   const struct cpumask *__cpu_possible_mask __ksym;

[1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@xxxxxxxxx/#t

In my opinion (1) is more straightforward, but (2) is a better UX. Note
again that both approaches only work for cpumasks with static lifetimes.
I can't think of a way to treat dynamically allocated
struct cpumask * objects as kptrs, as there's nowhere to put a reference.
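For illustration, prog-side usage of (2) combined with the proposed
iterator could look something like the sketch below. Note that
bpf_iter_cpumask does not exist yet, so the bpf_iter_cpumask_{new,next,
destroy} names and signatures here are hypothetical, just modeled on the
existing bpf_iter_num pattern quoted above:

```c
/* Hypothetical sketch -- bpf_iter_cpumask and its kfuncs do not exist
 * yet; names and signatures are guesses modeled on bpf_iter_num. */
extern const struct cpumask *__cpu_possible_mask __ksym;

struct bpf_iter_cpumask it;
int *cpu;

bpf_iter_cpumask_new(&it, __cpu_possible_mask);
while ((cpu = bpf_iter_cpumask_next(&it))) {
	/* access per-cpu data for CPU *cpu; only CPUs actually set
	 * in the mask are visited, so sparse masks work naturally */
}
bpf_iter_cpumask_destroy(&it);
```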
If someone wants to track a dynamically allocated cpumask, they'd have
to create a kptr out of its container object, and then pass that
object's cpumask as a const struct cpumask * to other BPF cpumask kfuncs
(including e.g. the proposed iterator).

> We already have some cpumask support in kernel/bpf/cpumask.c
> bpf_iter_cpumask will be a natural follow up.

Yes, this should be easy to add.

- David