Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu

Alan Maguire <alan.maguire@xxxxxxxxxx> · Thu, 3 Aug 2023 09:21:19 +0100

On 02/08/2023 17:33, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@xxxxxxxxxxxxx> wrote:
>> I agree that this is the correct way to generalize this. The only thing
>> that we'll have to figure out is how to generalize treating const struct
>> cpumask * objects as kptrs. In sched_ext [0] we export
>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
>> return trusted global cpumask kptrs that can then be "released" in
>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
>> exists only to appease the verifier that the trusted cpumask kptrs
>> aren't being leaked and are having their references "dropped".
> 
> why is it KF_ACQUIRE ?
> I think it can just return a trusted pointer without acquire.
> 
>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@xxxxxxxxxx/
>>
>> I'd imagine that we have 2 ways forward if we want to enable progs to
>> fetch other global cpumasks with static lifetimes (e.g.
>> __cpu_possible_mask or nohz.idle_cpus_mask):
>>
>> 1. The most straightforward thing to do would be to add a new kfunc in
>>    kernel/bpf/cpumask.c that's a drop-in replacment for
>>    scx_bpf_put_idle_cpumask():
>>
>> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
>> {}
>>
>> 2. Another would be to implement something resembling what Yonghong
>>    suggested in [1], where progs can link against global allocated kptrs
>>    like:
>>
>> const struct cpumask *__cpu_possible_mask __ksym;
>>
>> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@xxxxxxxxx/#t
>>
>> In my opinion (1) is more straightforward, (2) is a better UX.
> 
> 1 = adding few kfuncs.
> 2 = teaching pahole to emit certain global vars.
> 
> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
> 1998
> 
> imo BTF increase trade off is acceptable.

Agreed, Stephen's numbers on BTF size increase were pretty modest [1].

What was gating that work in my mind was previous discussion around
splitting aspects of BTF into a "vmlinux-extra". Experiments with this
seemed to show it's hard to support, and worse, tooling would have to
learn about its existence. We have to come up with a CONFIG convention
about specifying what ends up in -extra versus core vmlinux BTF, what do
we do about modules, etc. All feels like over-complication.

I think a better path would be to support BTF in a vmlinux BTF module
(controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is
separately loadable, but puts vmlinux in the same place for tools -
/sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size
for embedded use cases that have come up a few times, and lessens
concerns about BTF size for other users, while it all works with
existing tooling. I have a basic proof-of-concept but it will take time
to hammer into shape.

Because variable-related size increases are pretty modest, so should we
proceed with the BTF variable support anyway? We can modularize BTF
separately later on for those concerned about BTF size.

[1]
https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@xxxxxxxxxx/