Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu

Alan Maguire <alan.maguire@xxxxxxxxxx> · Thu, 3 Aug 2023 17:10:57 +0100

On 03/08/2023 16:22, Yonghong Song wrote:
> 
> 
> On 8/3/23 1:21 AM, Alan Maguire wrote:
>> On 02/08/2023 17:33, Alexei Starovoitov wrote:
>>> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@xxxxxxxxxxxxx> wrote:
>>>> I agree that this is the correct way to generalize this. The only thing
>>>> that we'll have to figure out is how to generalize treating const
>>>> struct
>>>> cpumask * objects as kptrs. In sched_ext [0] we export
>>>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
>>>> return trusted global cpumask kptrs that can then be "released" in
>>>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
>>>> exists only to appease the verifier that the trusted cpumask kptrs
>>>> aren't being leaked and are having their references "dropped".
>>>
>>> why is it KF_ACQUIRE ?
>>> I think it can just return a trusted pointer without acquire.
>>>
>>>> [0]:
>>>> https://lore.kernel.org/all/20230711011412.100319-13-tj@xxxxxxxxxx/
>>>>
>>>> I'd imagine that we have 2 ways forward if we want to enable progs to
>>>> fetch other global cpumasks with static lifetimes (e.g.
>>>> __cpu_possible_mask or nohz.idle_cpus_mask):
>>>>
>>>> 1. The most straightforward thing to do would be to add a new kfunc in
>>>>     kernel/bpf/cpumask.c that's a drop-in replacment for
>>>>     scx_bpf_put_idle_cpumask():
>>>>
>>>> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
>>>> {}
>>>>
>>>> 2. Another would be to implement something resembling what Yonghong
>>>>     suggested in [1], where progs can link against global allocated
>>>> kptrs
>>>>     like:
>>>>
>>>> const struct cpumask *__cpu_possible_mask __ksym;
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@xxxxxxxxx/#t
>>>>
>>>> In my opinion (1) is more straightforward, (2) is a better UX.
>>>
>>> 1 = adding few kfuncs.
>>> 2 = teaching pahole to emit certain global vars.
>>>
>>> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
>>> 1998
>>>
>>> imo BTF increase trade off is acceptable.
>>
>> Agreed, Stephen's numbers on BTF size increase were pretty modest [1].
>>
>> What was gating that work in my mind was previous discussion around
>> splitting aspects of BTF into a "vmlinux-extra". Experiments with this
>> seemed to show it's hard to support, and worse, tooling would have to
>> learn about its existence. We have to come up with a CONFIG convention
>> about specifying what ends up in -extra versus core vmlinux BTF, what do
>> we do about modules, etc. All feels like over-complication.
>>
>> I think a better path would be to support BTF in a vmlinux BTF module
>> (controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is
>> separately loadable, but puts vmlinux in the same place for tools -
>> /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size
>> for embedded use cases that have come up a few times, and lessens
>> concerns about BTF size for other users, while it all works with
>> existing tooling. I have a basic proof-of-concept but it will take time
>> to hammer into shape.
>>
>> Because variable-related size increases are pretty modest, so should we
>> proceed with the BTF variable support anyway? We can modularize BTF
>> separately later on for those concerned about BTF size.
> 
> Alan, it seems a consensus has reached that we should include
> global variables (excluding special kernel made ones like
> __SCK_ and __tracepoint_) in vmlinux BTF.
> please go ahead and propose a patch. Thanks!
>

Sounds good! Stephen and I will hopefully have something ready soon;
the changes are mostly in dwarves plus a kernel selftest. And
good idea on the filtering; this will eliminate a lot of unneeded
variables and cut down on size overheads. Thanks!

Alan

>>
>> [1]
>> https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@xxxxxxxxxx/