On 02/08/2023 17:33, Alexei Starovoitov wrote: > On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@xxxxxxxxxxxxx> wrote: >> I agree that this is the correct way to generalize this. The only thing >> that we'll have to figure out is how to generalize treating const struct >> cpumask * objects as kptrs. In sched_ext [0] we export >> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to >> return trusted global cpumask kptrs that can then be "released" in >> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and >> exists only to appease the verifier that the trusted cpumask kptrs >> aren't being leaked and are having their references "dropped". > > why is it KF_ACQUIRE ? > I think it can just return a trusted pointer without acquire. > >> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@xxxxxxxxxx/ >> >> I'd imagine that we have 2 ways forward if we want to enable progs to >> fetch other global cpumasks with static lifetimes (e.g. >> __cpu_possible_mask or nohz.idle_cpus_mask): >> >> 1. The most straightforward thing to do would be to add a new kfunc in >> kernel/bpf/cpumask.c that's a drop-in replacment for >> scx_bpf_put_idle_cpumask(): >> >> void bpf_global_cpumask_drop(const struct cpumask *cpumask) >> {} >> >> 2. Another would be to implement something resembling what Yonghong >> suggested in [1], where progs can link against global allocated kptrs >> like: >> >> const struct cpumask *__cpu_possible_mask __ksym; >> >> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@xxxxxxxxx/#t >> >> In my opinion (1) is more straightforward, (2) is a better UX. > > 1 = adding few kfuncs. > 2 = teaching pahole to emit certain global vars. > > nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l > 1998 > > imo BTF increase trade off is acceptable. Agreed, Stephen's numbers on BTF size increase were pretty modest [1]. What was gating that work in my mind was previous discussion around splitting aspects of BTF into a "vmlinux-extra". Experiments with this seemed to show it's hard to support, and worse, tooling would have to learn about its existence. We have to come up with a CONFIG convention about specifying what ends up in -extra versus core vmlinux BTF, what do we do about modules, etc. All feels like over-complication. I think a better path would be to support BTF in a vmlinux BTF module (controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is separately loadable, but puts vmlinux in the same place for tools - /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size for embedded use cases that have come up a few times, and lessens concerns about BTF size for other users, while it all works with existing tooling. I have a basic proof-of-concept but it will take time to hammer into shape. Because variable-related size increases are pretty modest, so should we proceed with the BTF variable support anyway? We can modularize BTF separately later on for those concerned about BTF size. [1] https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@xxxxxxxxxx/