Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes: > On Wed, Sep 7, 2022 at 12:07 PM Stephen Brennan > <stephen.s.brennan@xxxxxxxxxx> wrote: >> >> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes: >> > On Fri, Aug 26, 2022 at 11:54 AM Stephen Brennan >> > <stephen.s.brennan@xxxxxxxxxx> wrote: >> [...] >> >> Future Work >> >> ----------- >> >> >> >> If this proves acceptable, I'd like to follow-up with a kernel patch to >> >> add a configuration option (default=n) for generating BTF with all >> >> variables, which distributions could choose to enable or not. >> >> >> >> There was previous discussion[3] about leveraging split BTF or building >> >> additional kernel modules to contain the extra variables. I believe with >> >> this patch series, it is possible to do that. However, I'd argue that >> >> simpler is better here: the advantage for using BTF is having it all >> >> available in the kernel/module image. Storing extra BTF on the >> >> filesystem would break that advantage, and at that point, you'd be >> >> better off using a debuginfo format like CTF, which is lightweight and >> >> expected to be found on the filesystem. >> > >> > With all or nothing approach the distros would have a hard choice >> > to make whether to enable that kconfig, increase BTF and consume >> > extra memory without any obvious reason or just don't do it. >> > Majority probably is not going to enable it. >> > So the feature will become a single vendor only and with >> > inevitable bit-rot. >> >> I'd intend to support it even if just a single distribution enabled it. >> But I do see your concern. > > This thread was dormant for 8 days. > That's a poor example of "intend to support". You're right, I definitely could have replied sooner. I'm sorry for that. >> > Whereas with split BTF and extra kernel module approach >> > we can enable BTF with all global vars by default. >> > The extra module will be shipped by all distros and tools >> > like bpftrace might start using it. >> >> Split BTF is currently limited to a single base BTF file. We'd need more >> patches for pahole to support multiple --btf_base files: e.g. >> vmlinux.btf and vmlinux-variables.btf. There's also the question of >> modules: presumably we wouldn't try to have "$MODULE" and >> "$MODULE-btf-extra" modules due to the added complexity. I doubt the >> space savings would be worth it. >> >> I can look into these concerns, but if possible I would like to proceed >> with this series, as it is a separate concern from the exact mechanism >> by which we include extra BTF into the kernel. > > Not an option. Sorry. Ok, so let me describe what I understand to be the proposed design as of the previous thread, and see if it satisfies your concerns. We can work from there to make sure we've got a concensus design before going further. Option #1 --------- * A new module, "vmlinux-btf-extra" (or something roughly like that) is added, which contains BTF only. It is generated with --encode_all_btf_vars and uses --btf_base=path/to/vmlinux_btf so that it contains only BTF variables. The vmlinux BTF would be generated same as always (without the --encode_all_btf_vars). * In the previous thread, it was proposed [1] that modules could include variables in their BTF in order to reduce the complexity of the change. Modules would have their BTF generated using --encode_all_btf_vars and --btf_base=path/to/vmlinux_btf. The resulting hierarchy would look like this: vmlinux_btf [ functions and percpu vars only ] |- vmlinux-btf-extra [ all other vars for vmlinux ] |- $MODULE [ functions and all vars ] ... This option is desirable because it means that we only need 2-level split BTF, and so we don't actually need to make changes to pahole for multiple --btf_base files. There are two downsides I see: (a) While we save space on vmlinux BTF, each module will have a bit of extra data for variable types. On my laptop (5.15 based) I have 9.8 MB of BTF, and if you deduct vmlinux, you're still left with 4.7 MB. If we assume the same overhead of 23.7%, that would be 1.1 MB of extra module BTF for my particular use case. $ ls -l /sys/kernel/btf | awk '{sum += $5} END {print(sum)}' 9876871 $ ls -l /sys/kernel/btf/vmlinux -r--r--r-- 1 root root 5174406 Sep 7 14:20 /sys/kernel/btf/vmlinux (b) It's possible for "vmlinux-btf-extras" and "$MODULE" to contain duplicate type definitions, wasting additional space. However, as far as I understand it, this was already a possibility, e.g. $MODULE1 and $MODULE2 could already contain duplicate types. So I think this downside is no more. Option #2 --------- * The vmlinux-btf-extra module is still added as in Option #1. * Further, each module would have its own "$MODULE-btf-extra" module to add in extra BTF. These would be built with a --btf_base=$MODULE.ko and of course that BTF is based on vmlinux, so we would have: vmlinux_btf [ functions and percpu vars only ] |- vmlinux-btf-extras [ all other vars for vmlinux ] |- $MODULE [ functions and percpu vars only ] |- $MODULE-btf-extra [ all other vars for $MODULE ] This is much more complex, pahole must be extended to support a hierarchy of --btf_base files. The kernel itself may not need to understand multi-level BTF since there's no requirement that it actually understand $MODULE-btf-extra, so long as it exposes it via /sys/kernel/btf/$MODULE-btf-extra. I'd also like to see some sort of mechanism to allow an administrator to say "please always load $MODULE-btf-extras alongside $MODULE", but I think that would be a userspace problem. This resolves issue (a) from option #1, of course at implementation cost. Regardless of Option #1 or #2, I'd propose that we implement this as a tristate, similar to what Alan proposed [2]. When set to "m" we use the solutions described above, and when set to "y", we don't bother with it, instead using --encode_all_btf_vars for all generation. If we go with Option #1, no changes to this series should be necessary. If we go with Option #2, I'll need to extend pahole to support at least two BTF base files. Please let me know your thoughts. Stephen [1]: https://lore.kernel.org/bpf/CAEf4BzZmJKqXaJMBxhKqFNXzjO=eN5gk2xQVnmQVdK2xd3HQ=g@xxxxxxxxxxxxxx/ [2]: https://lore.kernel.org/bpf/alpine.LRH.2.23.451.2205032254390.10133@MyRouter/