On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>
> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> >> worst case is to increase ~2.2%.
> >>
> >> What do you think?
> >
> >Well, since I don't understand what you need this for at all, I'm firmly
> >on the side of not doing this.
> >
> >What actual problem is being solved with this meta data nonsense? Why is
> >it worth blowing up our I$ footprint over.
> >
> >Also note, that if you're going to be explaining this, start from
> >scratch, as I have absolutely 0 clues about BPF and such.
>
> I would appreciate such information as well. The idea seems dubious on the surface.

Ok, let me explain it from the beginning. (My English is not good, but
I'll try to describe it as clearly as possible :/)

Many BPF program types depend on the BPF trampoline, such as
BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT, BPF_PROG_TYPE_LSM, etc. The
BPF trampoline is a bridge between a kernel (or BPF) function and a BPF
program, and it acts much like the trampoline that ftrace uses.

Generally speaking, it is used to hook a function, just like what
ftrace does:

  foo:
    endbr
    nop5    --> call trampoline_foo
    xxxx

In short, trampoline_foo can look like this:

  trampoline_foo:
    prepare an array and store the args of foo to the array
    call fentry_bpf1
    call fentry_bpf2
    ......
    call foo+4 (origin call)
    save the return value of foo
    call fexit_bpf1 (this bpf can get the return value of foo)
    call fexit_bpf2
    .......
    return to the caller of foo

We can see that trampoline_foo can only be used for the function foo:
different kernel functions can have different BPF programs attached,
different argument counts, and so on. Therefore, we have to create 1000
BPF trampolines if we want to attach a BPF program to 1000 kernel
functions, and the creation of a BPF trampoline is expensive.
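To make the per-function trampoline concrete, here is a minimal
user-space C sketch of what trampoline_foo above does. Everything in it
is illustrative only (the real trampoline is generated machine code in
the kernel, not C), and all the names are made up for this example:

```c
#include <assert.h>

/* Counters so we can observe the "BPF programs" running. */
static int fentry_hits;
static long last_retval;

/* Stand-ins for the attached BPF programs. */
static void fentry_bpf1(const long *args, int nr_args)
{
	(void)args;
	(void)nr_args;
	fentry_hits++;
}

static void fexit_bpf1(const long *args, int nr_args, long ret)
{
	(void)args;
	(void)nr_args;
	last_retval = ret;	/* fexit programs can see the return value */
}

/* The traced function. */
static long foo(long a, long b)
{
	return a + b;
}

/* trampoline_foo: store the args, run the fentry programs, make the
 * origin call, save the return value, run the fexit programs. */
static long trampoline_foo(long a, long b)
{
	long args[2] = { a, b };
	long ret;

	fentry_bpf1(args, 2);
	ret = foo(a, b);	/* "call foo+4" in the real layout */
	fexit_bpf1(args, 2, ret);
	return ret;
}
```

Note that this trampoline is hard-wired to foo's two arguments and its
fentry/fexit lists, which is exactly why one trampoline per traced
function is needed without per-function metadata.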
According to my testing, it takes more than 1 second to create 100 BPF
trampolines. What's more, it consumes more memory.

If we have per-function metadata support, then we can create a single
global BPF trampoline, like this:

  trampoline_global:
    prepare an array and store the args of foo to the array
    get the metadata by the ip
    call metadata.fentry_bpf1
    call metadata.fentry_bpf2
    ....
    call foo+4 (origin call)
    save the return value of foo
    call metadata.fexit_bpf1 (this bpf can get the return value of foo)
    call metadata.fexit_bpf2
    .......
    return to the caller of foo

(The metadata holds more information for the global trampoline than I
described here.)

Then we don't need to create a trampoline for every kernel function
anymore.

Another beneficiary can be ftrace. For now, all the kernel functions
that are enabled by dynamic ftrace will be added to a filter hash if
there is more than one callback, and a hash lookup happens whenever the
traced functions are called, which has an impact on performance; see
__ftrace_ops_list_func() -> ftrace_ops_test(). With per-function
metadata support, we can store in the metadata whether the callback is
enabled for a given kernel function, which can make the performance
much better.

Per-function metadata storage is a basic facility, and I think there
may be other users that can benefit from it in the future too.

(Hope that I'm describing it clearly :/)

Thanks!
Menglong Dong
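The shape of trampoline_global can also be sketched in user-space C.
This is only a model of the idea: the lookup here is a linear table for
illustration, whereas the proposal stores the metadata next to the
function text and finds it from the ip; all names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

typedef long (*func2_t)(long, long);

/* Per-function metadata: which "BPF programs" are attached. */
struct func_metadata {
	void *ip;				/* traced function */
	void (*fentry)(const long *args);
	void (*fexit)(const long *args, long ret);
};

/* An example traced function and an attached "fexit program". */
static long seen_ret;
static long bar(long a, long b) { return a * b; }
static void bar_fexit(const long *args, long ret)
{
	(void)args;
	seen_ret = ret;
}

/* Toy registry standing in for the real per-function metadata store. */
static struct func_metadata registry[] = {
	{ .ip = (void *)bar, .fentry = NULL, .fexit = bar_fexit },
};

static struct func_metadata *metadata_lookup(void *ip)
{
	for (size_t i = 0; i < sizeof(registry) / sizeof(registry[0]); i++)
		if (registry[i].ip == ip)
			return &registry[i];
	return NULL;
}

/* One trampoline shared by every traced function: get the metadata by
 * the ip, run the attached programs, make the origin call. */
static long trampoline_global(void *ip, long a, long b)
{
	struct func_metadata *md = metadata_lookup(ip);
	long args[2] = { a, b };
	long ret;

	if (md && md->fentry)
		md->fentry(args);
	ret = ((func2_t)ip)(a, b);	/* origin call */
	if (md && md->fexit)
		md->fexit(args, ret);
	return ret;
}
```

Attaching a program to another function then only means adding a
metadata entry, not generating a new trampoline, which is where the
time and memory savings come from.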