On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>
> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> >> worst case is to increase ~2.2%.
> >>
> >> What do you think?
> >
> >Well, since I don't understand what you need this for at all, I'm firmly
> >on the side of not doing this.
> >
> >What actual problem is being solved with this meta data nonsense? Why is
> >it worth blowing up our I$ footprint over.
> >
> >Also note, that if you're going to be explaining this, start from
> >scratch, as I have absolutely 0 clues about BPF and such.
>
> I would appreciate such information as well. The idea seems dubious on the surface.

Ok, let me explain it from the beginning. (My English is not good, but
I'll try to describe it as clearly as possible :/)

Many BPF program types depend on the BPF trampoline, such as
BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT, BPF_PROG_TYPE_LSM, etc. The
BPF trampoline is a bridge between a kernel (or BPF) function and a BPF
program, and it acts much like the trampoline that ftrace uses.

Generally speaking, it is used to hook a function, just like what
ftrace does:

  foo:
    endbr
    nop5    --> call trampoline_foo
    xxxx

In short, trampoline_foo can look like this:

  trampoline_foo:
    prepare an array and store the args of foo to the array
    call fentry_bpf1
    call fentry_bpf2
    ......
    call foo+4 (origin call)
    save the return value of foo
    call fexit_bpf1 (this bpf can get the return value of foo)
    call fexit_bpf2
    .......
    return to the caller of foo

We can see that trampoline_foo can only be used for the function foo:
different kernel functions can have different BPF programs attached,
different argument counts, and so on. Therefore, we have to create 1000
BPF trampolines if we want to attach a BPF program to 1000 kernel
functions, and the creation of a BPF trampoline is expensive.
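To make the per-function trampoline concrete, here is a minimal
user-space C sketch of what trampoline_foo above does. Everything in it
is illustrative only (the real trampoline is generated machine code in
the kernel, not C), and all the names are made up for this example:

```c
#include <assert.h>

/* Counters so we can observe the "BPF programs" running. */
static int fentry_hits;
static long last_retval;

/* Stand-ins for the attached BPF programs. */
static void fentry_bpf1(const long *args, int nr_args)
{
	(void)args;
	(void)nr_args;
	fentry_hits++;
}

static void fexit_bpf1(const long *args, int nr_args, long ret)
{
	(void)args;
	(void)nr_args;
	last_retval = ret;	/* fexit programs can see the return value */
}

/* The traced function. */
static long foo(long a, long b)
{
	return a + b;
}

/* trampoline_foo: store the args, run the fentry programs, make the
 * origin call, save the return value, run the fexit programs. */
static long trampoline_foo(long a, long b)
{
	long args[2] = { a, b };
	long ret;

	fentry_bpf1(args, 2);
	ret = foo(a, b);	/* "call foo+4" in the real layout */
	fexit_bpf1(args, 2, ret);
	return ret;
}
```

Note that this trampoline is hard-wired to foo's two arguments and its
fentry/fexit lists, which is exactly why one trampoline per traced
function is needed without per-function metadata.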
According to my testing, it takes more than 1 second to create 100 BPF
trampolines. What's more, it consumes more memory.

If we have per-function metadata support, then we can create a single
global BPF trampoline, like this:

  trampoline_global:
    prepare an array and store the args of foo to the array
    get the metadata by the ip
    call metadata.fentry_bpf1
    call metadata.fentry_bpf2
    ....
    call foo+4 (origin call)
    save the return value of foo
    call metadata.fexit_bpf1 (this bpf can get the return value of foo)
    call metadata.fexit_bpf2
    .......
    return to the caller of foo

(The metadata holds more information for the global trampoline than I
described here.)

Then we don't need to create a trampoline for every kernel function
anymore.

Another beneficiary can be ftrace. For now, all the kernel functions
that are enabled by dynamic ftrace will be added to a filter hash if
there is more than one callback, and a hash lookup happens whenever the
traced functions are called, which has an impact on performance; see
__ftrace_ops_list_func() -> ftrace_ops_test(). With per-function
metadata support, we can store in the metadata whether the callback is
enabled for a given kernel function, which can make the performance
much better.

Per-function metadata storage is a basic facility, and I think there
may be other users that can benefit from it in the future too.

(Hope that I'm describing it clearly :/)

Thanks!
Menglong Dong
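The shape of trampoline_global can also be sketched in user-space C.
This is only a model of the idea: the lookup here is a linear table for
illustration, whereas the proposal stores the metadata next to the
function text and finds it from the ip; all names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

typedef long (*func2_t)(long, long);

/* Per-function metadata: which "BPF programs" are attached. */
struct func_metadata {
	void *ip;				/* traced function */
	void (*fentry)(const long *args);
	void (*fexit)(const long *args, long ret);
};

/* An example traced function and an attached "fexit program". */
static long seen_ret;
static long bar(long a, long b) { return a * b; }
static void bar_fexit(const long *args, long ret)
{
	(void)args;
	seen_ret = ret;
}

/* Toy registry standing in for the real per-function metadata store. */
static struct func_metadata registry[] = {
	{ .ip = (void *)bar, .fentry = NULL, .fexit = bar_fexit },
};

static struct func_metadata *metadata_lookup(void *ip)
{
	for (size_t i = 0; i < sizeof(registry) / sizeof(registry[0]); i++)
		if (registry[i].ip == ip)
			return &registry[i];
	return NULL;
}

/* One trampoline shared by every traced function: get the metadata by
 * the ip, run the attached programs, make the origin call. */
static long trampoline_global(void *ip, long a, long b)
{
	struct func_metadata *md = metadata_lookup(ip);
	long args[2] = { a, b };
	long ret;

	if (md && md->fentry)
		md->fentry(args);
	ret = ((func2_t)ip)(a, b);	/* origin call */
	if (md && md->fexit)
		md->fexit(args, ret);
	return ret;
}
```

Attaching a program to another function then only means adding a
metadata entry, not generating a new trampoline, which is where the
time and memory savings come from.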