Re: [RFC bpf-next 0/3] libbpf: Add support for aliased BPF programs

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 6 Sep 2024 10:37:42 -0700

On Thu, Sep 5, 2024 at 10:04 PM Viktor Malik <vmalik@xxxxxxxxxx> wrote:
>
> On 9/4/24 21:07, Andrii Nakryiko wrote:
> > On Tue, Sep 3, 2024 at 1:19 PM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> >>
> >> On Mon, Sep 2, 2024 at 10:57 PM Viktor Malik <vmalik@xxxxxxxxxx> wrote:
> >>>
> >>> On 9/2/24 19:01, Alan Maguire wrote:
> >>>> On 02/09/2024 07:58, Viktor Malik wrote:
> >>>>> TL;DR
> >>>>>
> >>>>> This adds libbpf support for creating multiple BPF programs having the
> >>>>> same instructions using symbol aliases.
> >>>>>
> >>>>> Context
> >>>>> =======
> >>>>>
> >>>>> bpftrace has so-called "wildcarded" probes which allow to attach the
> >>>>> same program to multple different attach points. For k(u)probes, this is
> >>>>> easy to do as we can leverage k(u)probe_multi, however, other program
> >>>>> types (fentry/fexit, tracepoints) don't have such features.
> >>>>>
> >>>>> Currently, what bpftrace does is that it creates a copy of the program
> >>>>> for each attach point. This naturally results in a lot of redundant code
> >>>>> in the produced BPF object.
> >>>>>
> >>>>> Proposal
> >>>>> ========
> >>>>>
> >>>>> One way to address this problem would be to use *symbol aliases*. In
> >>>>> short, they allow to have multiple symbol table entries for the same
> >>>>> address. In bpftrace, we would create them using llvm::GlobalAlias. In
> >>>>> C, it can be achieved using compiler __attribute__((alias(...))):
> >>>>>
> >>>>>     int BPF_PROG(prog)
> >>>>>     {
> >>>>>         [...]
> >>>>>     }
> >>>>>     int prog_alias() __attribute__((alias("prog")));
> >>>>>
> >>>>> When calling bpf_object__open, libbpf is currently able to discover all
> >>>>> the programs and internally does a separate copy of the instructions for
> >>>>> each aliased program. What libbpf cannot do, is perform relocations b/c
> >>>>> it assumes that each instruction belongs to a single program only. The
> >>>>> second patch of this series changes relocation collection such that it
> >>>>> records relocations for each aliased program. With that, bpftrace can
> >>>>> emit just one copy of the full program and an alias for each target
> >>>>> attach point.
> >>>>>
> >>>>> For example, considering the following bpftrace script collecting the
> >>>>> number of hits of each VFS function using fentry over a one second
> >>>>> period:
> >>>>>
> >>>>>     $ bpftrace -e 'kfunc:vfs_* { @[func] = count() } i:s:1 { exit() }'
> >>>>>     [...]
> >>>>>
> >>>>> this change will allow to reduce the size of the in-memory BPF object
> >>>>> that bpftrace generates from 60K to 9K.
> >
> > Tbh, I'm not too keen on adding this aliasing complication just for
> > this. It seems a bit too intrusive for something so obscure.
> >
> > retsnoop doesn't need this and bypasses the issue by cloning with
> > bpf_prog_load(), can bpftrace follow the same model? If we need to add
> > some getters to bpf_program() to make this easier, that sounds like a
> > better long-term investment, API-wise.
>
> Yes, as I already mentioned, it should be possible to use cloning via
> bpf_prog_load(). The aliasing approach just seemed simpler from tool's
> perspective - you just emit one alias for each clone and libbpf takes
> care of the rest. But I admit that I anticipated the change to be much
> simpler than it turned out to be.
>
> Let me try add the cloning and I'll see if we could add something to
> libbpf to make it simpler.
>
> >
> >>>>>
> >>>>> For reference, the bpftrace PoC is in [1].
> >>>>>
> >>>>> The advantage of this change is that for BPF objects without aliases, it
> >>>>> doesn't introduce any overhead.
> >>>>>
> >>>>
> >>>> A few high-level questions - apologies in advance if I'm missing the
> >>>> point here.
> >>>>
> >>>> Could bpftrace use program linking to solve this issue instead? So we'd
> >>>> have separate progs for the various attach points associated with vfs_*
> >>>> functions, but they would all call the same global function. That
> >>>> _should_ reduce the memory footprint of the object I think - or are
> >>>> there issues with doing that?
> >>>
> >>> That's a good suggestion, thanks! We added subprograms to bpftrace only
> >>> relatively recently so I didn't really think about this option. I'll
> >>> definitely give it a try as it could be even more efficient.
> >>>
> >>>> I also wonder if aliasing helps memory
> >>>> footprint fully, especially if we end up with separate copies of the
> >>>> program for relocation purposes; won't we have separate copies in-kernel
> >>>> then too? So I _think_ the memory utilization you're concerned about is
> >>>> not what's running in the kernel, but the BPF object representation in
> >>>> bpftrace; is that right?
> >>>
> >>> Yes, that is correct. libbpf will create a copy of the program for each
> >>> symbol in PROGBITS section that it discovers (including aliases) and the
> >>> copies will be loaded into kernel.
> >>>
> >>> It's mainly the footprint of the BPF object produced by bpftrace that I
> >>> was concerned about. (The reason is that we work on ahead-of-time
> >>> compilation so it will directly affect the size of the pre-compiled
> >>> binaries). But the above solution using global subprograms should reduce
> >>> the in-kernel footprint, too, so I'll try to add it and see if it would
> >>> work for bpftrace.
> >>
> >> I think it's a half solution, since prog will be duplicated anyway and
> >> loaded multiple times into the kernel.
> >>
> >> I think it's better to revive this patch sets:
> >>
> >> v1:
> >> https://lore.kernel.org/bpf/20240220035105.34626-1-dongmenglong.8@xxxxxxxxxxxxx/
> >>
> >> v2:
> >> https://lore.kernel.org/bpf/20240311093526.1010158-1-dongmenglong.8@xxxxxxxxxxxxx/
> >>
> >> Unfortunately it went off rails, because we went deep into trying
> >> to save bpf trampoline memory too.
> >>
> >
> > +1 to adding multi-attach fentry, even if it will have to duplicate trampolines
>
> That would resolve fentry but the same problem still exists for (raw)
> tracepoints. In bpftrace, you may want to do something like
> `tracepoint:syscalls:sys_enter_* { ... }` and at the moment, we still
> need to the cloning in such case.

There are two problems for fentry: loading and attaching/detaching.
Given how fentry programs need to provide different attach_btf_id
during *load*, the load is the challenging part that requires cloning.

Then there is an attachment/detachment, which is just slow, but not
problematic logistically. Multi-fentry would solve/mitigate both.

But for tracepoints the load problem doesn't exist. You can load one
single instance of a program, and then attach multiple times to
different tracepoints. So cloning is not needed for tracepoints *at
all*.

If you want to know which tracepoint exactly you are processing at
runtime, use BPF cookie.

If you mean BTF-enabled raw tracepoints, then it's a separate thing
(and yes, there is a similar problem to fentry, of course).

But particularly for syscalls:sys_enter_*, I suspect the best solution
would be to attach generic sys_enter raw tracepoint and filter in
bpftrace by syscall number. Then it would be easy load, fast attach,
and most probably not worse runtime performance.

>
> But still, it does solve a part of the problem and could be useful for
> bpftrace to reduce the memory footprint for wildcarded fentry probes.
> I'll try to find some time to give this a shot (unless someone beats me
> to it).
>
> Viktor
>
> >
> >> I think it can still land.
> >> We can load fentry program once and attach it in multiple places
> >> with many bpf links.
> >> That is still worth doing.
> >
> > This part I'm confused about, but we can postpone discussion until
> > someone actually works on this.
> >
> >>
> >> I think it should solve retsnoop and bpftrace use cases.
> >> Not perfect, since there will be many trampolines, but still an improvement.
> >>
> >
> > retsnoop is fine already through bpf_prog_load()-based cloning, but
> > faster attachment for fentry/fexit would definitely help.
> >
>