On Thu, Sep 5, 2024 at 10:04 PM Viktor Malik <vmalik@xxxxxxxxxx> wrote: > > On 9/4/24 21:07, Andrii Nakryiko wrote: > > On Tue, Sep 3, 2024 at 1:19 PM Alexei Starovoitov > > <alexei.starovoitov@xxxxxxxxx> wrote: > >> > >> On Mon, Sep 2, 2024 at 10:57 PM Viktor Malik <vmalik@xxxxxxxxxx> wrote: > >>> > >>> On 9/2/24 19:01, Alan Maguire wrote: > >>>> On 02/09/2024 07:58, Viktor Malik wrote: > >>>>> TL;DR > >>>>> > >>>>> This adds libbpf support for creating multiple BPF programs having the > >>>>> same instructions using symbol aliases. > >>>>> > >>>>> Context > >>>>> ======= > >>>>> > >>>>> bpftrace has so-called "wildcarded" probes which allow to attach the > >>>>> same program to multple different attach points. For k(u)probes, this is > >>>>> easy to do as we can leverage k(u)probe_multi, however, other program > >>>>> types (fentry/fexit, tracepoints) don't have such features. > >>>>> > >>>>> Currently, what bpftrace does is that it creates a copy of the program > >>>>> for each attach point. This naturally results in a lot of redundant code > >>>>> in the produced BPF object. > >>>>> > >>>>> Proposal > >>>>> ======== > >>>>> > >>>>> One way to address this problem would be to use *symbol aliases*. In > >>>>> short, they allow to have multiple symbol table entries for the same > >>>>> address. In bpftrace, we would create them using llvm::GlobalAlias. In > >>>>> C, it can be achieved using compiler __attribute__((alias(...))): > >>>>> > >>>>> int BPF_PROG(prog) > >>>>> { > >>>>> [...] > >>>>> } > >>>>> int prog_alias() __attribute__((alias("prog"))); > >>>>> > >>>>> When calling bpf_object__open, libbpf is currently able to discover all > >>>>> the programs and internally does a separate copy of the instructions for > >>>>> each aliased program. What libbpf cannot do, is perform relocations b/c > >>>>> it assumes that each instruction belongs to a single program only. The > >>>>> second patch of this series changes relocation collection such that it > >>>>> records relocations for each aliased program. With that, bpftrace can > >>>>> emit just one copy of the full program and an alias for each target > >>>>> attach point. > >>>>> > >>>>> For example, considering the following bpftrace script collecting the > >>>>> number of hits of each VFS function using fentry over a one second > >>>>> period: > >>>>> > >>>>> $ bpftrace -e 'kfunc:vfs_* { @[func] = count() } i:s:1 { exit() }' > >>>>> [...] > >>>>> > >>>>> this change will allow to reduce the size of the in-memory BPF object > >>>>> that bpftrace generates from 60K to 9K. > > > > Tbh, I'm not too keen on adding this aliasing complication just for > > this. It seems a bit too intrusive for something so obscure. > > > > retsnoop doesn't need this and bypasses the issue by cloning with > > bpf_prog_load(), can bpftrace follow the same model? If we need to add > > some getters to bpf_program() to make this easier, that sounds like a > > better long-term investment, API-wise. > > Yes, as I already mentioned, it should be possible to use cloning via > bpf_prog_load(). The aliasing approach just seemed simpler from tool's > perspective - you just emit one alias for each clone and libbpf takes > care of the rest. But I admit that I anticipated the change to be much > simpler than it turned out to be. > > Let me try add the cloning and I'll see if we could add something to > libbpf to make it simpler. > > > > >>>>> > >>>>> For reference, the bpftrace PoC is in [1]. > >>>>> > >>>>> The advantage of this change is that for BPF objects without aliases, it > >>>>> doesn't introduce any overhead. > >>>>> > >>>> > >>>> A few high-level questions - apologies in advance if I'm missing the > >>>> point here. > >>>> > >>>> Could bpftrace use program linking to solve this issue instead? So we'd > >>>> have separate progs for the various attach points associated with vfs_* > >>>> functions, but they would all call the same global function. That > >>>> _should_ reduce the memory footprint of the object I think - or are > >>>> there issues with doing that? > >>> > >>> That's a good suggestion, thanks! We added subprograms to bpftrace only > >>> relatively recently so I didn't really think about this option. I'll > >>> definitely give it a try as it could be even more efficient. > >>> > >>>> I also wonder if aliasing helps memory > >>>> footprint fully, especially if we end up with separate copies of the > >>>> program for relocation purposes; won't we have separate copies in-kernel > >>>> then too? So I _think_ the memory utilization you're concerned about is > >>>> not what's running in the kernel, but the BPF object representation in > >>>> bpftrace; is that right? > >>> > >>> Yes, that is correct. libbpf will create a copy of the program for each > >>> symbol in PROGBITS section that it discovers (including aliases) and the > >>> copies will be loaded into kernel. > >>> > >>> It's mainly the footprint of the BPF object produced by bpftrace that I > >>> was concerned about. (The reason is that we work on ahead-of-time > >>> compilation so it will directly affect the size of the pre-compiled > >>> binaries). But the above solution using global subprograms should reduce > >>> the in-kernel footprint, too, so I'll try to add it and see if it would > >>> work for bpftrace. > >> > >> I think it's a half solution, since prog will be duplicated anyway and > >> loaded multiple times into the kernel. > >> > >> I think it's better to revive this patch sets: > >> > >> v1: > >> https://lore.kernel.org/bpf/20240220035105.34626-1-dongmenglong.8@xxxxxxxxxxxxx/ > >> > >> v2: > >> https://lore.kernel.org/bpf/20240311093526.1010158-1-dongmenglong.8@xxxxxxxxxxxxx/ > >> > >> Unfortunately it went off rails, because we went deep into trying > >> to save bpf trampoline memory too. > >> > > > > +1 to adding multi-attach fentry, even if it will have to duplicate trampolines > > That would resolve fentry but the same problem still exists for (raw) > tracepoints. In bpftrace, you may want to do something like > `tracepoint:syscalls:sys_enter_* { ... }` and at the moment, we still > need to the cloning in such case. There are two problems for fentry: loading and attaching/detaching. Given how fentry programs need to provide different attach_btf_id during *load*, the load is the challenging part that requires cloning. Then there is an attachment/detachment, which is just slow, but not problematic logistically. Multi-fentry would solve/mitigate both. But for tracepoints the load problem doesn't exist. You can load one single instance of a program, and then attach multiple times to different tracepoints. So cloning is not needed for tracepoints *at all*. If you want to know which tracepoint exactly you are processing at runtime, use BPF cookie. If you mean BTF-enabled raw tracepoints, then it's a separate thing (and yes, there is a similar problem to fentry, of course). But particularly for syscalls:sys_enter_*, I suspect the best solution would be to attach generic sys_enter raw tracepoint and filter in bpftrace by syscall number. Then it would be easy load, fast attach, and most probably not worse runtime performance. > > But still, it does solve a part of the problem and could be useful for > bpftrace to reduce the memory footprint for wildcarded fentry probes. > I'll try to find some time to give this a shot (unless someone beats me > to it). > > Viktor > > > > >> I think it can still land. > >> We can load fentry program once and attach it in multiple places > >> with many bpf links. > >> That is still worth doing. > > > > This part I'm confused about, but we can postpone discussion until > > someone actually works on this. > > > >> > >> I think it should solve retsnoop and bpftrace use cases. > >> Not perfect, since there will be many trampolines, but still an improvement. > >> > > > > retsnoop is fine already through bpf_prog_load()-based cloning, but > > faster attachment for fentry/fexit would definitely help. > > >