On Fri, Oct 29, 2021 at 9:12 AM Mauricio Vásquez Bernal <mauricio@xxxxxxxxxx> wrote: > > > Tracing progs will be peeking into task_struct. > > To describe it in the reduced BTF most of the kernel types would be needed, > > That's the point of our algorithm. If a program is only accessing > 'pid' on 'task_struct' we'll generate a BTF representation of > task_struct that only contains the 'pid' member with the right offset, > other members are not included and hence we don't need to carry on all > those types that are not used by the program. > > > Have you considered generating kernel BTF with fields that are accessed > > by bpf prog only and replacing all other fields with padding ? > > Yeah. We're implicitly doing it as described above. > > > I think the algo would be quite different from the actual CO-RE logic > > you're trying to reuse. > > I'm not 100% sure it's so easy to do without reimplementing much of > the actual CO-RE logic. We don't want to copy all type members but > only the ones actually used. So I'm not sure if Andrii's idea of > performing the matching based only on the type name will work. I'll > try to get deep into the details and will be back to you soon. No, now that I understand what exactly you are doing, it won't work. But ok, I gave it quite a bit of thought and tried to find a good compromise between completely exposing all the libbpf internals as public APIs (which is a huge price I'm not willing to accept) and actually allowing you to achieve your goal (which I think is worthy to achieve). But first. https://github.com/aquasecurity/btfhub/tree/main/tools is awesome. Great work explaining a lot at exactly the right level of technical details. It would be great if you published it as a dedicated blog post, maybe splitting the more general information from the BTF minimization problem. Both are useful, but it's too long for one article. Great job, really! Now, to the problem at hand. And sorry for a long reply, but there is quite a bit of things to unpack. I see this overall problem as two distinct problems: 1. Knowing which fields and types libbpf is using from kernel BTF. Basically, observe CO-RE logic from outside. 2. Knowing information from #1, minimize BTF. Part #2 absolutely doesn't belong in libbpf. Libbpf exposes enough BTF constructing APIs to implement this in any application, bpftool or otherwise. It's also a relatively straightforward problem: mark used types and fields, create a copy of BTF with only those types and fields. So let's talk about #1, because I agree that it's extremely painful to have to reimplement most of CO-RE logic just for getting the list of types and fields. Here we have two sub-problems (assuming we let libbpf do CO-RE relocation logic for us): a) perform CO-RE relocations but don't create BPF maps and load BPF programs. Dry-run of sorts. b) exposing calculated relocation information. First, 1a. The problem right now is that CO-RE relocations (and relocations in general) happen in the same bpf_object__load() phase and it's not possible to do them without creating maps and subsequently loading BPF programs first. This is very suboptimal. I've actually been thinking in the background how to improve that situation, because even with the recent bpf_program__insns() API, added a few days ago, you still have to load the BPF program to be able to clone the BPF program, and so I wanted to solve that for a long while now. So how about we split bpf_object__load() phase into two: bpf_object__prepare() and bpf_object__load(). prepare() will do all the preparations (subprogs, ELF relos, also almost complete BPF map relos, but not quite; I'll get to this), basically everything that load does today short of actually creating maps and progs. load() then will ensure that bpf_object__prepare() was called (i.e., user can do just prepare(), prepare() + load() explicitly, or load() which will call prepare() implicitly). The biggest problem I see right now is what we do about BPF map relocations. I propose to perform map relocations but substitute map's internal index (so that if someone dumps prog instructions after prepare(), it's still meaningful, even if not yet validatable by verifier). After maps are actually created, we can do another quick pass over just RELO_DATA and replace map_idx with map's fd. It feels like we should split load further into two steps: creating and pinning maps (+ finalizing FDs in instructions) and actually loading programs. Again, with the same implicit calling of prepare and map creation steps if the user only calls bpf_object__load(). For ease of use and backwards compatibility. So basically, the most granular set of steps would be: 1. bpf_object__open() 2. bpf_object__prepare() (or bpf_object__relocate(), naming is hard) 3. bpf_object__create_maps(); 4. bpf_object__load(). But the old and trusty bpf_object__open() + bpf_object__load() will work just as well, with load() doing steps #2-#4 automatically, if necessary. So what does that split gives us. Few things: - it's possible to "simulate" almost all libbpf logic without changing the state of the system (no maps or progs created). Yet you still validate that kconfig externs, per-cpu externs, CO-RE relos, map relos, etc, all that works. - libbpf can store extra information between steps 1, 2, 3, and 4, but after step #4 all that extra information can be discarded and cleaned up. So advanced users will get access to stuff like bpf_program__insns() and CO-RE information, but most users won't have to pay for that because it will be freed after bpf_object__load(). So in this case, bpf_program__insns() could (should?) be discarded after actual load, because if you care about instructions, you can do steps #1-#3, grab instructions and copy them, if necessary. Then proceed to #4 (or not) and free the memory. The last point is important, because to solve the problem 1b (exposing CO-RE relo info), the best way to minimize public API commitments is to (optionally, probably) request libbpf to record its CO-RE relo decisions. Here's what I propose, specifically: 1. Add something like "bool record_core_relo_info" (awful name, don't use it) in bpf_object_open_opts. 2. If it is set to true, libbpf will keep a "log" of CO-RE relocation decisions, recording stuff like program name, instruction index, local spec (i.e., root_type_id, spec string, relo kind, maybe something else), target spec (kernel type_id, kernel spec string, also module ID, if it's not vmlinux BTF). We can also record relocated value (i.e., field offset, actual enum value, true/false for existence, etc). All these are stable concepts, so I'd feel more comfortable exposing them, compared to stuff like bpf_core_accessor and other internal details. 3. The memory for all that will be managed by libbpf for simplicity of an API, and we'll expose accessors to get those arrays (at object level or per-program level is TBD). 4. This info will be available after the prepare() step and will be discarded either at create_maps() or load(). I think that solves problem #1 completely (at least for BTFGen case) and actually provides more useful information. E.g., if someone wants to track CO-RE resolution logic for some other reason, they should have pretty full information (BTFGen doesn't need all of that). It also doesn't feel like too much complication to libbpf internals (even though we'll need to be very careful with BPF map relos and double-checking that we haven't missed any other critical part of the process). And for most users there is no change in API or behavior. And given this gives us a "framework" for more sustainable libbpf "observability", I think it's a justified price to pay, overall. I still need to sleep on this, but this feels like a bit cleaner way forward. Thoughts are welcome. > > > If CO-RE matching style is necessary and it's the best approach then please > > add new logic to bpftool. I'm not sure such api would be > > useful beyond this particular case to expose as stable libbpf api. > > I agree 100%. Our goal is to have this available on bpftool so all the > community can use it. However it'd be a shame if we can't use some of > the existing logic in libbpf.