On Tue, Apr 14, 2020 at 4:59 PM Yonghong Song <yhs@xxxxxx> wrote:
>
>
> On 4/13/20 10:56 PM, Andrii Nakryiko wrote:
> > On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@xxxxxx> wrote:
> >>
> >> Given a loaded dumper bpf program, which already
> >> knows which target it should bind to, there are
> >> two ways to create a dumper:
> >>   - a file based dumper under the hierarchy of
> >>     /sys/kernel/bpfdump/ which users can
> >>     "cat" to print out the output.
> >>   - an anonymous dumper from which a user application
> >>     can "read" the dumping output.
> >>
> >> For a file based dumper, the BPF_OBJ_PIN syscall interface
> >> is used. For an anonymous dumper, the BPF_PROG_ATTACH
> >> syscall interface is used.
> >
> > We discussed this offline with Yonghong a bit, but I thought I'd put
> > my thoughts about this in writing for completeness. To me, it seems
> > like the most consistent way to do both anonymous and named dumpers is
> > through the following steps:
>
> The main motivation for me to use bpf_link is to enumerate
> anonymous bpf dumpers by using the idr based link_query mechanism from
> one of Andrii's previous RFC patches, so I do not need to re-invent the
> wheel.
>
> But it looks like there are some difficulties:
>
> > 1. BPF_PROG_LOAD to load/verify the program; that creates a program FD.
> > 2. LINK_CREATE using that program FD and direntry FD. This creates a
> > dumper bpf_link (bpf_dumper_link), returns anonymous link FD. If link
>
> The bpf dump program already has the target information as part of
> the verification process, so it does not need a directory FD.
> LINK_CREATE is probably not a good fit here.
>
> The bpf dump program is kind of similar to a fentry/fexit program,
> where after successful program loading the program will know
> where to attach the trampoline.
>
> Looking at the kernel code, for a fentry/fexit program the trampoline
> will be installed at the raw_tracepoint_open syscall, and the bpf
> program will actually be called.

The direntry FD doesn't have to be specified at attach time; I forgot
that it is already provided during load. That wasn't a requirement or a
critical part. I think if we already had a LINK_CREATE command, we'd
never have had to create the RAW_TRACEPOINT_OPEN one; all of them could
be the same command.

> So, ideally, if we want to use the kernel bpf_link, we want to
> return a cat-able bpf_link, because ultimately we want to query
> file descriptors which actually 'read' bpf program outputs.
>
> The current bpf_link is not cat-able.

Let's be precise here. By cat-able you mean that you'd like to just
start issuing read() calls and get the output of the bpfdump program,
is that right? Wouldn't that mean that you can read the output just
once? So it won't be possible to create an anonymous dumper and
periodically get up-to-date output; the user would need to call
RAW_TRACEPOINT_OPEN every single time they need to do a dump. I guess
that would work, but I'm not seeing why it has to be that way.

What I proposed above was that once you create a bpf_link, you can use
that same bpf_link to open many seq_files, each with its own FD, which
can be read() independently of each other (see the sketch below). This
behavior would be consistent with a named bpfdumper, which can produce
many independent seq_files with each new open() syscall, but all from
exactly the same attached bpfdumper.

> I tried to hack this by manipulating fops and other stuff; it may work,
> but it looks ugly. Or we create a bpf_catable_link and build an
> infrastructure around that? Not sure whether it is worthwhile for this
> one-off thing (bpfdump)?
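To make the factory idea concrete (and to show why I don't think we need
a special cat-able link), here is roughly the user-space flow I have in
mind. The wrapper functions and commands below are purely illustrative,
just like the command names in my steps above; none of this is existing
UAPI:

    #include <stdio.h>
    #include <unistd.h>

    /* hypothetical wrappers for the illustrative LINK_CREATE and
     * BPF_DUMPER_OPEN_FILE commands from the steps above
     */
    int bpf_dumper_link_create(int prog_fd);
    int bpf_dumper_open_file(int link_fd);

    static void dump_once(int dump_fd)
    {
            char buf[4096];
            ssize_t n;

            while ((n = read(dump_fd, buf, sizeof(buf))) > 0)
                    fwrite(buf, 1, n, stdout);
            close(dump_fd);
    }

    void anon_dumper_example(int prog_fd)
    {
            /* one attachment; stays alive as long as link_fd (or a pin) does */
            int link_fd = bpf_dumper_link_create(prog_fd);

            /* the link acts as a factory: each call returns a fresh seq_file FD */
            dump_once(bpf_dumper_open_file(link_fd));

            /* later: an up-to-date dump from the same attachment, no re-attach */
            dump_once(bpf_dumper_open_file(link_fd));
    }

The important part is that the attachment (link) is created once, while
seq_file FDs are cheap and per dump session.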
>
> Or, to query anonymous bpf dumpers, I can just write a bpf dump program
> to go through all fd's to find out.
>
> BTW, my current approach (in my private branch):
>   anonymous dumper:
>     bpf_raw_tracepoint_open(NULL, prog) -> cat-able fd

So just to reiterate: if my understanding is correct, this cat-able fd
is a single seq_file. If you want to dump it again, you would call
bpf_raw_tracepoint_open() again?

>   file dumper:
>     bpf_obj_pin(prog, path) -> a cat-able file

While in this case, you'd open() as many times as you need and get a
new cat-able fd for each of those calls.

>
> If you consider the program itself to be a link, this is like what is
> described below in 3 and 4.

A program is not a link, the same way a cgroup BPF program attached
somewhere to a cgroup is not a link: that BPF program can be attached
to multiple cgroups, or even under multiple attach types to the same
cgroup. Same here: the same dumper can be "attached" in bpfdumpfs
multiple times, and each instance of attachment is a link, but it's
still the same program.

> >
> > FD is closed, dumper program is detached and dumper is destroyed
> > (unless pinned in bpffs), just like with any other bpf_link.
> > 3. At this point bpf_dumper_link can be treated like a factory of
> > seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
> > illustration purposes) command, that accepts a dumper link FD and
> > returns a new seq_file FD, which can be read() normally (or, e.g.,
> > cat'ed from the shell).
>
> In this case, link_query may not be accurate if a bpf_dumper_link
> is created but there is no corresponding bpf_dumper_open_file. What we
> really need is to iterate through all dumper seq_file FDs.

If the goal is to iterate all the open seq_files (i.e., active bpfdump
sessions), then bpf_link is clearly not the right approach. But I
thought we were talking about iterating over all the bpfdump program
attachments, not **sessions**, in which case bpf_link is exactly the
right approach.

>
> > 4. Additionally, this anonymous bpf_link can be pinned/mounted in
> > bpfdumpfs. We can do it as BPF_OBJ_PIN or as a separate command. Once
> > pinned at, e.g., /sys/fs/bpfdump/task/my_dumper, just opening that
> > file is equivalent to BPF_DUMPER_OPEN_FILE and will create a new
> > seq_file that can be read() independently from other seq_files opened
> > against the same dumper. Pinning a bpfdumpfs entry also bumps the
> > refcnt of the bpf_link itself, so even if the process that created the
> > link dies, the bpf dumper stays attached until its bpfdumpfs entry is
> > deleted.
> >
> > Apart from the BPF_DUMPER_OPEN_FILE vs. open()'ing a bpfdumpfs file
> > duality, it seems pretty consistent and follows the safe-by-default
> > auto-cleanup of an anonymous link, unless pinned in bpfdumpfs (one can
> > still pin a bpf_link in bpffs, but it can't be open()'ed the same way,
> > it just preserves the BPF program from being cleaned up).
> >
> > Out of all the schemes I could come up with, this one seems the most
> > unified and nicely fits into the bpf_link infra. Thoughts?
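To spell out the named case from point 4 in the same way: the consumer
side is just plain open()/read() on the pinned entry, and every open()
gives an independent seq_file, which is also exactly what cat'ing the
file from the shell does. A small sketch, using the illustrative path
from above:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    void named_dumper_example(void)
    {
            char buf[4096];
            ssize_t n;

            /* path is the illustrative example from point 4 above */
            int fd1 = open("/sys/fs/bpfdump/task/my_dumper", O_RDONLY);
            int fd2 = open("/sys/fs/bpfdump/task/my_dumper", O_RDONLY);

            /* fd1 and fd2 are independent seq_files over the same attached
             * dumper; reading one does not consume the other
             */
            while ((n = read(fd1, buf, sizeof(buf))) > 0)
                    fwrite(buf, 1, n, stdout);

            close(fd1);
            close(fd2);
    }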
> >> To facilitate the target seq_ops->show() getting the
> >> bpf program easily, dumper creation increased
> >> the target-provided seq_file private data size
> >> so the bpf program pointer is also stored in the
> >> seq_file private data.
> >>
> >> Further, a seq_num which represents how many times
> >> bpf_dump_get_prog() has been called is also
> >> available to the target seq_ops->show().
> >> Such information can be used to, e.g., print a
> >> banner before printing out the actual data.
> >>
> >> Note the seq_num does not represent the number
> >> of unique kernel objects the bpf program has
> >> seen, but it should be a good approximation.
> >>
> >> A target feature BPF_DUMP_SEQ_NET_PRIVATE
> >> is implemented; it is specifically useful for
> >> net based dumpers. It sets the net namespace
> >> to the current process's net namespace.
> >> This avoids changing existing net seq_ops
> >> in order to retrieve the net namespace from
> >> the seq_file pointer.
> >>
> >> For open dumper files, anonymous or not, the
> >> fdinfo will show the target and prog_id associated
> >> with that file descriptor. For the dumper file itself,
> >> a kernel interface will be provided to retrieve the
> >> prog_id in one of the later patches.
> >>
> >> Signed-off-by: Yonghong Song <yhs@xxxxxx>
> >> ---
> >>  include/linux/bpf.h            |   5 +
> >>  include/uapi/linux/bpf.h       |   6 +-
> >>  kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
> >>  kernel/bpf/syscall.c           |  11 +-
> >>  tools/include/uapi/linux/bpf.h |   6 +-
> >>  5 files changed, 362 insertions(+), 4 deletions(-)
> >>
> >
> > [...]
> >
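One aside on the seq_num/banner part of the commit message above, mostly
to check my understanding: I picture a target's show() using the stored
program and seq_num roughly like this. The helper signature here is my
guess for illustration only, not the actual one from the patch:

    #include <linux/bpf.h>
    #include <linux/seq_file.h>

    /* sketch only; bpf_dump_get_prog()'s real signature in the patch may
     * differ, this just illustrates the seq_num/banner idea
     */
    static int my_target_show(struct seq_file *seq, void *v)
    {
            u64 seq_num;
            struct bpf_prog *prog = bpf_dump_get_prog(seq, &seq_num);

            /* seq_num == 0 means this is the first object of this dump
             * session, so a header/banner can be emitted here
             */
            if (seq_num == 0)
                    seq_puts(seq, "OBJECT    FIELD1    FIELD2\n");

            /* ... run prog against the current object and emit its output ... */
            return 0;
    }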