On 4/13/20 10:56 PM, Andrii Nakryiko wrote:
On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@xxxxxx> wrote:
Given a loaded dumper bpf program, which already
knows which target it should bind to, there are
two ways to create a dumper:
- a file based dumper under the hierarchy of
/sys/kernel/bpfdump/ which users can
"cat" to print out the output.
- an anonymous dumper from which a user application
can "read" the dumping output.
For file based dumper, BPF_OBJ_PIN syscall interface
is used. For anonymous dumper, BPF_PROG_ATTACH
syscall interface is used.
We discussed this offline with Yonghong a bit, but I thought I'd put
my thoughts about this in writing for completeness. To me, it seems
like the most consistent way to do both anonymous and named dumpers is
through the following steps:
The main motivation for me to use bpf_link is to enumerate
anonymous bpf dumpers by using the idr based link_query mechanism in one
of Andrii's previous RFC patches, so I do not need to re-invent the wheel.
But it looks like there are some difficulties:
1. BPF_PROG_LOAD to load/verify program, that creates program FD.
2. LINK_CREATE using that program FD and direntry FD. This creates
dumper bpf_link (bpf_dumper_link), returns anonymous link FD. If link
The bpf dump program already has the target information as part of
verification, so it does not need a directory FD.
LINK_CREATE is probably not a good fit here.
A bpf dump program is kind of similar to an fentry/fexit program,
where after successful program loading, the program will know
where to attach trampoline.
Looking at the kernel code, for an fentry/fexit program, the trampoline
is installed at the raw_tracepoint_open syscall, and from then on the
bpf program will actually be called.
So, ideally, if we want to use kernel bpf_link, we want to
return a cat-able bpf_link because ultimately we want to query
file descriptors which actually 'read' bpf program outputs.
Current bpf_link is not cat-able.
I tried to hack this by manipulating fops and other stuff; it may work,
but it looks ugly. Or do we create a bpf_catable_link and build an
infrastructure around that? I am not sure whether it is worthwhile for
this one-off thing (bpfdump).
Or, to query anonymous bpf dumpers, I can just write a bpf dump program
that goes through all fd's to find them.
BTW, my current approach (in my private branch) is:
anonymous dumper:
bpf_raw_tracepoint_open(NULL, prog) -> cat-able fd
file dumper:
bpf_obj_pin(prog, path) -> a cat-able file
If you consider the program itself to be a link, this is like what is
described below in 3 and 4.
FD is closed, dumper program is detached and dumper is destroyed
(unless pinned in bpffs, just like with any other bpf_link).
3. At this point bpf_dumper_link can be treated like a factory of
seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
illustration purposes) command, that accepts dumper link FD and
returns a new seq_file FD, which can be read() normally (or, e.g.,
cat'ed from shell).
In this case, link_query may not be accurate if a bpf_dumper_link
is created but there is no corresponding bpf_dumper_open_file. What we
really need is to iterate through all dumper seq_file FDs.
4. Additionally, this anonymous bpf_link can be pinned/mounted in
bpfdumpfs. We can do it as BPF_OBJ_PIN or as a separate command. Once
pinned at, e.g., /sys/fs/bpfdump/task/my_dumper, just opening that
file is equivalent to BPF_DUMPER_OPEN_FILE and will create a new
seq_file that can be read() independently from other seq_files opened
against the same dumper. Pinning bpfdumpfs entry also bumps refcnt of
bpf_link itself, so even if process that created link dies, bpf dumper
stays attached until its bpfdumpfs entry is deleted.
Apart from BPF_DUMPER_OPEN_FILE and open()'ing bpfdumpfs file duality,
it seems pretty consistent and follows safe-by-default auto-cleanup of
anonymous link, unless pinned in bpfdumpfs (or one can still pin
bpf_link in bpffs, but it can't be open()'ed the same way, it just
preserves BPF program from being cleaned up).
Out of all schemes I could come up with, this one seems most unified
and nicely fits into bpf_link infra. Thoughts?
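
To make the lifetime rules in steps 2-4 concrete, here is a toy user-space
model. It is only a sketch and not kernel code: every name in it (toy_link,
toy_link_pin, toy_open_file, ...) is made up for illustration, and whether an
open seq_file should itself hold a link reference is left out because the
proposal above does not say.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; all names are hypothetical, none is a kernel API. */
struct toy_link {
	int refcnt;     /* references held by link FDs and pinned entries */
	bool attached;  /* true while the dumper program stays attached */
};

struct toy_seq_file {
	struct toy_link *link;  /* dumper this session reads from */
	size_t pos;             /* read position private to this session */
};

/* Step 2: creating the link hands back one reference (the link FD). */
static void toy_link_create(struct toy_link *l)
{
	l->refcnt = 1;
	l->attached = true;
}

/* Step 4: pinning takes an extra reference on the link. */
static void toy_link_pin(struct toy_link *l)
{
	l->refcnt++;
}

/* Closing an FD or deleting a pinned entry drops one reference;
 * the last drop detaches the dumper. */
static void toy_link_put(struct toy_link *l)
{
	assert(l->refcnt > 0);
	if (--l->refcnt == 0)
		l->attached = false;
}

/* Step 3: each open yields an independent session starting at 0. */
static struct toy_seq_file toy_open_file(struct toy_link *l)
{
	struct toy_seq_file f = { .link = l, .pos = 0 };
	return f;
}

/* Reading advances only the position of the session being read. */
static size_t toy_read(struct toy_seq_file *f, size_t n)
{
	f->pos += n;
	return f->pos;
}
```

In this model, a link that was pinned once survives the creator dropping its
reference, and two sessions opened from the same link never share a position.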
To make it easy for the target seq_ops->show() to get the
bpf program, dumper creation increases
the target-provided seq_file private data size
so that the bpf program pointer is also stored in seq_file
private data.
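
One way to picture the enlarged private data is a trailer placed after the
target's own area. The sketch below is user-space C under that assumption; the
patch text does not spell out the actual layout, and struct toy_extra and the
toy_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical trailer stored after the target's private data. */
struct toy_extra {
	void *prog;        /* stands in for the bpf program pointer */
	uint64_t seq_num;  /* per-session counter, zeroed at allocation */
};

/* Offset of the trailer: target size rounded up to its alignment. */
static size_t toy_extra_offset(size_t target_size)
{
	size_t a = _Alignof(struct toy_extra);

	return (target_size + a - 1) / a * a;
}

/* Locate the trailer inside an enlarged private area. */
static struct toy_extra *toy_extra_of(void *priv, size_t target_size)
{
	assert(priv != NULL);
	return (struct toy_extra *)((char *)priv +
				    toy_extra_offset(target_size));
}

/* Allocate the target's private data plus room for the trailer. */
static void *toy_alloc_private(size_t target_size, void *prog)
{
	void *priv = calloc(1, toy_extra_offset(target_size) +
			       sizeof(struct toy_extra));

	if (priv)
		toy_extra_of(priv, target_size)->prog = prog;
	return priv;
}
```

The target keeps using the first target_size bytes unchanged, while a show()
callback that knows target_size can reach the program pointer through
toy_extra_of().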
Further, a seq_num which represents how many times
bpf_dump_get_prog() has been called is also
available to the target seq_ops->show().
Such information can be used to e.g., print
a banner before printing out the actual data.
Note the seq_num does not represent the number
of unique kernel objects the bpf program has
seen. But it should be a good approximation.
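
As a rough illustration of the banner use case, a show()-style helper could
key off the counter as below. This is a sketch with a hypothetical toy_show(),
and it assumes the first call sees seq_num == 0, which the text above does not
state.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical show() helper: formats one record into out and
 * prepends a one-line banner only when seq_num is 0 (assumed to
 * be the value seen by the first call). Returns what snprintf()
 * returns, i.e. the length the output needs.
 */
static int toy_show(char *out, size_t cap, uint64_t seq_num, int id)
{
	assert(out != NULL);
	if (seq_num == 0)
		return snprintf(out, cap, "ID\n%d\n", id);
	return snprintf(out, cap, "%d\n", id);
}
```

Because seq_num only approximates the number of objects seen, using it for
anything beyond "is this the first call" would need more care.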
A target feature, BPF_DUMP_SEQ_NET_PRIVATE,
is implemented; it is specifically useful for
net based dumpers. It sets the net namespace
to the current process's net namespace.
This avoids changing existing net seq_ops
in order to retrieve net namespace from
the seq_file pointer.
For open dumper files, anonymous or not, the
fdinfo will show the target and prog_id associated
with that file descriptor. For the dumper file itself,
a kernel interface will be provided to retrieve the
prog_id in one of the later patches.
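
For tooling that wants to consume the fdinfo, a small parser along these lines
might be enough. It is only a sketch: the "prog_id:" key followed by a number
on its own line is an assumption based on the wording above, not a format this
patch documents, and toy_parse_prog_id() is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan fdinfo-style text line by line and return the number that
 * follows a line starting with "prog_id:" (assumed key), or -1 if
 * no such line exists.
 */
static long toy_parse_prog_id(const char *text)
{
	const char *line = text;

	assert(text != NULL);
	while (line && *line) {
		unsigned int id;

		if (sscanf(line, "prog_id: %u", &id) == 1)
			return (long)id;
		line = strchr(line, '\n');  /* move to the next line */
		if (line)
			line++;
	}
	return -1;
}
```

The input would typically be the contents of /proc/<pid>/fdinfo/<fd> read into
a NUL-terminated buffer.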
Signed-off-by: Yonghong Song <yhs@xxxxxx>
---
include/linux/bpf.h | 5 +
include/uapi/linux/bpf.h | 6 +-
kernel/bpf/dump.c | 338 ++++++++++++++++++++++++++++++++-
kernel/bpf/syscall.c | 11 +-
tools/include/uapi/linux/bpf.h | 6 +-
5 files changed, 362 insertions(+), 4 deletions(-)
[...]