Re: [RFC PATCH bpf-next v2 00/17] bpf: implement bpf based dumping of kernel data structures

Alan Maguire <alan.maguire@xxxxxxxxxx> · Fri, 17 Apr 2020 16:02:09 +0100 (BST)

On Wed, 15 Apr 2020, Yonghong Song wrote:

> 
> 
> On 4/15/20 7:23 PM, David Ahern wrote:
> > On 4/15/20 1:27 PM, Yonghong Song wrote:
> >>
> >> As there has been some discussion regarding the kernel interface/steps to
> >> create file/anonymous dumpers, I think it will help the discussion
> >> to post this work in progress.
> >>
> >> Motivation:
> >>    The current ways to dump kernel data structures are mostly:
> >>      1. /proc system
> >>      2. various specific tools like "ss" which require kernel support.
> >>      3. drgn
> >>    The drawback of the first two is that whenever you want to dump more,
> >>    you
> >>    need to change the kernel. For example, Martin wants to dump socket local
> > 
> > If kernel support is needed for bpfdump of kernel data structures, you
> > are not really solving the kernel support problem. i.e., to dump
> > ipv4_route's you need to modify the relevant proc show function.
> 
> Yes, as mentioned two paragraphs below, a kernel change is required.
> The tradeoff is that this is a one-time investment. Once the kernel change
> is in place, printing new fields (in most cases, except new fields
> which need additional locks etc.) needs no further kernel change.
>

One thing I struggled with initially when reading the cover
letter was understanding how BPF dumper programs get run.
Patch 7 deals with that I think, and the answer seems to be to
add seq file infrastructure alongside the existing one, which
executes the BPF dumper programs where appropriate.
Have I got this right? I guess more lightweight methods
such as instrumenting functions associated with an existing /proc
dumper are a bit too messy?

Thanks!

Alan

> > 
> > 
> >>    storage with "ss". Kernel change is needed for it to work ([1]).
> >>    This is also the direct motivation for this work.
> >>
> >>    drgn ([2]) solves this problem nicely and no kernel change is needed.
> >>    But since drgn is not able to verify the validity of a particular
> >>    pointer value,
> >>    it might present the wrong results in rare cases.
> >>
> >>    In this patch set, we introduce bpf based dumping. Initial kernel
> >>    changes are
> >>    still needed, but a data structure change will not require kernel
> >>    changes
> >>    any more. bpf program itself is used to adapt to new data structure
> >>    changes. This will give certain flexibility with guaranteed correctness.
> >>
> >>    Here, kernel seq_ops is used to facilitate dumping, similar to current
> >>    /proc and many other lossless kernel dumping facilities.
> >>
> >> User Interfaces:
> >>    1. A new mount file system, bpfdump at /sys/kernel/bpfdump is
> >>    introduced.
> >>       Different from /sys/fs/bpf, this is a single user mount. Mount
> >>       command
> >>       can be:
> >>          mount -t bpfdump bpfdump /sys/kernel/bpfdump
> >>    2. Kernel bpf dumpable data structures are represented as directories
> >>       under /sys/kernel/bpfdump, e.g.,
> >>         /sys/kernel/bpfdump/ipv6_route/
> >>         /sys/kernel/bpfdump/netlink/
> > 
> > The names of bpfdump fs entries do not match actual data structure names
> > - e.g., there is no ipv6_route struct. On the one hand that is a good
> > thing since structure names can change, but that also means a mapping is
> > needed between the dumper filesystem entries and what you get for context.
> 
> Yes, the later bpftool patch implements a new command to dump such
> information.
> 
>   $ bpftool dumper show target
>   target                  prog_ctx_type
>   task                    bpfdump__task
>   task/file               bpfdump__task_file
>   bpf_map                 bpfdump__bpf_map
>   ipv6_route              bpfdump__ipv6_route
>   netlink                 bpfdump__netlink
> 
> in vmlinux.h generated by vmlinux BTF, we have
> 
> struct bpf_dump_meta {
>         struct seq_file *seq;
>         u64 session_id;
>         u64 seq_num;
> };
> 
> struct bpfdump__ipv6_route {
>         struct bpf_dump_meta *meta;
>         struct fib6_info *rt;
> };
> 
> Here, bpfdump__ipv6_route is the bpf program context type.
> Users can write their bpf programs based on it.
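
To make the context type above concrete, here is a rough userspace
mock-up of what a per-object dumper could look like. It is only a
sketch under stated assumptions: struct seq_file and struct fib6_info
are stand-ins (the real ones come from vmlinux.h), mock_seq_emit() is a
hypothetical replacement for whatever printing helper the patch set
provides, and the NULL check on ctx->rt is purely defensive. Only the
two bpfdump structs are taken from the thread.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Stand-in for the kernel's seq_file: just a fixed output buffer. */
struct seq_file {
	char buf[256];
	size_t len;
};

/* Stand-in for the kernel struct; only one field is mocked here. */
struct fib6_info {
	uint32_t fib6_metric;
};

/* The two context structs as quoted from vmlinux.h in this thread. */
struct bpf_dump_meta {
	struct seq_file *seq;
	uint64_t session_id;
	uint64_t seq_num;
};

struct bpfdump__ipv6_route {
	struct bpf_dump_meta *meta;
	struct fib6_info *rt;
};

/* Hypothetical helper: append one formatted line to the mock seq_file. */
static void mock_seq_emit(struct seq_file *seq, uint64_t seq_num,
			  uint32_t metric)
{
	size_t room = sizeof(seq->buf) - seq->len;
	int n = snprintf(seq->buf + seq->len, room, "%llu: metric %u\n",
			 (unsigned long long)seq_num, (unsigned int)metric);

	if (n > 0 && (size_t)n < room)
		seq->len += (size_t)n;
	else
		seq->buf[seq->len] = '\0';	/* drop the partial line */
}

/* The "dumper": invoked once per object with the context type above. */
static int dump_ipv6_route(struct bpfdump__ipv6_route *ctx)
{
	if (!ctx->rt)			/* defensive; see assumptions */
		return 0;

	mock_seq_emit(ctx->meta->seq, ctx->meta->seq_num,
		      ctx->rt->fib6_metric);
	return 0;
}
```

Printing another field of the object would then only touch the dumper
body, which is the "one-time kernel investment" point made earlier.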
> 
> > 
> > Further, what is the expectation in terms of stable API for these fs
> > entries? Entries in the context can change. Data structure names can
> > change. Entries in the structs can change. All of that breaks the idea
> > of stable programs that are compiled once and run for all future
> > releases. When structs change, those programs will break - and
> > structures will change.
> 
> Yes, the API (ctx) we present to the bpf program is indeed unstable.
> CO-RE should help to a certain extent, but if some fields are gone,
> for example, the bpf program will need to be rewritten for that
> particular kernel version, or the kernel bpfdump infrastructure can be
> enhanced to change its ctx structure to give the program more
> information for that kernel version. In summary, I agree with you that
> this is an unstable API, similar to other tracing programs,
> since it accesses kernel internal data structures.
> 
> > 
> > What does bpfdumper provide that you can not do with a tracepoint on a
> > relevant function and then putting a program on the tracepoint? ie., why
> > not just put a tracepoint in the relevant dump functions.
> 
> When I first began exploring bpfdump, a kprobe on the "show" function
> was one of the options. But we quickly realized that we do not actually
> want to just piggyback on the "show" function, but want to replace it
> with bpf. This will be useful in the following use cases:
>   1. first catable dumper file, similar to /proc/net/ipv6_route,
>      we want /sys/kernel/bpfdump/ipv6_route/my_dumper and you can cat
>      to get it.
> 
>      Using kprobe when you are doing `cat /proc/net/ipv6_route`
>      is complicated.  You probably need an application which
>      runs through `cat /proc/net/ipv6_route` and discards its output,
>      and at the same time gets the result from the bpf program
>      (filtered by pid, since somebody else may run
>      `cat /proc/net/ipv6_route` at the same time). You may use
>      a perf ring buffer to send the result back to the application.
> 
>      Note that the perf ring buffer may lose records for whatever
>      reason, whereas seq_ops are implemented not to lose records,
>      using built-in retries.
> 
>      Using kprobe approach above is complicated and for each dumper
>      you need an application. We would like it to be just catable
>      with minimum user overhead to create such a dumper.
> 
>   2. second, anonymous dumper: kprobe/tracepoint will still incur the
>      original overhead of seq_printf per object, but the user may
>      be interested in only a very small portion of the information.
>      In such cases, a bpf program doing the filtering directly in
>      the kernel can potentially speed things up a lot if there are
>      many records to traverse.
> 
>   3. for data structures which do not have catable dumpers,
>      for example task: hopefully, as demonstrated in this patch set,
>      the kernel implementation and writing a bpf program are not
>      too hard. This especially enables people to do in-kernel
>      filtering, which is the strength of bpf.
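
The "retry instead of drop" point in item 1 can be illustrated in plain
userspace C. This is not the kernel's seq_file code, only a sketch of
the idea as I understand it: when a record does not fit, the buffer is
grown and the same record is formatted again, rather than being
discarded the way a full ring buffer would discard it. format_record()
and emit_without_loss() are made-up names for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical formatter; like snprintf, returns the length needed. */
static int format_record(char *buf, size_t size, int id)
{
	return snprintf(buf, size,
			"record %d: some fairly long line of text\n", id);
}

/*
 * Format one record, doubling the buffer and retrying until it fits.
 * Returns a malloc'd string (caller frees) or NULL on failure, and
 * reports how many retries were needed.
 */
static char *emit_without_loss(int id, size_t initial_size, int *retries)
{
	size_t size = initial_size ? initial_size : 1;

	*retries = 0;
	for (;;) {
		char *buf = malloc(size);
		int n;

		if (!buf)
			return NULL;

		n = format_record(buf, size, id);
		if (n >= 0 && (size_t)n < size)
			return buf;	/* the whole record fit */

		free(buf);
		if (n < 0)
			return NULL;	/* formatting error, give up */

		size *= 2;		/* too small: grow and retry */
		(*retries)++;
	}
}
```

The cost of this approach is repeated formatting work for large
records; the benefit is that the reader never sees a missing record.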
> 
> 
>