On Wed, 15 Apr 2020, Yonghong Song wrote:

>
>
> On 4/15/20 7:23 PM, David Ahern wrote:
> > On 4/15/20 1:27 PM, Yonghong Song wrote:
> >>
> >> As there are some discussions regarding the kernel interface/steps to
> >> create file/anonymous dumpers, I think it will be beneficial for
> >> discussion with this work in progress.
> >>
> >> Motivation:
> >>   The current ways to dump kernel data structures are mostly:
> >>     1. /proc system
> >>     2. various specific tools like "ss" which require kernel support.
> >>     3. drgn
> >>   The drawback of the first two is that whenever you want to dump more,
> >>   you need to change the kernel. For example, Martin wants to dump socket
> >>   local
> >
> > If kernel support is needed for bpfdump of kernel data structures, you
> > are not really solving the kernel support problem. i.e., to dump
> > ipv4_route's you need to modify the relevant proc show function.
>
> Yes, as mentioned two paragraphs below, a kernel change is required.
> The tradeoff is that this is a one-time investment. Once the kernel change
> is in place, printing new fields (in most cases, except new fields
> which need additional locks etc.) needs no further kernel change.
>

One thing I struggled with initially when reading the cover letter was
understanding how BPF dumper programs get run. Patch 7 deals with that,
I think, and the answer seems to be to add seq file infrastructure
alongside the existing one, which executes the BPF dumper programs
where appropriate. Have I got this right? I guess more lightweight
methods such as instrumenting functions associated with an existing
/proc dumper are a bit too messy?

Thanks!

Alan

> >
> >
> >>   storage with "ss". Kernel change is needed for it to work ([1]).
> >>   This is also the direct motivation for this work.
> >>
> >>   drgn ([2]) solves this problem nicely and no kernel change is needed.
> >>   But since drgn is not able to verify the validity of a particular
> >>   pointer value, it might present the wrong results in rare cases.
> >>
> >>   In this patch set, we introduce bpf based dumping. Initial kernel
> >>   changes are still needed, but a data structure change will not
> >>   require kernel changes any more. The bpf program itself is used to
> >>   adapt to new data structure changes. This will give certain
> >>   flexibility with guaranteed correctness.
> >>
> >>   Here, kernel seq_ops is used to facilitate dumping, similar to current
> >>   /proc and many other lossless kernel dumping facilities.
> >>
> >> User Interfaces:
> >>   1. A new mount file system, bpfdump at /sys/kernel/bpfdump, is
> >>      introduced. Different from /sys/fs/bpf, this is a single user
> >>      mount. Mount command can be:
> >>        mount -t bpfdump bpfdump /sys/kernel/bpfdump
> >>   2. Kernel bpf dumpable data structures are represented as directories
> >>      under /sys/kernel/bpfdump, e.g.,
> >>        /sys/kernel/bpfdump/ipv6_route/
> >>        /sys/kernel/bpfdump/netlink/
> >
> > The names of bpfdump fs entries do not match actual data structure names
> > - e.g., there is no ipv6_route struct. On the one hand that is a good
> > thing since structure names can change, but that also means a mapping is
> > needed between the dumper filesystem entries and what you get for context.
>
> Yes, the later bpftool patch implements a new command to dump such
> information.
>
>   $ bpftool dumper show target
>   target                  prog_ctx_type
>   task                    bpfdump__task
>   task/file               bpfdump__task_file
>   bpf_map                 bpfdump__bpf_map
>   ipv6_route              bpfdump__ipv6_route
>   netlink                 bpfdump__netlink
>
> In vmlinux.h generated by vmlinux BTF, we have
>
>   struct bpf_dump_meta {
>         struct seq_file *seq;
>         u64 session_id;
>         u64 seq_num;
>   };
>
>   struct bpfdump__ipv6_route {
>         struct bpf_dump_meta *meta;
>         struct fib6_info *rt;
>   };
>
> Here, bpfdump__ipv6_route is the bpf program context type.
> Users can write the bpf program based on this.
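
To make the role of the context type concrete, here is a minimal
userspace mock of what a dumper program shaped around
bpfdump__ipv6_route might look like. It is a sketch under stated
assumptions, not the patch set's code: struct seq_file and
struct fib6_info are reduced stand-ins, mock_seq_printf() and
dump_ipv6_route() are hypothetical names, and treating a NULL rt as
"nothing to print" is an assumption made only for this illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's struct seq_file: a fixed output buffer. */
struct seq_file {
	char buf[256];
	size_t count;
};

/* Same layout as the bpf_dump_meta quoted from vmlinux.h above. */
struct bpf_dump_meta {
	struct seq_file *seq;
	uint64_t session_id;
	uint64_t seq_num;
};

/* Heavily reduced stand-in for struct fib6_info (illustrative fields). */
struct fib6_info {
	uint32_t fib6_metric;
	uint32_t fib6_flags;
};

struct bpfdump__ipv6_route {
	struct bpf_dump_meta *meta;
	struct fib6_info *rt;
};

/* Hypothetical helper: append formatted text, fail if it does not fit. */
static int mock_seq_printf(struct seq_file *seq, const char *fmt, ...)
{
	size_t room = sizeof(seq->buf) - seq->count;
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(seq->buf + seq->count, room, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= room)
		return -1;
	seq->count += (size_t)n;
	return 0;
}

/* Shape of a dumper program: invoked once per object with the ctx. */
static int dump_ipv6_route(struct bpfdump__ipv6_route *ctx)
{
	struct seq_file *seq = ctx->meta->seq;

	/* Assumption for this mock: no object means nothing to print. */
	if (ctx->rt == NULL)
		return 0;

	/* Print a header before the first object of the session. */
	if (ctx->meta->seq_num == 0 &&
	    mock_seq_printf(seq, "metric   flags\n") < 0)
		return -1;

	return mock_seq_printf(seq, "%08x %08x\n",
			       (unsigned int)ctx->rt->fib6_metric,
			       (unsigned int)ctx->rt->fib6_flags);
}
```

The point of the sketch is only that the program sees one object at a
time plus the shared meta, and decides itself which fields to emit.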
>
> >
> > Further, what is the expectation in terms of stable API for these fs
> > entries? Entries in the context can change. Data structure names can
> > change. Entries in the structs can change. All of that breaks the idea
> > of stable programs that are compiled once and run for all future
> > releases. When structs change, those programs will break - and
> > structures will change.
>
> Yes, the API (ctx) we present to the bpf program is indeed unstable.
> CO-RE should help to a certain extent, but if, e.g., some fields are
> gone, the bpf program will need to be rewritten for that particular
> kernel version, or the kernel bpfdump infrastructure can be enhanced to
> change its ctx structure to give the program more information
> for that kernel version. In summary, I agree with you that this is
> an unstable API, similar to other tracing programs,
> since it accesses kernel internal data structures.
>
> >
> > What does bpfdumper provide that you can not do with a tracepoint on a
> > relevant function and then putting a program on the tracepoint? ie., why
> > not just put a tracepoint in the relevant dump functions.
>
> When I first began exploring bpfdump, a kprobe on the "show" function
> was one of the options. But we quickly realized that we actually do not
> want to just piggyback on the "show" function, but want to replace it
> with bpf. This will be useful in the following different use cases:
>   1. first, catable dumper file: similar to /proc/net/ipv6_route,
>      we want /sys/kernel/bpfdump/ipv6_route/my_dumper and you can cat
>      to get it.
>
>      Using kprobe when you are doing `cat /proc/net/ipv6_route`
>      is complicated. You probably need an application which
>      runs through `cat /proc/net/ipv6_route` and discards its output,
>      and at the same time gets the result from the bpf program
>      (filtered by pid since somebody may run
>      `cat /proc/net/ipv6_route` at the same time). You may use
>      perf ring_buffer to send the result back to the application.
>
>      Note that the perf ring buffer may lose records for whatever
>      reason, whereas seq_ops are implemented not to lose records,
>      by means of built-in retries.
>
>      Using the kprobe approach above is complicated and for each dumper
>      you need an application. We would like it to be just catable
>      with minimum user overhead to create such a dumper.
>
>   2. second, anonymous dumper: kprobe/tracepoint will incur the
>      original overhead of seq_printf per object, but the user may
>      be interested in only a very small portion of the information.
>      In such cases, a bpf program doing the filtering directly in
>      the kernel can potentially speed things up a lot if there are
>      a lot of records to traverse.
>
>   3. for data structures which do not have catable dumpers,
>      for example task, hopefully, as demonstrated in this patch set,
>      the kernel implementation and writing a bpf program are not
>      too hard. This especially enables people to do in-kernel
>      filtering, which is the strength of bpf.
>
> >
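
The "built-in retries" remark under use case 1 can be pictured with a
small userspace mock. This is a simplified sketch, loosely modelled on
how seq_file grows its buffer and re-runs show() when a record does not
fit; the kernel's actual logic differs in detail, and every name here
(mock_seq, mock_seq_puts, emit_record, show_str) is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Mock output buffer; overflowed is set when a record does not fit. */
struct mock_seq {
	char *buf;
	size_t size;
	size_t count;
	int overflowed;
};

/* Append a string, or flag overflow and leave the buffer untouched. */
static void mock_seq_puts(struct mock_seq *m, const char *s)
{
	size_t len = strlen(s);

	if (m->count + len >= m->size) {
		m->overflowed = 1;
		return;
	}
	memcpy(m->buf + m->count, s, len);
	m->count += len;
	m->buf[m->count] = '\0';
}

typedef void (*show_fn)(struct mock_seq *m, const void *obj);

/* Example show callback: the object is just a C string. */
static void show_str(struct mock_seq *m, const void *obj)
{
	mock_seq_puts(m, (const char *)obj);
}

/*
 * Emit one record. On overflow, roll back the partial output, double
 * the buffer and call show() again, so the record is never dropped.
 * Returns the number of show() attempts, or -1 if allocation fails.
 */
static int emit_record(struct mock_seq *m, show_fn show, const void *obj)
{
	int attempts = 0;

	for (;;) {
		size_t start = m->count;
		size_t new_size;
		char *bigger;

		m->overflowed = 0;
		show(m, obj);
		attempts++;
		if (!m->overflowed)
			return attempts;

		m->count = start;
		new_size = m->size ? m->size * 2 : 16;
		bigger = realloc(m->buf, new_size);
		if (bigger == NULL)
			return -1;
		m->buf = bigger;
		m->size = new_size;
		m->buf[m->count] = '\0';
	}
}
```

Contrast this with a ring buffer, where a full buffer means the record
is simply dropped: here the producer retries until the record fits.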