Re: [PATCH bpf-next v2 4/4] bpf: inet_diag: Dump bpf_sk_storages in inet_diag_dump()

Yonghong Song <yhs@xxxxxx> · Wed, 26 Feb 2020 22:16:16 -0800

On 2/26/20 5:34 PM, Martin KaFai Lau wrote:
On Wed, Feb 26, 2020 at 06:21:33PM +0100, Daniel Borkmann wrote:
On 2/26/20 12:04 AM, Martin KaFai Lau wrote:
This patch will dump out the bpf_sk_storages of a sk
if the request has the INET_DIAG_REQ_SK_BPF_STORAGES nlattr.

An array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD can be specified in
INET_DIAG_REQ_SK_BPF_STORAGES to select which bpf_sk_storage to dump.
If no map_fd is specified, all bpf_sk_storages of a sk will be dumped.

bpf_sk_storages can be added to the system at runtime.  It is difficult
to find a proper static value for cb->min_dump_alloc.

This patch learns the nlattr size required to dump the bpf_sk_storages
of a sk.  If it happens to be the very first nlmsg of a dump and it
cannot fit the needed bpf_sk_storages,  it will try to expand the
skb by "pskb_expand_head()".

Instead of expanding it in inet_sk_diag_fill(), it is expanded at a
sleepable context in __inet_diag_dump() so __GFP_DIRECT_RECLAIM can
be used.  In __inet_diag_dump(), it will retry as long as the
skb is empty and the cb->min_dump_alloc becomes larger than before.
cb->min_dump_alloc is bounded by KMALLOC_MAX_SIZE.  The min_dump_alloc
is also changed from 'u16' to 'u32' to accommodate a sk that may have
a few large bpf_sk_storages.

The updated cb->min_dump_alloc will also be used to allocate the skb in
the next dump.  This logic already exists in netlink_dump().

Here is the sample output of a locally modified 'ss' and it could be made
more readable by using BTF later:
[root@arch-fb-vm1 ~]# ss --bpf-map-id 14 --bpf-map-id 13 -t6an 'dst [::1]:8989'
State Recv-Q Send-Q Local Address:Port  Peer Address:PortProcess
ESTAB 0      0              [::1]:51072        [::1]:8989
	 bpf_map_id:14 value:[ 3feb ]
	 bpf_map_id:13 value:[ 3f ]
ESTAB 0      0              [::1]:51070        [::1]:8989
	 bpf_map_id:14 value:[ 3feb ]
	 bpf_map_id:13 value:[ 3f ]

[root@arch-fb-vm1 ~]# ~/devshare/github/iproute2/misc/ss --bpf-maps -t6an 'dst [::1]:8989'
State         Recv-Q         Send-Q                   Local Address:Port                    Peer Address:Port         Process
ESTAB         0              0                                [::1]:51072                          [::1]:8989
	 bpf_map_id:14 value:[ 3feb ]
	 bpf_map_id:13 value:[ 3f ]
	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
ESTAB         0              0                                [::1]:51070                          [::1]:8989
	 bpf_map_id:14 value:[ 3feb ]
	 bpf_map_id:13 value:[ 3f ]
	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]

Acked-by: Song Liu <songliubraving@xxxxxx>
Signed-off-by: Martin KaFai Lau <kafai@xxxxxx>

Hmm, the whole approach is not too pleasant to be honest. I can see why you need
it since the regular sk_storage lookup only takes sock fd as a key and you don't
have it otherwise available from outside, but then dumping up to KMALLOC_MAX_SIZE
via netlink skb is not a great experience either. :( Also, are we planning to add
the BTF dump there in addition to bpftool? Thus resulting in two different lookup
APIs and several tools needed for introspection instead of one? :/ Also, how do we
dump task local storage maps in future? Does it need a third lookup interface?

In your commit logs I haven't read on other approaches and why they won't work;
I was wondering, given sockets are backed by inodes, couldn't we have a variant
of iget_locked() (minus the alloc_inode() part from there) where you pass in ino
number to eventually get to the socket and then dump the map value associated with
it the regular way from bpf() syscall?
Thanks for the feedback!

I think (1) dumping all sk(s) in a system is different from
(2) dumping all sk of a bpf_sk_storage_map or lookup a particular
sk from a bpf_sk_storage_map.

This patch is doing (1).  I believe it is useful to make the commonly used
tools like "ss" (which already shows many useful information of a sk)
to be able to introspect a kernel struct extended by bpf instead of
limiting to only the bpftool can show the bpf extended data.
The plan is to move the bpftool/btf_dumper.c to libbpf.  The libbpf's
btf_dumper print out format is still TBD and the current target is the drgn
like format instead of the current semi-json like plain-txt printout.  As
more kernel struct may be extensible by bpf, having it in libbpf will be
useful going forward.

Re: future kernel struct extended by bpf
For doing (1), I think we can ride on the existing API to iterate them also.
bpf can extend a kernel struct but should not stop the current
iteration API from working or seeing them.  That includes the current
seq_file API.  The mid-term goal is to extend the seq_file by
attaching a bpf_prog to (e.g. /proc/net/tcp) to do filtering
and printing.  Yonghong is looking into that.

Right. I am working on a design and prototype to use bpf programs
to dump certain kernel internal data structures. The high level
idea is to create some /sys path to encode "what" to dump, e.g.,
/sys/kernel/bpfdump/tcp6_sockets/ for tcp6_sockets and "add" bpf
program(s) to that path to specify "how" to dump.

I hope to share the design and prototype soon.