Re: [PATCH bpf-next v2 4/4] bpf: inet_diag: Dump bpf_sk_storages in inet_diag_dump()

Martin KaFai Lau <kafai@xxxxxx> · Wed, 26 Feb 2020 17:34:48 -0800

On Wed, Feb 26, 2020 at 06:21:33PM +0100, Daniel Borkmann wrote:
> On 2/26/20 12:04 AM, Martin KaFai Lau wrote:
> > This patch will dump out the bpf_sk_storages of a sk
> > if the request has the INET_DIAG_REQ_SK_BPF_STORAGES nlattr.
> > 
> > An array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD can be specified in
> > INET_DIAG_REQ_SK_BPF_STORAGES to select which bpf_sk_storage to dump.
> > If no map_fd is specified, all bpf_sk_storages of a sk will be dumped.
> > 
> > bpf_sk_storages can be added to the system at runtime.  It is difficult
> > to find a proper static value for cb->min_dump_alloc.
> > 
> > This patch learns the nlattr size required to dump the bpf_sk_storages
> > of a sk.  If it happens to be the very first nlmsg of a dump and it
> > cannot fit the needed bpf_sk_storages,  it will try to expand the
> > skb by "pskb_expand_head()".
> > 
> > Instead of expanding it in inet_sk_diag_fill(), it is expanded at a
> > sleepable context in __inet_diag_dump() so __GFP_DIRECT_RECLAIM can
> > be used.  In __inet_diag_dump(), it will retry as long as the
> > skb is empty and the cb->min_dump_alloc becomes larger than before.
> > cb->min_dump_alloc is bounded by KMALLOC_MAX_SIZE.  The min_dump_alloc
> > is also changed from 'u16' to 'u32' to accommodate a sk that may have
> > a few large bpf_sk_storages.
> > 
> > The updated cb->min_dump_alloc will also be used to allocate the skb in
> > the next dump.  This logic already exists in netlink_dump().
> > 
> > Here is the sample output of a locally modified 'ss' and it could be made
> > more readable by using BTF later:
> > [root@arch-fb-vm1 ~]# ss --bpf-map-id 14 --bpf-map-id 13 -t6an 'dst [::1]:8989'
> > State Recv-Q Send-Q Local Address:Port  Peer Address:PortProcess
> > ESTAB 0      0              [::1]:51072        [::1]:8989
> > 	 bpf_map_id:14 value:[ 3feb ]
> > 	 bpf_map_id:13 value:[ 3f ]
> > ESTAB 0      0              [::1]:51070        [::1]:8989
> > 	 bpf_map_id:14 value:[ 3feb ]
> > 	 bpf_map_id:13 value:[ 3f ]
> > 
> > [root@arch-fb-vm1 ~]# ~/devshare/github/iproute2/misc/ss --bpf-maps -t6an 'dst [::1]:8989'
> > State         Recv-Q         Send-Q                   Local Address:Port                    Peer Address:Port         Process
> > ESTAB         0              0                                [::1]:51072                          [::1]:8989
> > 	 bpf_map_id:14 value:[ 3feb ]
> > 	 bpf_map_id:13 value:[ 3f ]
> > 	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
> > ESTAB         0              0                                [::1]:51070                          [::1]:8989
> > 	 bpf_map_id:14 value:[ 3feb ]
> > 	 bpf_map_id:13 value:[ 3f ]
> > 	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
> > 
> > Acked-by: Song Liu <songliubraving@xxxxxx>
> > Signed-off-by: Martin KaFai Lau <kafai@xxxxxx>
> 
> Hmm, the whole approach is not too pleasant to be honest. I can see why you need
> it since the regular sk_storage lookup only takes sock fd as a key and you don't
> have it otherwise available from outside, but then dumping up to KMALLOC_MAX_SIZE
> via netlink skb is not a great experience either. :( Also, are we planning to add
> the BTF dump there in addition to bpftool? Thus resulting in two different lookup
> APIs and several tools needed for introspection instead of one? :/ Also, how do we
> dump task local storage maps in future? Does it need a third lookup interface?
> 
> In your commit logs I haven't read on other approaches and why they won't work;
> I was wondering, given sockets are backed by inodes, couldn't we have a variant
> of iget_locked() (minus the alloc_inode() part from there) where you pass in ino
> number to eventually get to the socket and then dump the map value associated with
> it the regular way from bpf() syscall?
Thanks for the feedback!

I think (1) dumping all sk(s) in a system is different from
(2) dumping all sk of a bpf_sk_storage_map or lookup a particular
sk from a bpf_sk_storage_map.

This patch is doing (1).  I believe it is useful to make the commonly used
tools like "ss" (which already shows many useful information of a sk)
to be able to introspect a kernel struct extended by bpf instead of
limiting to only the bpftool can show the bpf extended data.
The plan is to move the bpftool/btf_dumper.c to libbpf.  The libbpf's
btf_dumper print out format is still TBD and the current target is the drgn
like format instead of the current semi-json like plain-txt printout.  As
more kernel struct may be extensible by bpf, having it in libbpf will be
useful going forward.

Re: future kernel struct extended by bpf
For doing (1), I think we can ride on the existing API to iterate them also.
bpf can extend a kernel struct but should not stop the current
iteration API from working or seeing them.  That includes the current
seq_file API.  The mid-term goal is to extend the seq_file by
attaching a bpf_prog to (e.g. /proc/net/tcp) to do filtering
and printing.  Yonghong is looking into that.

Re: lookup a bpf_sk_storage_map by socket fd and KMALLOC_MAX_SIZE
In my config,
sizeof(tcp_sock) is    2076
1 MAX sk_storage is   65407 (31x of a tcp_sock)
KMALLOC_MAX_SIZE is 4194304 (2000x of a tcp_sock)

It is a lot of data to extend on a tcp_sock.
The total bpf_sk_storages that a sk could have is further bounded by
net.core.optmem_max.  It is a very unusual setup to have a sk
that has so many bpf_sk_storage data that a skb cannot
accommodate.

That said, the current sk_storage_update() does not limit the total
bpf_sk_storages of a sk to KMALLOC_MAX_SIZE.

I think this limit could be added to the update side now and the chance
of breaking is very minimal.

or the netlink can return map_id only when the max-sized skb cannot fit
all the bpf_sk_storages.  The userspace then do another syscall to
lookup the data from each individual bpf_sk_storage_map and
that requires to lookup side support with another key (non-fd).
IMO, it is weird and a bit opposite of what bpf_sk_storage should be (fast
bpf_sk_storage lookup while holding a sk).  The iteration API already
holds the sk but instead it is asking the usespace to go back to find
out the sk again in order to get the bpf_sk_storages.  I think that
should be avoided if possible.

Regarding i_ino, after looking at sock_alloc() and get_next_ino(),
hmmm...is it unique?
If it is, what is the different usecase between i_ino and
sk->sk_cookie?