Re: [PATCH bpf-next v3 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS

Martin Lau <kafai@xxxxxx> · Wed, 8 Jan 2020 18:41:40 +0000

On Wed, Jan 08, 2020 at 05:53:50PM +0100, Daniel Borkmann wrote:
> On Wed, Jan 08, 2020 at 01:52:26AM +0000, Martin Lau wrote:
> > On Wed, Jan 08, 2020 at 01:21:39AM +0100, Daniel Borkmann wrote:
> > > On 12/31/19 7:20 AM, Martin KaFai Lau wrote:
> > > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> > > > is a kernel struct with its func ptr implemented in bpf prog.
> > > > This new map is the interface to register/unregister/introspect
> > > > a bpf implemented kernel struct.
> > > > 
> > > > The kernel struct is actually embedded inside another new struct
> > > > (or called the "value" struct in the code).  For example,
> > > > "struct tcp_congestion_ops" is embbeded in:
> > > > struct bpf_struct_ops_tcp_congestion_ops {
> > > > 	refcount_t refcnt;
> > > > 	enum bpf_struct_ops_state state;
> > > > 	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> > > > }
> > > > The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> > > > The "bpftool map dump" will then be able to show the
> > > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> > > > number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
> > > > is created automatically by a macro.  Having a separate "value" struct
> > > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> > > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> > > > initialization works before registering the struct_ops to the kernel
> > > > subsystem).  The libbpf will take care of finding and populating the
> > > > "struct bpf_struct_ops_XYZ" from "struct XYZ".
> > > > 
> > > > Register a struct_ops to a kernel subsystem:
> > > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> > > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> > > >     set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
> > > >     running kernel.
> > > >     Instead of reusing the attr->btf_value_type_id,
> > > >     btf_vmlinux_value_type_id s added such that attr->btf_fd can still be
> > > >     used as the "user" btf which could store other useful sysadmin/debug
> > > >     info that may be introduced in the furture,
> > > >     e.g. creation-date/compiler-details/map-creator...etc.
> > > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
> > > >     in the running kernel btf.  Populate the value of this object.
> > > >     The function ptr should be populated with the prog fds.
> > > > 4. Call BPF_MAP_UPDATE with the object created in (3) as
> > > >     the map value.  The key is always "0".
> > > > 
> > > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> > > > args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> > > > the specific struct_ops to do some final checks in "st_ops->init_member()"
> > > > (e.g. ensure all mandatory func ptrs are implemented).
> > > > If everything looks good, it will register this kernel struct
> > > > to the kernel subsystem.  The map will not allow further update
> > > > from this point.
> > > 
> > > Btw, did you have any thoughts on whether it would have made sense to add
> > > a new core construct for BPF aside from progs or maps, e.g. BPF modules
> > > which then resemble a collection of progs/ops (given this would not be limited
> > > to tcp congestion control only). Given the possibilities, having a bit of second
> > > thoughts on abusing BPF map interface this way which is not overly pretty. It's
> > > not a map anymore at this point anyway, we're just reusing the syscall interface
> > > since it's convenient though cannot be linked to any prog is just a single slot
> > > etc, but technically some sort of BPF module registration would be nicer. Also in
> > > terms of 'bpftool modules' then listing all such currently loaded modules which
> > > need to be cleaned up this way through explicit removal (similar to insmod/
> > > lsmod/rmmod); at least feels more natural conceptually than BPF maps and the way
> > > you refcount them, and would perhaps also be a fit for BPF lib helpers for dynamic
> > > linking to load that way. So essentially similar but more lightweight infrastructure
> > > as with kernel modules. Thoughts?
> > Inventing a new bpf obj type (vs adding new map type like in this patch) was
> > one considered (and briefly-tried) option.
> > 
> > Once BTF was introduced to bpf map,  I see bpf map as an introspectible
> > bpf obj that can store any blob described by BTF.  I don't think
> > creating a new bpf obj type worth it while both of them are basically
> > storing a value described by BTF.
> > 
> > I did try to create register/unregister interface and new bpf-cmd.
> > At the end, it ends up very similar to update_elem() which is basically
> > updating a blob of a struct described by BTF.  Hence, I tossed that and
> > came back to the current approach.
> > 
> > Put aside the new bpf obj type needs kernel support like another idr,
> > likely pin-able, fd, get_info...etc,  I suspect most users have already
> > been used to do 'bpftool map dump' to introspect bpf obj that is storing
> > a 'struct'.
> > 
> > The map type is enough to distinguish the map usage instead of creating
> > another bpf obj type.  The 'bpftool modules' will work on the struct_ops
> > map only.
> 
> Right, but under long-term I'd expect more users of this interface and given
> we abuse the map only to keep other entities (here: bpf tcp congctl module)
> 'alive', but cannot do anything else with this map (as in: usage in the BPF
> program),
For now, yes.  In the future, a bpf_prog may want to switch to another
bpf-tcp-cc (could be by looking it up from map-in-map also).  I do not
mean there is an immediate usecase but it is good to keep this
flexibility.

> it feels that this begs for a better interface. Given we need an
> explicit delete operation of the map slot in order to eventually unregister
> the congctl module once no application is using it anymore, how are users
> supposed to operate this considering the loader performs either only a load
> or crashes before the map delete happens? If you had 'bpftool modules' like
> cmdline interface with similar insmod/lsmod/rmmod type operation as we have
> for kernel modules, it's pretty obvious and intuitive. Here, you'd need a
> 'bpftool map dump' to get to the concrete ops map, and then perform an
> explicit delete operation for releasing the ops refcount and thus to unload
> the set of progs. Such extension for bpftool should be done regardless, even
A new bpftool command to operate on struct_ops map alone is in the pipeline.

The first thing though is to improve bpftool to recognize
btf_vmlinux_value_type_id which could be useful in the future
maps that also store a kernel's struct.

Regarding 'bpftool map show' first to figure out which
'struct_ops' map to delete,  the same is also true for lsmod/rmmod.
I also usually do lsmod to figure out which one I am looking at first
before issuing rmmod.  I suspect even the same lookup and then
delete/rmmod operation will still have to be done for the future
'bpftool modules (or struct_ops)' command.

> if we end up to keep abusing the map interface for this, but API wise feels
> way cleaner to have a dedicated register/unregister interface.
Other than the BPF syscall command name difference, lets explore how
would register/unregister be different from update/delete.

The first attempt I did on BPF_STRUCT_OPS_REGISTER is to do update alone
which ends-up very close to BPF_MAP_UPDATE_ELEM.

The second attempt I did on register is to do map-create and map-update
together and then return a fd.  However, I still don't see enough
benefit that deserves a separate BPF command to just combine these
two.  The global .rodata map is also map-create, map-update,
and map-freeze which technically it can do all of them under
a new command.

If update-vs-register looks the same, it then logically follows
update-then-delete.  For BPF_STRUCT_OPS_UNREGISTER, I also do not
see any difference except the key can be avoided.  Also, the
struct_ops map has a btf_key_type_id 0 which is "void" and
it is a clean interface to tell the map is key-less.