Re: Struct_ops Questions

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Tue, 17 Jan 2023 11:05:57 -0800

On 1/13/23 5:05 PM, Daniel Rosenberg wrote:
I'm currently working on switching Fuse-BPF over to using struct_ops,
and have a few questions.

I noticed that the lifespan of the [name]_struct_op struct passed in
reg, and the associated maps have a lifespan that can't be influenced
by the subsystem. Fuse-bpf keeps struct ops on an arbitrary number of
inodes which will potentially outlive the op structure. My current
plan is to have fuse treat unreg similar to if the daemon simply
stopped responding, failing all impacted calls with ENOTCONN. I'm
currently looking at two approaches.

1. Copy the struct received in reg, and have a flag to indicate if the
structure is live, with unreg just marking it as dead and some RCU
style sync to ensure we don't lose the function pointers post check.
2. Maintain an additional struct that points to the reg provided
struct. Null out that pointer in unreg, with the same sort of RCU
sync.

I'm currently leaning towards 1, but not sure what the best approach
is here, or if there should be some way for the subsystem to grab a
kref on the maps/struct_op structures. Given the analogue of the FUSE
daemon being able to disappear at any given time, I think one of the
above options should be enough.

The old fuse-bpf implementation, which had its own program type
defined, would use the program fd as an identifier, which allowed it
to increment the ref count as needed. That sort of information isn't
exposed to the register function, and you can't reach the struct_ops
structure from a map fd as is. Either of those would allow us to
directly use the map/struct objects without needing to maintain an
extra layer of duplicated data. Currently all the register function
does is add information to be able to check if the map still exists,
which wouldn't be needed if we could just grab a ref.

I'm not sure how to handle Fuse being built as a module. Currently,
bpf_struct_ops uses an include file list to define all available
struct_ops, along with externs for their bpf_struct_ops structure. If
we build fuse as a module, that either would not be available, or
would require us to build extra things into the kernel when we make
fuse as a module, which sort of seems to defeat the point.

Do we need a module interface here? At the moment I'm messing around
with just having the reg/unreg functions implemented within the FUSE
module, with all of the verification functions built in on the bpf
side. I've got a rough prototype working, but there's some messiness
around unloading the module while there are still struct_op programs
registered. If you unload and reload the module, the struct_op program
will still show up via bpftools, but would be unusable since it would
no longer be registered. All of that goes away if we can directly use
the map fd as an identifier. That wouldn't be useful for anything that
requires extra setup in their reg function of course.

It seems like a fair bit of these issues go away if we allow for a way
to grab the specific struct_op structure from the map fd. Would that
be a reasonable thing to expose to a module?

I think the first question is related to the refcnt of the struct_ops. Whenever 
a new tcp connection is established, it also needs to bump the refcnt of a (bpf) 
tcp_congestion_ops. The tcp subsystem currently does a bpf_try_module_get() 
which then calls bpf_struct_ops_get() if 'owner == BPF_MODULE_OWNER'. 
bpf_struct_ops_get() will inc the refcnt of the struct_ops.

Regarding unreg, the st_map->st_ops->unreg() will unregister the struct_ops from 
the tcp subsystem. The future tcp connection won't be able to find this (bpf) 
tcp_congestion_ops.

After unreg(), the existing tcp connections could still hold a refcnt to the 
already unregistered (bpf) tcp_congestion_ops. The refcnt will eventually be 
released when all these old connections are closed. This is similar to all other 
kernel tcp-cc modules (eg. when tcp_dctcp.c is built as a module).  Note that 
once unreg, the struct_ops will still be shown in 'bpftools struct_ops dump 
id...' but the status will be in BPF_STRUCT_OPS_STATE_TOBEFREE and it won't be 
usable by new connection.

Does the above behavior work for FUSE also? If not, how is FUSE different?

I think the next set of question is about when FUSE itself is built as a module. 
The first is how to register 'struct bpf_struct_ops bpf_fuse_struct_ops'. You 
are correct that right now it is done during compile time in 
bpf_struct_ops_types.h. To make FUSE itself built as a module and 
bpf_fuse_struct_ops defined in the FUSE module, some work is needed in 
bpf_struct_ops.c to allow registering struct_ops in runtime during module_init. 
For module reference counting, I think the bpf_fuse_struct_ops should be able to 
hold one refcnt of the fuse module. Not sure how bpf_fuse_struct_ops looks like 
and I also don't know how grabbing the map fd will help. It seems you already 
have something working, so it may be easier to discuss base on that.