On Wed, May 8, 2024 at 4:53 AM Benjamin Tissoires <bentiss@xxxxxxxxxx> wrote:
>
> On May 07 2024, Alexei Starovoitov wrote:
> > On Tue, May 7, 2024 at 6:32 AM Benjamin Tissoires <bentiss@xxxxxxxxxx> wrote:
> > >
> > > Yes, exactly that. See [0] for my current WIP. I've just sent it, not
> > > for reviews, but so you see what I meant here.
> >
> > The patches helped to understand, for sure, and on the surface
> > they kind of make sense, but without seeing what is that
> > hid specific kfunc that will use it
> > it's hard to make a call.
>
> I've posted my HID WIP on [1]. It probably won't compile as my local
> original branch was having a merge of HID and bpf trees.

Thanks for this context. Now it makes a lot more sense.

And the first patches look fine and the simplification is impressive,
but this part:

+	SEC("syscall")
+	int name(struct attach_prog_args *ctx)
+	{
+		ctx->retval = hid_bpf_attach_prog_impl(ctx->hid,
+						       type,
+						       __##name,
+						       ctx->insert_head ? HID_BPF_FLAG_INSERT_HEAD :
+									  HID_BPF_FLAG_NONE,
+						       NULL);

is too odd.
Essentially you're adding a kfunc on the hid side just to call it
from a temporary "syscall" program to register another prog.
A fake prog just to call a kfunc is a bit too much.

The overall mechanism is pretty much a customized struct-ops.
I think the struct-ops infra provides better APIs, safety guarantees,
bpf_link support, prog management including reentrance checks, etc.

It needs to be parametrized, so it's not just
SEC("struct_ops/kern_callback_name"),
so that the skeleton loading phase can pass a device id or something.

> > The (u64)(long) casting concerns and prog lifetime are
> > difficult to get right. The verifier won't help and it all
> > will fall on code reviews.
>
> yeah, this is a concern.

Not only that.
The special kfunc does migrate_disable before calling the callback,
but it also needs rcu_lock or the tracing lock, plus reentrance checks.

> > So I'd rather not go this route.
> > Let's explore first what exactly the goal is here.
> > We've talked about sleepable tail_calls, these async callbacks
> > from hid kfuncs, and struct-ops.
> > Maybe none of them fit well and we need something else.
> > Could you please explain (maybe once again) what the end goal is?
>
> right now I need 4 hooks in HID, the first 2 are already upstream:
> - whenever I need to retrieve the report descriptor (this happens in a
>   sleepable context, but non sleepable is fine)
> - whenever I receive an event from a device (non sleepable context, this
>   is coming from a hard IRQ context)
> - whenever someone tries to write to the device through
>   hid_hw_raw_request (original context is sleepable, and for being able
>   to communicate with the device we need a sleepable context in bpf)
> - same but from hid_hw_output_report
>
> Again, the first 2 are working just fine.
>
> Implementing the latter two requires a sleepable context because we
> might have:
>
> 1. a request is made from user-space
> 2. we jump into hid-bpf
> 3. the bpf program "converts" the request from report ID 1 to 2 (because
>    we export a slightly different API)
> 4. the bpf program directly emits hid_bpf_raw_request (sleepable
>    operation)
> 5. the bpf program returns the correct value
> 6. hid-core doesn't attempt to communicate with the device as bpf
>    already did.
>
> In the series, I also realized that I need sleepable and non sleepable
> contexts for this kind of situation, because I want tracing and
> firewalling available (non sleepable context), while still allowing to
> communicate with the device. But when you communicate with the device
> from bpf, the sleepable bpf program is not invoked, or this allows
> infinite loops.

I don't get the point about infinite loops.

fyi struct_ops already supports sleepable and non-sleepable callbacks.
See progs/dummy_st_ops_success.c:

SEC(".struct_ops")
struct bpf_dummy_ops dummy_1 = {
	.test_1 = (void *)test_1,
	.test_2 = (void *)test_2,
	.test_sleepable = (void *)test_sleepable,
};

Two of the callbacks are normal and the other one is sleepable.
The generated bpf trampoline will have the right __bpf_prog_enter*
wrappers for all 3 progs, so the kernel code will just do
ops->callback_name().
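To make that sleepable/non-sleepable split concrete, here is roughly how
the callbacks behind such a map are declared on the BPF side (going from
memory of that selftest, so the exact signatures may differ; test_2 is
declared the same way as test_1 and omitted here). The only difference
between the two flavors is the section name: the ".s" suffix is what
makes a struct_ops prog sleepable.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* non-sleepable callback: plain "struct_ops/" section */
SEC("struct_ops/test_1")
int BPF_PROG(test_1, struct bpf_dummy_ops_state *state)
{
	return 0;
}

/*
 * sleepable callback: the ".s" suffix tells libbpf to load this prog
 * as sleepable, so the generated trampoline uses the sleepable
 * __bpf_prog_enter variant for it.
 */
SEC("struct_ops.s/test_sleepable")
int BPF_PROG(test_sleepable, struct bpf_dummy_ops_state *state)
{
	return 0;
}

char _license[] SEC("license") = "GPL";

libbpf picks the program flavor from that section suffix, and the
struct_ops map shown above just wires the progs into the ops slots.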
> > > Last time I checked, I thought struct_ops were only for defining one set
> > > of operations. And you could overwrite them exactly once.
> > > But after reading more carefully how it was used in tcp_cong.c, it seems
> > > we can have multiple programs which define the same struct_ops, and then
> > > it's the kernel which will choose which one needs to be run.
> >
> > struct-ops is pretty much a mechanism for the kernel to define
> > a set of callbacks and for a bpf prog to provide the implementation of
> > these callbacks. The kernel chooses when to call them.
> > tcp-bpf is one such user. sched_ext is another and more advanced one.
> > Currently the struct-ops bpf prog loading/attaching mechanism
> > only specifies the struct-ops. There is no device-id argument,
> > but that can be extended and the kernel can keep a per-device set
> > of bpf progs.
> > struct-ops is a bit of an overkill if you have only one callback.
> > It's typically for a set of callbacks.
>
> In the end I have 4. However, I might have programs that overwrite the
> same callback twice (see the 2 SEC("fmod_ret/hid_bpf_device_event") in
> [2]).
>
> > > Last, I'm not entirely sure how I can specify which struct_ops needs to be
> > > attached to which device, but it's worth a shot. I've already realized
> > > that I would probably have to drop the current way HID-BPF is running,
> > > so now it's just technical bits to assemble :)
> >
> > You need to call different bpf progs per device, right?
>
> yes
>
> > If an indirect call is fine from a performance pov,
> > then tailcall or struct_ops+device_argument might fit.
>
> performance is not a requirement. It's better if we have low latency, but
> we are not talking about the same requirements as networking.
>
> > If you want max perf with direct calls then
> > we'd need to generalize the xdp dispatcher.
>
> I'll need to have a deeper look at it, yeah.
>
> > So far it sounds like tailcalls might be the best actually,
> > since prog lifetime is handled by the prog array map.
> > Maybe instead of the bpf_tail_call helper we should add a kfunc that
> > will operate on the prog array differently?
> > (if the current bpf_tail_call semantics don't fit).
>
> Actually I'd like to remove bpf_tail_call entirely, because it requires
> pre-loading a BPF program at boot, and in some situations (RHEL) this
> creates issues. I haven't been able to debug what was happening, and I
> couldn't reproduce it myself, but removing that bit would be nice :)

We probably need to debug it anyway,
since it sounds like it's related to the preloaded bpf skeleton
and not the tail_call logic itself.

After looking through all that, it seems to me that
parametrized struct-ops is the way to go.
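To sketch what "parametrized" could look like for HID (every name below
is made up for illustration only; the actual ops layout would be whatever
the HID side ends up declaring), the idea is that the ops struct itself
carries the device id as a data field, the skeleton loader fills it in
from user space before attaching, and the kernel keeps a per-device list
of attached ops:

/*
 * Hypothetical kernel-side declaration (illustration only):
 *
 *	struct hid_bpf_ops {
 *		int hid_id;	// the "parameter": which device to attach to
 *		int (*hid_device_event)(struct hid_bpf_ctx *ctx);	// non-sleepable, IRQ path
 *		int (*hid_rdesc_fixup)(struct hid_bpf_ctx *ctx);	// report descriptor fixup
 *		int (*hid_hw_request)(struct hid_bpf_ctx *ctx);		// sleepable
 *		int (*hid_hw_output_report)(struct hid_bpf_ctx *ctx);	// sleepable
 *	};
 *
 * BPF side (the struct would come from vmlinux.h):
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("struct_ops/hid_device_event")
int BPF_PROG(my_device_event, struct hid_bpf_ctx *hctx)
{
	/* tracing/firewalling: non-sleepable, called from the event path */
	return 0;
}

SEC("struct_ops.s/hid_hw_request")
int BPF_PROG(my_hw_request, struct hid_bpf_ctx *hctx)
{
	/* sleepable: allowed to talk back to the device */
	return 0;
}

SEC(".struct_ops.link")
struct hid_bpf_ops my_device_ops = {
	.hid_device_event = (void *)my_device_event,
	.hid_hw_request = (void *)my_hw_request,
	/* .hid_id filled in from user space before load */
};

char _license[] SEC("license") = "GPL";

On the loader side that would be roughly
skel->struct_ops.my_device_ops->hid_id = <device id>; after open and
before load, then bpf_map__attach_struct_ops(), which also gives
bpf_link-based lifetime management instead of the register/release
dance done by the temporary "syscall" prog today.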