Re: [PATCH] bpf: verifier: allow arrays of progs to be used in sleepable context

Benjamin Tissoires <bentiss@xxxxxxxxxx> · Tue, 30 Apr 2024 12:02:59 +0200

On Apr 24 2024, Alexei Starovoitov wrote:
> On Wed, Apr 24, 2024 at 7:17 AM Benjamin Tissoires <bentiss@xxxxxxxxxx> wrote:
> >
> > On Apr 22 2024, Benjamin Tissoires wrote:
> > > On Apr 22 2024, Daniel Borkmann wrote:
> > > > On 4/22/24 9:16 AM, Benjamin Tissoires wrote:
> > > > > Arrays of progs use regular arrays underneath, but they can only
> > > > > be updated from a syscall.
> > > > > Therefore, they should be safe to use while in a sleepable context.
> > > > >
> > > > > This is required to be able to call bpf_tail_call() from a sleepable
> > > > > tracing bpf program.
> > > > >
> > > > > Signed-off-by: Benjamin Tissoires <bentiss@xxxxxxxxxx>
> > > > > ---
> > > > > Hi,
> > > > >
> > > > > a small patch to allow the following:
> > > > >
> > > > > ```
> > > > > SEC("fmod_ret.s/__hid_bpf_tail_call_sleepable")
> > > > > int BPF_PROG(hid_tail_call_sleepable, struct hid_bpf_ctx *hctx)
> > > > > {
> > > > >   bpf_tail_call(ctx, &hid_jmp_table, hctx->index);
> > > > >
> > > > >   return 0;
> > > > > }
> > > > > ```
> > > > >
> > > > > This should allow me to add bpf hooks to functions that communicate with
> > > > > the hardware.
> > > >
> > > > Could you also add selftests to it? In particular, I'm thinking that this is not
> > > > sufficient given also bpf_prog_map_compatible() needs to be extended to check on
> > > > prog->sleepable. For example we would need to disallow calling sleepable programs
> > > > in that map from non-sleepable context.
> > >
> > > Just to be sure, if I have to change bpf_prog_map_compatible(), that
> > > means that a prog array map can only have sleepable or non-sleepable
> > > programs, but not both at the same time?
> > >
> > > FWIW, indeed, I just tested and the BPF verifier/core is happy with this
> > > patch only if the bpf_tail_call is issued from a non-sleepable context
> > > (and crashes as expected).
> > >
> > > But that seems to be a different issue TBH: I can store a sleepable BPF
> > > program in a prog array and run it from a non sleepable context. I don't
> > > need the patch at all as bpf_tail_call() is normally declared. I assume
> > > your suggestion to change bpf_prog_map_compatible() will fix that part.
> > >
> > > I'll dig some more tomorrow.
> > >
> >
> > Quick update:
> > forcing the prog array to only contain sleepable programs or not seems
> > to do the trick, but I'm down a rabbit hole as when I return from my
> > trampoline, I get an invalid page fault, trying to execute NX-protected
> > page.
> >
> > I'll report whether it's because of HID-BPF or whether there is more work
> > to be done for bpf_tail_call (which I suspect).
> 
> bpf_tail_call is an old mechanism.
> Instead of making it work for sleepable (which is ok to do)
> have you considered using "freplace" logic to "add bpf hooks to functions" ?
> You can have a global noinline function and replace it at run-time
> with another bpf program.
> Like:
> __attribute__ ((noinline))
> int get_constant(long val)
> {
>         return val - 122;
> }
> 
> in progs/test_pkt_access.c
> 
> is replaced with progs/freplace_get_constant.c
> 
> With freplace you can pass normal arguments, do the call and get
> return value, while with bpf_tail_call it's ctx only and no return.

This is interesting. Thanks!

However, I'm not sure that this would fit for my use case.

Basically, what I am doing is storing a list of bpf programs I want to
run on a particular device for a given function.

Right now, what I am doing is (in simplified pseudo code):
- in a bpf program, the user calls hid_bpf_attach_prog(hid_device, program_fd)
  where program_fd is a tracing program attached to a function that is
  never executed; the attachment only tells the kernel which type of
  program to run
- the kernel stores that program in a dedicated prog array bpf_map
  pre-loaded at boot time
- when an event comes in, the kernel walks through the list of attached
  programs and calls __hid_bpf_tail_call(); a tracing program attached
  to that function then just does the bpf_tail_call.
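
To make the flow above concrete, here is a minimal userspace model of it in
plain C. This is only a sketch, not the HID-BPF code: the prog array is
modelled as a table of function pointers, and every name in it
(hid_ctx_model, attach_prog, tail_call_trampoline, dispatch_event,
count_event) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROGS 8

/* Hypothetical stand-in for struct hid_bpf_ctx. */
struct hid_ctx_model {
    int index;      /* slot in the jump table to tail-call into */
    int device_id;
    int data;
};

typedef int (*prog_fn)(struct hid_ctx_model *ctx);

/* Models the prog array map: one slot per attached program. */
prog_fn jmp_table[MAX_PROGS];
/* Models the kernel-side list recording which device owns each slot. */
int slot_device[MAX_PROGS];
size_t n_attached;
int events_seen;

/* Example "bpf program": only counts the events it receives. */
int count_event(struct hid_ctx_model *ctx)
{
    (void)ctx;
    events_seen++;
    return 0;
}

/* Models hid_bpf_attach_prog(): store the program, remember its device. */
int attach_prog(int device_id, prog_fn fn)
{
    if (n_attached >= MAX_PROGS)
        return -1;
    jmp_table[n_attached] = fn;
    slot_device[n_attached] = device_id;
    return (int)n_attached++;
}

/* Models the tracing program on __hid_bpf_tail_call(): jump via ctx->index. */
int tail_call_trampoline(struct hid_ctx_model *ctx)
{
    assert(ctx != NULL);
    if (ctx->index < 0 || ctx->index >= MAX_PROGS || !jmp_table[ctx->index])
        return 0; /* empty slot: fall through, like a failed tail call */
    return jmp_table[ctx->index](ctx);
}

/* Models the event path: run every program attached to this device. */
int dispatch_event(int device_id, int data)
{
    int ran = 0;

    for (size_t i = 0; i < n_attached; i++) {
        struct hid_ctx_model ctx = { (int)i, device_id, data };

        if (slot_device[i] != device_id)
            continue;
        tail_call_trampoline(&ctx);
        ran++;
    }
    return ran;
}
```

The extra hop through tail_call_trampoline() on every event is the part I
find clunky: the kernel already knows which program it wants to run.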

This works and is simple enough from the user point of view, but is
rather inefficient and clunky from the kernel point of view IMO.

The freplace mechanism would definitely work if I had a tracing-like
function to call, where I need to run the program any time the function
gets called. But given that I want per-device filtering, and that I need
to be able to enable or disable the bpf program, I'm not sure how I could
make this work from the kernel point of view.

I tried using a simple bpf_prog_run() (which is exactly what I need in
the end) but I couldn't really convince the bpf verifier that the
provided context is a struct hid_bpf_ctx kernel pointer, and it didn't
feel quite right.

So after seeing how bpf_wq works internally, and how simple it now is
to call a bpf program from the kernel as a simple function call, I
played around with allowing kfuncs to declare async callback functions.

I have a working prototype (probably not fully functional for all of the
cases), but I would like to know if you think it would be interesting to
have 3 new suffixes:
- "__async" for declaring a static bpf program that can be stored in
  the kernel and which would be non sleepable
- "__s_async" same as before, but for sleepable operations
- "__aux" (or "__prog_aux") for that extra parameter to
  bpf_wq_set_callback_impl() which contains the struct bpf_prog*.

(I still don't have the __aux yet FWIW)
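
To illustrate the naming scheme, here is a small plain C sketch of how a
kfunc parameter name could be classified by suffix, in the same spirit as
the existing kfunc argument suffixes such as "__sz" or "__k". It is not
the prototype's code: the enum and both helpers (param_has_suffix,
classify_param) are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical classification of a kfunc parameter by its name suffix. */
enum cb_kind {
    CB_NONE,
    CB_ASYNC,           /* "__async": non sleepable callback */
    CB_SLEEPABLE_ASYNC, /* "__s_async": sleepable callback */
    CB_PROG_AUX,        /* "__aux" or "__prog_aux": carries the prog */
};

/* Return true if name ends with suffix. */
bool param_has_suffix(const char *name, const char *suffix)
{
    size_t nlen, slen;

    assert(name != NULL && suffix != NULL);
    nlen = strlen(name);
    slen = strlen(suffix);
    return nlen >= slen && strcmp(name + nlen - slen, suffix) == 0;
}

/* Map a parameter name to the kind of argument it declares. */
enum cb_kind classify_param(const char *name)
{
    if (param_has_suffix(name, "__s_async"))
        return CB_SLEEPABLE_ASYNC;
    if (param_has_suffix(name, "__async"))
        return CB_ASYNC;
    if (param_has_suffix(name, "__prog_aux") ||
        param_has_suffix(name, "__aux"))
        return CB_PROG_AUX;
    return CB_NONE;
}
```

Note that "__s_async" does not end with "__async" and "__prog_aux" does not
end with "__aux", so the suffixes cannot be confused with one another
regardless of the order of the checks.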

The way I'm doing it is by looking at the btf information to fetch the
signature of the callback's parameters. This way we can declare any
callback without having to teach the verifier about its arguments (5 max).

Is this something you would be comfortable with or is there a simpler
mechanism already in place to call the bpf programs from the kernel
without the ctx limitations?

I can also easily switch the bpf_wq-specific cases in the verifier over
to those suffixes. There are still one or two wq specifics I haven't
implemented through __s_async, but that would still make things more
generic.

Cheers,
Benjamin