Re: [PATCH bpf-next v3 1/2] libbpf: Add BPF_KPROBE_SYSCALL macro

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 9 Feb 2022 08:38:16 -0800

On Wed, Feb 9, 2022 at 2:25 AM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Feb 7, 2022 at 1:58 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Mon, Feb 7, 2022 at 6:31 AM Hengqi Chen <hengqi.chen@xxxxxxxxx> wrote:
> > >
> > > Add syscall-specific variant of BPF_KPROBE named BPF_KPROBE_SYSCALL ([0]).
> > > The new macro hides the underlying way of getting syscall input arguments.
> > > With the new macro, the following code:
> > >
> > >     SEC("kprobe/__x64_sys_close")
> > >     int BPF_KPROBE(do_sys_close, struct pt_regs *regs)
> > >     {
> > >         int fd;
> > >
> > >         fd = PT_REGS_PARM1_CORE(regs);
> > >         /* do something with fd */
> > >     }
> > >
> > > can be written as:
> > >
> > >     SEC("kprobe/__x64_sys_close")
> > >     int BPF_KPROBE_SYSCALL(do_sys_close, int fd)
> > >     {
> > >         /* do something with fd */
> > >     }
> > >
> > >   [0] Closes: https://github.com/libbpf/libbpf/issues/425
> > >
> > > Signed-off-by: Hengqi Chen <hengqi.chen@xxxxxxxxx>
> > > ---
> > >  tools/lib/bpf/bpf_tracing.h | 33 +++++++++++++++++++++++++++++++++
> > >  1 file changed, 33 insertions(+)
> > >
> > > diff --git a/tools/lib/bpf/bpf_tracing.h b/tools/lib/bpf/bpf_tracing.h
> > > index cf980e54d331..7ad9cdea99e1 100644
> > > --- a/tools/lib/bpf/bpf_tracing.h
> > > +++ b/tools/lib/bpf/bpf_tracing.h
> > > @@ -461,4 +461,37 @@ typeof(name(0)) name(struct pt_regs *ctx)                              \
> > >  }                                                                          \
> > >  static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args)
> > >
> > > +#define ___bpf_syscall_args0()           ctx
> > > +#define ___bpf_syscall_args1(x)          ___bpf_syscall_args0(), (void *)PT_REGS_PARM1_CORE_SYSCALL(regs)
> > > +#define ___bpf_syscall_args2(x, args...) ___bpf_syscall_args1(args), (void *)PT_REGS_PARM2_CORE_SYSCALL(regs)
> > > +#define ___bpf_syscall_args3(x, args...) ___bpf_syscall_args2(args), (void *)PT_REGS_PARM3_CORE_SYSCALL(regs)
> > > +#define ___bpf_syscall_args4(x, args...) ___bpf_syscall_args3(args), (void *)PT_REGS_PARM4_CORE_SYSCALL(regs)
> > > +#define ___bpf_syscall_args5(x, args...) ___bpf_syscall_args4(args), (void *)PT_REGS_PARM5_CORE_SYSCALL(regs)
> > > +#define ___bpf_syscall_args(args...)     ___bpf_apply(___bpf_syscall_args, ___bpf_narg(args))(args)
> > > +
> > > +/*
> > > + * BPF_KPROBE_SYSCALL is a variant of BPF_KPROBE, which is intended for
> > > + * tracing syscall functions, like __x64_sys_close. It hides the underlying
> > > + * platform-specific low-level way of getting syscall input arguments from
> > > + * struct pt_regs, and provides a familiar typed and named function arguments
> > > + * syntax and semantics of accessing syscall input parameters.
> > > + *
> > > + * Original struct pt_regs* context is preserved as 'ctx' argument. This might
> > > + * be necessary when using BPF helpers like bpf_perf_event_output().
> > > + */
> >
> > LGTM. Please also mention that this macro relies on CO-RE so that
> > users are aware.
> >
>
> Now that Ilya's fixes are in again, added a small note about reliance
> on BPF CO-RE and pushed to bpf-next, thanks.
>
>
> On a related note: the whole __x64_sys_close vs sys_close difference,
> depending on architecture and kernel version, was always super annoying. BCC
> makes this transparent to users (AFAIK) and it always bothered me a
> little, but I didn't see a clean solution that fits libbpf.
>
> I think I finally found it, though. Instead of guessing whether the
> kprobe function is a syscall or not based on "sys_" prefix of a kernel
> function, we can use libbpf SEC() handling to do this transparently.
> What if we define two new SEC() definitions:
>
> SEC("ksyscall/write") and SEC("kretsyscall/write") (or maybe
> SEC("kprobe.syscall/write") and SEC("kretprobe.syscall/write"), not
> sure which one is better, voice your opinion, please). And for such
> special kprobes, libbpf will perform feature detection of this
> ARCH_SYSCALL_WRAPPER (we'll need to see the best way to do this in a
> simple and fast way, preferably without parsing kallsyms) and
> depending on it substitute either sys_write (or should it be
> __se_sys_write, according to Naveen) or __<arch>_sys_write. You get
> the idea.
>
> I like that this is still explicit and in the spirit of libbpf, but
> offloads the burden of knowing these intricate differences from users.
>
> Thoughts?

I think it will be just as fragile.
That syscall prefix has been changed by the kernel a few times now.
libbpf will be chasing the moving target.
I think keeping the magic in .h is simpler and less of a maintenance burden.
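
A minimal sketch of the two options weighed above, in plain user-space C so it can be compiled anywhere. It is not libbpf code: SYS_PREFIX, SYSCALL_FN and resolve_syscall_fn() are hypothetical names, and the per-architecture prefix table is an assumption that (as noted above) has changed across kernel versions.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed syscall-wrapper prefixes per architecture. Illustrative only:
 * the kernel has renamed these before, so a real header would have to
 * track the kernel versions it supports. */
#if defined(__x86_64__)
#define SYS_PREFIX "__x64_"
#elif defined(__aarch64__)
#define SYS_PREFIX "__arm64_"
#elif defined(__s390x__)
#define SYS_PREFIX "__s390x_"
#else
#define SYS_PREFIX ""
#endif

/* Header-side option: the name is fixed at compile time as a string
 * literal, so it could be pasted into a section name, e.g.
 * SEC("kprobe/" SYSCALL_FN(close)). */
#define SYSCALL_FN(name) SYS_PREFIX "sys_" #name

/* Loader-side option: pick the name at load time. 'has_wrapper' stands in
 * for whatever ARCH_SYSCALL_WRAPPER feature detection the loader would do;
 * how to detect it is left open in the thread and is not shown here.
 * Returns the snprintf() result (length of the full name). */
static int resolve_syscall_fn(char *buf, size_t sz, const char *name,
			      bool has_wrapper)
{
	return snprintf(buf, sz, "%ssys_%s", has_wrapper ? SYS_PREFIX : "", name);
}
```

With the header variant, a kernel-side rename means updating the prefix table in the .h; with the loader variant, the same table plus the detection logic lives inside the library instead.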