Re: [PATCH bpf-next v3 1/2] libbpf: Add BPF_KPROBE_SYSCALL macro

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 9 Feb 2022 09:47:46 -0800

On Wed, Feb 9, 2022 at 8:38 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Feb 9, 2022 at 2:25 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Mon, Feb 7, 2022 at 1:58 PM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Mon, Feb 7, 2022 at 6:31 AM Hengqi Chen <hengqi.chen@xxxxxxxxx> wrote:
> > > >
> > > > Add syscall-specific variant of BPF_KPROBE named BPF_KPROBE_SYSCALL ([0]).
> > > > The new macro hides the underlying way of getting syscall input arguments.
> > > > With the new macro, the following code:
> > > >
> > > >     SEC("kprobe/__x64_sys_close")
> > > >     int BPF_KPROBE(do_sys_close, struct pt_regs *regs)
> > > >     {
> > > >         int fd;
> > > >
> > > >         fd = PT_REGS_PARM1_CORE(regs);
> > > >         /* do something with fd */
> > > >     }
> > > >
> > > > can be written as:
> > > >
> > > >     SEC("kprobe/__x64_sys_close")
> > > >     int BPF_KPROBE_SYSCALL(do_sys_close, int fd)
> > > >     {
> > > >         /* do something with fd */
> > > >     }
> > > >
> > > >   [0] Closes: https://github.com/libbpf/libbpf/issues/425
> > > >
> > > > Signed-off-by: Hengqi Chen <hengqi.chen@xxxxxxxxx>
> > > > ---
> > > >  tools/lib/bpf/bpf_tracing.h | 33 +++++++++++++++++++++++++++++++++
> > > >  1 file changed, 33 insertions(+)
> > > >
> > > > diff --git a/tools/lib/bpf/bpf_tracing.h b/tools/lib/bpf/bpf_tracing.h
> > > > index cf980e54d331..7ad9cdea99e1 100644
> > > > --- a/tools/lib/bpf/bpf_tracing.h
> > > > +++ b/tools/lib/bpf/bpf_tracing.h
> > > > @@ -461,4 +461,37 @@ typeof(name(0)) name(struct pt_regs *ctx)                              \
> > > >  }                                                                          \
> > > >  static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args)
> > > >
> > > > +#define ___bpf_syscall_args0()           ctx
> > > > +#define ___bpf_syscall_args1(x)          ___bpf_syscall_args0(), (void *)PT_REGS_PARM1_CORE_SYSCALL(regs)
> > > > +#define ___bpf_syscall_args2(x, args...) ___bpf_syscall_args1(args), (void *)PT_REGS_PARM2_CORE_SYSCALL(regs)
> > > > +#define ___bpf_syscall_args3(x, args...) ___bpf_syscall_args2(args), (void *)PT_REGS_PARM3_CORE_SYSCALL(regs)
> > > > +#define ___bpf_syscall_args4(x, args...) ___bpf_syscall_args3(args), (void *)PT_REGS_PARM4_CORE_SYSCALL(regs)
> > > > +#define ___bpf_syscall_args5(x, args...) ___bpf_syscall_args4(args), (void *)PT_REGS_PARM5_CORE_SYSCALL(regs)
> > > > +#define ___bpf_syscall_args(args...)     ___bpf_apply(___bpf_syscall_args, ___bpf_narg(args))(args)
> > > > +
> > > > +/*
> > > > + * BPF_KPROBE_SYSCALL is a variant of BPF_KPROBE, which is intended for
> > > > + * tracing syscall functions, like __x64_sys_close. It hides the underlying
> > > > + * platform-specific low-level way of getting syscall input arguments from
> > > > + * struct pt_regs, and provides a familiar typed and named function arguments
> > > > + * syntax and semantics of accessing syscall input parameters.
> > > > + *
> > > > + * Original struct pt_regs* context is preserved as 'ctx' argument. This might
> > > > + * be necessary when using BPF helpers like bpf_perf_event_output().
> > > > + */
> > >
> > > LGTM. Please also mention that this macro relies on CO-RE so that
> > > users are aware.
> > >
> >
> > Now that Ilya's fixes are in again, added a small note about reliance
> > on BPF CO-RE and pushed to bpf-next, thanks.
> >
> >
> > On a relevant note. The whole __x64_sys_close vs sys_close depending
> > on architecture and kernel version was always super annoying. BCC
> > makes this transparent to users (AFAIK) and it always bothered me a
> > little, but I didn't see a clean solution that fits libbpf.
> >
> > I think I finally found it, though. Instead of guessing whether the
> > kprobe function is a syscall or not based on "sys_" prefix of a kernel
> > function, we can use libbpf SEC() handling to do this transparently.
> > What if we define two new SEC() definitions:
> >
> > SEC("ksyscall/write") and SEC("kretsyscall/write") (or maybe
> > SEC("kprobe.syscall/write") and SEC("kretprobe.syscall/write"), not
> > sure which one is better, voice your opinion, please). And for such
> > special kprobes, libbpf will perform feature detection of this
> > ARCH_SYSCALL_WRAPPER (we'll need to see the best way to do this in a
> > simple and fast way, preferably without parsing kallsyms) and
> > depending on it substitute either sys_write (or should it be
> > __se_sys_write, according to Naveen) or __<arch>_sys_write. You get
> > the idea.
> >
> > I like that this is still explicit and in the spirit of libbpf, but
> > offloads the burden of knowing these intricate differences from users.
> >
> > Thoughts?
>
> I think it will be just as fragile.
> That syscall prefix was changed by the kernel few times now.
> libbpf will be chasing the moving target.
> I think keeping the magic in .h is simpler and less of a maintenance burden.

Absolutely it is simpler, but it only works for selftests (and even
then, when prefix changes again we'll need to adjust selftests). But
the point here is to not require end users to chase this instead. And
in the real world where users want to work on as wide a range of
kernel versions as possible SYS_PREFIX trick doesn't work (which is
why I didn't want to add that to bpf_tracing.h).

It's cheaper (maintenance-wise) to do it in libbpf than require all
the kprobe users to keep track of this, no? And good thing is that
libbpf is part of the kernel repo, so whenever someone changes this in
the kernel we'll get a failing selftest on next net-next -> bpf->next
sync (at least for architectures that we do test), so we'll be able to
fix and adjust quickly.