Re: [PATCH bpf-next v3 1/2] libbpf: Add BPF_KPROBE_SYSCALL macro

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 8 Feb 2022 21:53:55 -0800

On Mon, Feb 7, 2022 at 1:58 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Feb 7, 2022 at 6:31 AM Hengqi Chen <hengqi.chen@xxxxxxxxx> wrote:
> >
> > Add syscall-specific variant of BPF_KPROBE named BPF_KPROBE_SYSCALL ([0]).
> > The new macro hides the underlying way of getting syscall input arguments.
> > With the new macro, the following code:
> >
> >     SEC("kprobe/__x64_sys_close")
> >     int BPF_KPROBE(do_sys_close, struct pt_regs *regs)
> >     {
> >         int fd;
> >
> >         fd = PT_REGS_PARM1_CORE(regs);
> >         /* do something with fd */
> >     }
> >
> > can be written as:
> >
> >     SEC("kprobe/__x64_sys_close")
> >     int BPF_KPROBE_SYSCALL(do_sys_close, int fd)
> >     {
> >         /* do something with fd */
> >     }
> >
> >   [0] Closes: https://github.com/libbpf/libbpf/issues/425
> >
> > Signed-off-by: Hengqi Chen <hengqi.chen@xxxxxxxxx>
> > ---
> >  tools/lib/bpf/bpf_tracing.h | 33 +++++++++++++++++++++++++++++++++
> >  1 file changed, 33 insertions(+)
> >
> > diff --git a/tools/lib/bpf/bpf_tracing.h b/tools/lib/bpf/bpf_tracing.h
> > index cf980e54d331..7ad9cdea99e1 100644
> > --- a/tools/lib/bpf/bpf_tracing.h
> > +++ b/tools/lib/bpf/bpf_tracing.h
> > @@ -461,4 +461,37 @@ typeof(name(0)) name(struct pt_regs *ctx)                              \
> >  }                                                                          \
> >  static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args)
> >
> > +#define ___bpf_syscall_args0()           ctx
> > +#define ___bpf_syscall_args1(x)          ___bpf_syscall_args0(), (void *)PT_REGS_PARM1_CORE_SYSCALL(regs)
> > +#define ___bpf_syscall_args2(x, args...) ___bpf_syscall_args1(args), (void *)PT_REGS_PARM2_CORE_SYSCALL(regs)
> > +#define ___bpf_syscall_args3(x, args...) ___bpf_syscall_args2(args), (void *)PT_REGS_PARM3_CORE_SYSCALL(regs)
> > +#define ___bpf_syscall_args4(x, args...) ___bpf_syscall_args3(args), (void *)PT_REGS_PARM4_CORE_SYSCALL(regs)
> > +#define ___bpf_syscall_args5(x, args...) ___bpf_syscall_args4(args), (void *)PT_REGS_PARM5_CORE_SYSCALL(regs)
> > +#define ___bpf_syscall_args(args...)     ___bpf_apply(___bpf_syscall_args, ___bpf_narg(args))(args)
> > +
> > +/*
> > + * BPF_KPROBE_SYSCALL is a variant of BPF_KPROBE, which is intended for
> > + * tracing syscall functions, like __x64_sys_close. It hides the underlying
> > + * platform-specific low-level way of getting syscall input arguments from
> > + * struct pt_regs, and provides a familiar typed and named function arguments
> > + * syntax and semantics of accessing syscall input parameters.
> > + *
> > + * Original struct pt_regs* context is preserved as 'ctx' argument. This might
> > + * be necessary when using BPF helpers like bpf_perf_event_output().
> > + */
>
> LGTM. Please also mention that this macro relies on CO-RE so that
> users are aware.
>

Now that Ilya's fixes are in again, added a small note about reliance
on BPF CO-RE and pushed to bpf-next, thanks.

On a related note: the whole __x64_sys_close vs sys_close difference,
which depends on architecture and kernel version, was always super annoying. BCC
makes this transparent to users (AFAIK) and it always bothered me a
little, but I didn't see a clean solution that fits libbpf.

I think I finally found it, though. Instead of guessing whether the
kprobe function is a syscall or not based on the "sys_" prefix of a
kernel function, we can use libbpf SEC() handling to do this
transparently. What if we define two new SEC() definitions:

SEC("ksyscall/write") and SEC("kretsyscall/write") (or maybe
SEC("kprobe.syscall/write") and SEC("kretprobe.syscall/write"); I'm not
sure which one is better, so voice your opinion, please)? For such
special kprobes, libbpf will perform feature detection of
ARCH_SYSCALL_WRAPPER (we'll need to find a simple and fast way to do
this, preferably without parsing kallsyms) and, depending on the result,
substitute either sys_write (or should it be __se_sys_write, according
to Naveen?) or __<arch>_sys_write. You get the idea.
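
To make the substitution step concrete, here is a rough sketch in plain
C. Nothing in it is existing libbpf API: resolve_ksyscall_name(), the
has_syscall_wrapper flag and the arch_prefix parameter are all made up
for illustration, the feature detection itself is left out, and the
sys_ vs __se_sys_ question above is ignored.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper (not an existing libbpf function): build the kernel
 * function name to attach to for a SEC("ksyscall/<name>") program.
 *
 * has_syscall_wrapper stands in for the result of ARCH_SYSCALL_WRAPPER
 * feature detection, which is not shown here. arch_prefix is the
 * per-architecture part of the wrapper name, e.g. "x64".
 *
 * Returns 0 on success, -1 if the resulting name does not fit into buf.
 */
static int resolve_ksyscall_name(const char *syscall_name,
				 const char *arch_prefix,
				 bool has_syscall_wrapper,
				 char *buf, size_t buf_sz)
{
	int n;

	if (has_syscall_wrapper)
		/* wrapper kernels: __<arch>_sys_<name> */
		n = snprintf(buf, buf_sz, "__%s_sys_%s", arch_prefix, syscall_name);
	else
		/* no wrapper: sys_<name> */
		n = snprintf(buf, buf_sz, "sys_%s", syscall_name);

	return (n < 0 || (size_t)n >= buf_sz) ? -1 : 0;
}
```

With something like this, SEC("ksyscall/write") would end up attaching
to __x64_sys_write on an x86-64 kernel with the wrapper and to sys_write
on a kernel without it, and the user never has to spell out either name.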

I like that this is still explicit and in the spirit of libbpf, but
offloads the burden of knowing these intricate differences from users.

Thoughts?

> Unfortunately we had to back out Ilya's patches with
> PT_REGS_SYSCALL_REGS() and PT_REGS_PARMx_CORE_SYSCALL(), so we'll need
> to wait a bit before merging this.
>
>
> > +#define BPF_KPROBE_SYSCALL(name, args...)                                  \
> > +name(struct pt_regs *ctx);                                                 \
> > +static __attribute__((always_inline)) typeof(name(0))                      \
> > +____##name(struct pt_regs *ctx, ##args);                                   \
> > +typeof(name(0)) name(struct pt_regs *ctx)                                  \
> > +{                                                                          \
> > +       struct pt_regs *regs = PT_REGS_SYSCALL_REGS(ctx);                   \
> > +       _Pragma("GCC diagnostic push")                                      \
> > +       _Pragma("GCC diagnostic ignored \"-Wint-conversion\"")              \
> > +       return ____##name(___bpf_syscall_args(args));                       \
> > +       _Pragma("GCC diagnostic pop")                                       \
> > +}                                                                          \
> > +static __attribute__((always_inline)) typeof(name(0))                      \
> > +____##name(struct pt_regs *ctx, ##args)
> > +
> >  #endif
> > --
> > 2.30.2