Re: [PATCH v3 bpf-next 1/7] libbpf: add BPF-side of USDT support

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 7 Apr 2022 16:27:32 -0700

On Thu, Apr 7, 2022 at 7:19 AM Ilya Leoshkevich <iii@xxxxxxxxxxxxx> wrote:
>
> On Mon, 2022-04-04 at 16:41 -0700, Andrii Nakryiko wrote:
> > Add BPF-side implementation of libbpf-provided USDT support. This
> > consists of single header library, usdt.bpf.h, which is meant to be
> > used
> > from user's BPF-side source code. This header is added to the list of
> > installed libbpf header, along bpf_helpers.h and others.
> >
> > BPF-side implementation consists of two BPF maps:
> >   - spec map, which contains "a USDT spec" which encodes information
> >     necessary to be able to fetch USDT arguments and other
> > information
> >     (argument count, user-provided cookie value, etc) at runtime;
> >   - IP-to-spec-ID map, which is only used on kernels that don't
> > support
> >     BPF cookie feature. It allows to lookup spec ID based on the
> > place
> >     in user application that triggers USDT program.
> >
> > These maps have default sizes, 256 and 1024, which are chosen
> > conservatively to not waste a lot of space, but handling a lot of
> > common
> > cases. But there could be cases when user application needs to either
> > trace a lot of different USDTs, or USDTs are heavily inlined and
> > their
> > arguments are located in a lot of differing locations. For such cases
> > it
> > might be necessary to size those maps up, which libbpf allows to do
> > by
> > overriding BPF_USDT_MAX_SPEC_CNT and BPF_USDT_MAX_IP_CNT macros.
> >
> > It is an important aspect to keep in mind. Single USDT (user-space
> > equivalent of kernel tracepoint) can have multiple USDT "call sites".
> > That is, single logical USDT is triggered from multiple places in
> > user
> > application. This can happen due to function inlining. Each such
> > inlined
> > instance of USDT invocation can have its own unique USDT argument
> > specification (instructions about the location of the value of each
> > of
> > USDT arguments). So while USDT looks very similar to usual uprobe or
> > kernel tracepoint, under the hood it's actually a collection of
> > uprobes,
> > each potentially needing different spec to know how to fetch
> > arguments.
> >
> > User-visible API consists of three helper functions:
> >   - bpf_usdt_arg_cnt(), which returns number of arguments of current
> > USDT;
> >   - bpf_usdt_arg(), which reads value of specified USDT argument (by
> >     it's zero-indexed position) and returns it as 64-bit value;
> >   - bpf_usdt_cookie(), which functions like BPF cookie for USDT
> >     programs; this is necessary as libbpf doesn't allow specifying
> > actual
> >     BPF cookie and utilizes it internally for USDT support
> > implementation.
> >
> > Each bpf_usdt_xxx() APIs expect struct pt_regs * context, passed into
> > BPF program. On kernels that don't support BPF cookie it is used to
> > fetch absolute IP address of the underlying uprobe.
> >
> > usdt.bpf.h also provides BPF_USDT() macro, which functions like
> > BPF_PROG() and BPF_KPROBE() and allows much more user-friendly way to
> > get access to USDT arguments, if USDT definition is static and known
> > to
> > the user. It is expected that majority of use cases won't have to use
> > bpf_usdt_arg_cnt() and bpf_usdt_arg() directly and BPF_USDT() will
> > cover
> > all their needs.
> >
> > Last, usdt.bpf.h is utilizing BPF CO-RE for one single purpose: to
> > detect kernel support for BPF cookie. If BPF CO-RE dependency is
> > undesirable, user application can redefine BPF_USDT_HAS_BPF_COOKIE to
> > either a boolean constant (or equivalently zero and non-zero), or
> > even
> > point it to its own .rodata variable that can be specified from
> > user's
> > application user-space code. It is important that
> > BPF_USDT_HAS_BPF_COOKIE is known to BPF verifier as static value
> > (thus
> > .rodata and not just .data), as otherwise BPF code will still contain
> > bpf_get_attach_cookie() BPF helper call and will fail validation at
> > runtime, if not dead-code eliminated.
> >
> > Reviewed-by: Alan Maguire <alan.maguire@xxxxxxxxxx>
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  tools/lib/bpf/Makefile   |   2 +-
> >  tools/lib/bpf/usdt.bpf.h | 256
> > +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 257 insertions(+), 1 deletion(-)
> >  create mode 100644 tools/lib/bpf/usdt.bpf.h
>
> [...]
>
> > diff --git a/tools/lib/bpf/usdt.bpf.h b/tools/lib/bpf/usdt.bpf.h
> > new file mode 100644
> > index 000000000000..60237acf6b02
> > --- /dev/null
> > +++ b/tools/lib/bpf/usdt.bpf.h
> > @@ -0,0 +1,256 @@
>
> [...]
>
> > +/* Fetch USDT argument #*arg_num* (zero-indexed) and put its value
> > into *res.
> > + * Returns 0 on success; negative error, otherwise.
> > + * On error *res is guaranteed to be set to zero.
> > + */
> > +static inline __noinline
> > +int bpf_usdt_arg(struct pt_regs *ctx, __u64 arg_num, long *res)
> > +{
> > +       struct __bpf_usdt_spec *spec;
> > +       struct __bpf_usdt_arg_spec *arg_spec;
> > +       unsigned long val;
> > +       int err, spec_id;
> > +
> > +       *res = 0;
> > +
> > +       spec_id = __bpf_usdt_spec_id(ctx);
> > +       if (spec_id < 0)
> > +               return -ESRCH;
> > +
> > +       spec = bpf_map_lookup_elem(&__bpf_usdt_specs, &spec_id);
> > +       if (!spec)
> > +               return -ESRCH;
> > +
> > +       if (arg_num >= BPF_USDT_MAX_ARG_CNT || arg_num >= spec-
> > >arg_cnt)
> > +               return -ENOENT;
> > +
> > +       arg_spec = &spec->args[arg_num];
> > +       switch (arg_spec->arg_type) {
> > +       case BPF_USDT_ARG_CONST:
> > +               /* Arg is just a constant ("-4@$-9" in USDT arg
> > spec).
> > +                * value is recorded in arg_spec->val_off directly.
> > +                */
> > +               val = arg_spec->val_off;
> > +               break;
> > +       case BPF_USDT_ARG_REG:
> > +               /* Arg is in a register (e.g, "8@%rax" in USDT arg
> > spec),
> > +                * so we read the contents of that register directly
> > from
> > +                * struct pt_regs. To keep things simple user-space
> > parts
> > +                * record offsetof(struct pt_regs, <regname>) in
> > arg_spec->reg_off.
> > +                */
> > +               err = bpf_probe_read_kernel(&val, sizeof(val), (void
> > *)ctx + arg_spec->reg_off);
> > +               if (err)
> > +                       return err;
> > +               break;
> > +       case BPF_USDT_ARG_REG_DEREF:
> > +               /* Arg is in memory addressed by register, plus some
> > offset
> > +                * (e.g., "-4@-1204(%rbp)" in USDT arg spec).
> > Register is
> > +                * identified lik with BPF_USDT_ARG_REG case, and the
> > offset
> > +                * is in arg_spec->val_off. We first fetch register
> > contents
> > +                * from pt_regs, then do another user-space probe
> > read to
> > +                * fetch argument value itself.
> > +                */
> > +               err = bpf_probe_read_kernel(&val, sizeof(val), (void
> > *)ctx + arg_spec->reg_off);
> > +               if (err)
> > +                       return err;
> > +               err = bpf_probe_read_user(&val, sizeof(val), (void
> > *)val + arg_spec->val_off);
>
> Is there a reason we always read 8 bytes here?
> What if the user is interested in the last byte of a page?

Well, for one it was simplicity and keeping arg_spec compact (there
could be a lot of them, so keeping it at 16 bytes is useful). But it
also seems like a very unlikely event. And either way even completely
valid page might not be paged in anyways, so some read might fail
anyways.

But I think really it was due to technical limitation.
bpf_probe_read_{kernel,user}() expects statically known size of the
read, so we'd have to have an ugly switch here with branches for 1, 2,
4, and 8, which is ugly. So given highly unlikely situation you
described, I figured it should be fine.

>
> [...]