Re: [PATCH v3 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 2 Aug 2019 23:30:21 -0700

On Fri, Aug 2, 2019 at 2:56 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Aug 02, 2019 at 12:16:52AM -0700, Andrii Nakryiko wrote:
> > On Thu, Aug 1, 2019 at 4:50 PM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Jul 31, 2019 at 11:47:53PM -0700, Andrii Nakryiko wrote:
> > > > This patch implements the core logic for BPF CO-RE offsets relocations.
> > > > Every instruction that needs to be relocated has corresponding
> > > > bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> > > > to match recorded "local" relocation spec against potentially many
> > > > compatible "target" types, creating corresponding spec. Details of the
> > > > algorithm are noted in corresponding comments in the code.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andriin@xxxxxx>
> > > > Acked-by: Song Liu <songliubraving@xxxxxx>
> > > ...
> > > > +             if (btf_is_composite(t)) {
> > > > +                     const struct btf_member *m = (void *)(t + 1);
> > > > +                     __u32 offset;
> > > > +
> > > > +                     if (access_idx >= BTF_INFO_VLEN(t->info))
> > > > +                             return -EINVAL;
> > > > +
> > > > +                     m = &m[access_idx];
> > > > +
> > > > +                     if (BTF_INFO_KFLAG(t->info)) {
> > > > +                             if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> > > > +                                     return -EINVAL;
> > > > +                             offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> > > > +                     } else {
> > > > +                             offset = m->offset;
> > > > +                     }
> > >
> > > very similar logic exists in btf_dump.c
> > > probably makes sense to make a common helper at some point.
> >
> > Will add btf_member_bit_offset(type, member) and
> > btf_member_bit_size(type, member).
> >
> > >
> > > > +static size_t bpf_core_essential_name_len(const char *name)
> > > > +{
> > > > +     size_t n = strlen(name);
> > > > +     int i = n - 3;
> > > > +
> > > > +     while (i > 0) {
> > > > +             if (name[i] == '_' && name[i + 1] == '_' && name[i + 2] == '_')
> > > > +                     return i;
> > > > +             i--;
> > > > +     }
> > > > +     return n;
> > > > +}
> > >
> > > that's a bit of an eye irritant. How about?
> > >         size_t n = strlen(name);
> > >         int i, cnt = 0;
> > >
> > >         for (i = n - 1; i >= 0; i--) {
> > >                 if (name[i] == '_') {
> > >                     cnt++;
> > >                 } else {
> > >                    if (cnt == 3)
> > >                       return i + 1;
> > >                    cnt = 0;
> > >                 }
> > >         }
> > >         return n;
> >
> > I find this one much harder to read and understand. What's
> > eye-irritating about that loop?
> >
> > Your loop will also handle `a____b` differently. My version will
> > return "a_" as essential name, yours "a____b". Was this intentional on
> > your part?
>
> hmm. I think both will return sizeof("a") == 1

nope, there are 4 underscores, your implementation will bump cnt to 4
without checking it for `cnt == 3`, so will never detect flavor and
will just return sizeof("a____b"). It's easily fixable, but the point
is that my original irritating code is very straightforward and harder
to get wrong an, easier to understand at a glimpse, yours require much
more conscious thought to understand.

>
> > I'd rather use this instead, if you hate the first one:
> >
> > size_t n = strlen(name);
> > int i;
> >
> > for (i = n - 3; i > 0; i--) {
> >     if (strncmp(name + i, "___", 3) == 0)
> >         return i;
> > }
> >
> > Is this better?
>
> that is worse.
> What I don't like about it is that every byte is
> compared N=sizeof(string-to-found) times.
> I guess it's not such a big performance criticial path,
> but libbpf has to keep the bar high.

We are talking about searching for *three* characters in a short
string. Performance difference is negligible at best, unnoticeable at
worst. I'd rather have straightforward and easy code, but I'll rewrite
it as a state machine the way you proposed.

>
> > >
> > > > +     case BTF_KIND_ARRAY: {
> > > > +             const struct btf_array *loc_a, *targ_a;
> > > > +
> > > > +             loc_a = (void *)(local_type + 1);
> > > > +             targ_a = (void *)(targ_type + 1);
> > > > +             local_id = loc_a->type;
> > > > +             targ_id = targ_a->type;
> > >
> > > can we add a helper like:
> >
> > Yes, we can. I was thinking about that, but decided to not expand
> > patch set. But we do need to extract all those small, but nice
> > helpers. I'll put them in libbpf_internal.h for now, but I think it
> > might be good idea to expose them as part of btf.h. Thoughts?
>
> part of btf.h make sense to me.
>
> >
> > > const struct btf_array *btf_array(cosnt struct btf_type *t)
> > > {
> > >         return (const struct btf_array *)(t + 1);
> > > }
> > >
> > > then above will be:
> > >         case BTF_KIND_ARRAY: {
> > >                 local_id = btf_array(local_type)->type;
> > >                 targ_id = btf_array(targ_type)->type;
> > >
> > > and a bunch of code in btf.c and btf_dump.c would be cleaner as well?
> >
> > Yep, some of those are already scattered around btf.c and btf_dump.c,
> > will clean up and add patch to this patch set.
> >
> > >
> > > > +             goto recur;
> > > > +     }
> > > > +     default:
> > > > +             pr_warning("unexpected kind %d relocated, local [%d], target [%d]\n",
> > > > +                        kind, local_id, targ_id);
> > > > +             return 0;
> > > > +     }
> > > > +}
> > > > +
> > > > +/*
> > > > + * Given single high-level named field accessor in local type, find
> > > > + * corresponding high-level accessor for a target type. Along the way,
> > > > + * maintain low-level spec for target as well. Also keep updating target
> > > > + * offset.
> > > > + *
> > > > + * Searching is performed through recursive exhaustive enumeration of all
> > > > + * fields of a struct/union. If there are any anonymous (embedded)
> > > > + * structs/unions, they are recursively searched as well. If field with
> > > > + * desired name is found, check compatibility between local and target types,
> > > > + * before returning result.
> > > > + *
> > > > + * 1 is returned, if field is found.
> > > > + * 0 is returned if no compatible field is found.
> > > > + * <0 is returned on error.
> > > > + */
> > > > +static int bpf_core_match_member(const struct btf *local_btf,
> > > > +                              const struct bpf_core_accessor *local_acc,
> > > > +                              const struct btf *targ_btf,
> > > > +                              __u32 targ_id,
> > > > +                              struct bpf_core_spec *spec,
> > > > +                              __u32 *next_targ_id)
> > > > +{
> > > > +     const struct btf_type *local_type, *targ_type;
> > > > +     const struct btf_member *local_member, *m;
> > > > +     const char *local_name, *targ_name;
> > > > +     __u32 local_id;
> > > > +     int i, n, found;
> > > > +
> > > > +     targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> > > > +     if (!targ_type)
> > > > +             return -EINVAL;
> > > > +     if (!btf_is_composite(targ_type))
> > > > +             return 0;
> > > > +
> > > > +     local_id = local_acc->type_id;
> > > > +     local_type = btf__type_by_id(local_btf, local_id);
> > > > +     local_member = (void *)(local_type + 1);
> > > > +     local_member += local_acc->idx;
> > > > +     local_name = btf__name_by_offset(local_btf, local_member->name_off);
> > > > +
> > > > +     n = BTF_INFO_VLEN(targ_type->info);
> > > > +     m = (void *)(targ_type + 1);
> > >
> > > new btf_member() helper?
> > >
> > > > +     for (i = 0; i < n; i++, m++) {
> > > > +             __u32 offset;
> > > > +
> > > > +             /* bitfield relocations not supported */
> > > > +             if (BTF_INFO_KFLAG(targ_type->info)) {
> > > > +                     if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> > > > +                             continue;
> > > > +                     offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> > > > +             } else {
> > > > +                     offset = m->offset;
> > > > +             }
> > > > +             if (offset % 8)
> > > > +                     continue;
> > >
> > > same bit of code again?
> > > definitely could use a helper.
> >
> > Different handling (continue here, return error above), but can use
> > those helpers I mentioned above.
> >
> > >
> > > > +     for (i = 0; i < local_spec->len; i++, local_acc++, targ_acc++) {
> > > > +             targ_type = skip_mods_and_typedefs(targ_spec->btf, targ_id,
> > > > +                                                &targ_id);
> > > > +             if (!targ_type)
> > > > +                     return -EINVAL;
> > > > +
> > > > +             if (local_acc->name) {
> > > > +                     matched = bpf_core_match_member(local_spec->btf,
> > > > +                                                     local_acc,
> > > > +                                                     targ_btf, targ_id,
> > > > +                                                     targ_spec, &targ_id);
> > > > +                     if (matched <= 0)
> > > > +                             return matched;
> > > > +             } else {
> > > > +                     /* for i=0, targ_id is already treated as array element
> > > > +                      * type (because it's the original struct), for others
> > > > +                      * we should find array element type first
> > > > +                      */
> > > > +                     if (i > 0) {
> > > > +                             const struct btf_array *a;
> > > > +
> > > > +                             if (!btf_is_array(targ_type))
> > > > +                                     return 0;
> > > > +
> > > > +                             a = (void *)(targ_type + 1);
> > > > +                             if (local_acc->idx >= a->nelems)
> > > > +                                     return 0;
> > >
> > > am I reading it correctly that the local spec requested out-of-bounds
> > > index in the target array type?
> > > Why this is 'ignore' instead of -EINVAL?
> >
> > Similar to any other mismatch (e.g., int in local type vs int64 in
> > target type). It just makes candidate not matching. Why would that be
> > error that will stop the whole relocation and subsequently object
> > loading process?
>
> Did the field name match or this is for anon types?
> I've read it as names matched and type miscompared.

No, not anonymous.

struct my_struct___local {
    int a;
};

struct my_struct___target {
    long long a;
};

my_struct___local->a will not match my_struct___target->a, but it's
not a reason to stop relocation process due to error.

>
> >
> > >
> > > > +/*
> > > > + * Probe few well-known locations for vmlinux kernel image and try to load BTF
> > > > + * data out of it to use for target BTF.
> > > > + */
> > > > +static struct btf *bpf_core_find_kernel_btf(void)
> > > > +{
> > > > +     const char *locations[] = {
> > > > +             "/lib/modules/%1$s/vmlinux-%1$s",
> > > > +             "/usr/lib/modules/%1$s/kernel/vmlinux",
> > > > +     };
> > > > +     char path[PATH_MAX + 1];
> > > > +     struct utsname buf;
> > > > +     struct btf *btf;
> > > > +     int i, err;
> > > > +
> > > > +     err = uname(&buf);
> > > > +     if (err) {
> > > > +             pr_warning("failed to uname(): %d\n", err);
> > >
> > > defensive programming ?
> > > I think uname() can fail only if &buf points to non-existing page like null.
> >
> > I haven't checked source for this syscall, but man page specified that
> > it might return -1 on error.
>
> man page says that it can only return EFAULT.

Ah, yeah, seems to be the only reason. I'll remove the check, it
wasn't paranoia :)

>
> >
> > >
> > > > +             return ERR_PTR(err);
> > > > +     }
> > > > +
> > > > +     for (i = 0; i < ARRAY_SIZE(locations); i++) {
> > > > +             snprintf(path, PATH_MAX, locations[i], buf.release);
> > > > +             pr_debug("attempting to load kernel BTF from '%s'\n", path);
> > >
> > > I think this debug message would have been more useful after access().
> >
> > Sure, will move.
> >
> > >
> > > > +
> > > > +             if (access(path, R_OK))
> > > > +                     continue;
> > > > +
> > > > +             btf = btf__parse_elf(path, NULL);
> > > > +             if (IS_ERR(btf))
> > > > +                     continue;
> > > > +
> > > > +             pr_debug("successfully loaded kernel BTF from '%s'\n", path);
> > > > +             return btf;
> > > > +     }
> > > > +
> > > > +     pr_warning("failed to find valid kernel BTF\n");
> > > > +     return ERR_PTR(-ESRCH);
> > > > +}
> > > > +
> > > > +/* Output spec definition in the format:
> > > > + * [<type-id>] (<type-name>) + <raw-spec> => <offset>@<spec>,
> > > > + * where <spec> is a C-syntax view of recorded field access, e.g.: x.a[3].b
> > > > + */
> > > > +static void bpf_core_dump_spec(int level, const struct bpf_core_spec *spec)
> > > > +{
> > > > +     const struct btf_type *t;
> > > > +     const char *s;
> > > > +     __u32 type_id;
> > > > +     int i;
> > > > +
> > > > +     type_id = spec->spec[0].type_id;
> > > > +     t = btf__type_by_id(spec->btf, type_id);
> > > > +     s = btf__name_by_offset(spec->btf, t->name_off);
> > > > +     libbpf_print(level, "[%u] (%s) + ", type_id, s);
> > >
> > > imo extra []() don't improve readability of the dump.
> >
> > [<num>] is the general convention I've been using throughout libbpf to
> > specify type ID, so I'd rather keep it for consistency. I can drop
> > parens, though, no problem.
> >
> > >
> > > > +
> > > > +     for (i = 0; i < spec->raw_len; i++)
> > > > +             libbpf_print(level, "%d%s", spec->raw_spec[i],
> > > > +                          i == spec->raw_len - 1 ? " => " : ":");
> > > > +
> > > > +     libbpf_print(level, "%u @ &x", spec->offset);
> > > > +
> > > > +     for (i = 0; i < spec->len; i++) {
> > > > +             if (spec->spec[i].name)
> > > > +                     libbpf_print(level, ".%s", spec->spec[i].name);
> > > > +             else
> > > > +                     libbpf_print(level, "[%u]", spec->spec[i].idx);
> > > > +     }
> > > > +
> > > > +}
> > > > +
> > > > +static size_t bpf_core_hash_fn(const void *key, void *ctx)
> > > > +{
> > > > +     return (size_t)key;
> > > > +}
> > > > +
> > > > +static bool bpf_core_equal_fn(const void *k1, const void *k2, void *ctx)
> > > > +{
> > > > +     return k1 == k2;
> > > > +}
> > > > +
> > > > +static void *u32_to_ptr(__u32 x)
> > > > +{
> > > > +     return (void *)(uintptr_t)x;
> > > > +}
> > >
> > > u32 to pointer on 64-bit arch?!
> > > That surely needs a comment.
> >
> > I should probably call it u32_to_hash_key() to make it obvious it's
> > conversion to hashmap's generic `void *` key type.
> >
> > >
> > > > +
> > > > +/*
> > > > + * CO-RE relocate single instruction.
> > > > + *
> > > > + * The outline and important points of the algorithm:
> > > > + * 1. For given local type, find corresponding candidate target types.
> > > > + *    Candidate type is a type with the same "essential" name, ignoring
> > > > + *    everything after last triple underscore (___). E.g., `sample`,
> > > > + *    `sample___flavor_one`, `sample___flavor_another_one`, are all candidates
> > > > + *    for each other. Names with triple underscore are referred to as
> > > > + *    "flavors" and are useful, among other things, to allow to
> > > > + *    specify/support incompatible variations of the same kernel struct, which
> > > > + *    might differ between different kernel versions and/or build
> > > > + *    configurations.
> > > > + *
> > > > + *    N.B. Struct "flavors" could be generated by bpftool's BTF-to-C
> > > > + *    converter, when deduplicated BTF of a kernel still contains more than
> > > > + *    one different types with the same name. In that case, ___2, ___3, etc
> > > > + *    are appended starting from second name conflict. But start flavors are
> > > > + *    also useful to be defined "locally", in BPF program, to extract same
> > > > + *    data from incompatible changes between different kernel
> > > > + *    versions/configurations. For instance, to handle field renames between
> > > > + *    kernel versions, one can use two flavors of the struct name with the
> > > > + *    same common name and use conditional relocations to extract that field,
> > > > + *    depending on target kernel version.
> > >
> > > there are actual kernel types that have ___ in the name.
> > > Ex: struct lmc___media
> > > We probably need to revisit this 'flavor' convention.
> >
> > There are only these:
> > - lmc___softc
> > - lmc___media
> > - lmc___ctl (all three in drivers/net/wan/lmc/lmc_var.h)
> > - ____ftrace_##name set of structs
> >
> > I couldn't come up with anything cleaner-looking. I think we can still
> > keep ___ convention, but:
> >
> > 1. Match only exactly 3 underscores, delimited by non-underscore from
> > both sides (so similar to your proposed loop above);
> > 2. We can also try matching candidates assuming full name without
> > ___xxx part removed, in addition to current logic. This seems like an
> > overkill at this point and unlikely to be useful in practice, so I'd
> > postpone implementing this until we really need it.
> >
> > What do you think? Which other convention did you have in mind?
>
> may be match ___[0-9]+ instead for now?
> Not as flexible, but user supplied "flavors" is not an immediate task.

All the tests I added use non-numeric flavors. While technically I can
use just ___1, ___2 and so on, it will greatly reduce readability,
while not really solving any problem (nothing prevents someone to add
something like lmc___1 eventually).

I think it's not worth it to complicate this logic just for
lmc___{softc,media,ctl}, but we can do 2) - try to match any struct as
is. If that fails, see if it's a "flavor" and match flavors.

>
> >
> > >
> > > > +     for (i = 0, j = 0; i < cand_ids->len; i++) {
> > > > +             cand_id = cand_ids->data[i];
> > > > +             cand_type = btf__type_by_id(targ_btf, cand_id);
> > > > +             cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> > > > +
> > > > +             err = bpf_core_spec_match(&local_spec, targ_btf,
> > > > +                                       cand_id, &cand_spec);
> > > > +             if (err < 0) {
> > > > +                     pr_warning("prog '%s': relo #%d: failed to match spec ",
> > > > +                                prog_name, relo_idx);
> > > > +                     bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> > > > +                     libbpf_print(LIBBPF_WARN,
> > > > +                                  " to candidate #%d [%d] (%s): %d\n",
> > > > +                                  i, cand_id, cand_name, err);
> > > > +                     return err;
> > > > +             }
> > > > +             if (err == 0) {
> > > > +                     pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> > > > +                              prog_name, relo_idx, i, cand_id, cand_name);
> > > > +                     bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> > > > +                     libbpf_print(LIBBPF_DEBUG, "\n");
> > > > +                     continue;
> > > > +             }
> > > > +
> > > > +             pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> > > > +                      prog_name, relo_idx, i);
> > >
> > > did you mention that you're going to make a helper for this debug dumps?
> >
> > yeah, I added bpf_core_dump_spec(), but I don't know how to shorten
> > this further... This output is extremely useful to understand what's
> > happening and will be invaluable when users will inevitably report
> > confusing behavior in some cases, so I still want to keep it.
>
> not sure yet. Just pointing out that this function has more debug printfs
> than actual code which doesn't look right.
> We have complex algorithms in the kernel (like verifier).
> Yet we don't sprinkle printfs in there to this degree.
>

We do have a verbose verifier logging, though, exactly to help users
to debug issues, which is extremely helpful and is greatly appreciated
by users.
There is nothing worse for developer experience than getting -EINVAL
without any useful log message. Been there, banged my head against the
wall wishing for a bit more verbose log. What are we trying to
optimize for here?