Re: [PATCH bpf-next 02/10] libbpf: implement BPF CO-RE offset relocation algorithm

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Sat, 27 Jul 2019 12:09:11 -0700

On Sat, Jul 27, 2019 at 11:59 AM Song Liu <songliubraving@xxxxxx> wrote:
>
>
>
> > On Jul 26, 2019, at 11:11 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Thu, Jul 25, 2019 at 12:32 PM Song Liu <songliubraving@xxxxxx> wrote:
> >>
> >>
> >>
> >>> On Jul 24, 2019, at 12:27 PM, Andrii Nakryiko <andriin@xxxxxx> wrote:
> >>>
> >>> This patch implements the core logic for BPF CO-RE offsets relocations.
> >>> All the details are described in code comments.
> >>
> >> Some description in the change log is still useful. Please at least
> >> copy-paste key comments here.
> >
> > OK, will add some more.
> >
> >>
> >> And, this is looooong. I think it is totally possible to split it into
> >> multiple smaller patches.
> >
> > I don't really know how to split it further without hurting reviewing
> > by artificially splitting related code into separate patches. Remove
> > any single function and algorithm will be incomplete.
> >
> > Let me give you some high-level overview of how pieces are put
> > together. There are 9 non-trivial functions, let's go over their
> > purpose in the orderd in which they are defined in file:
> >
> > 1. bpf_core_spec_parse()
> >
> > This one take bpf_offset_reloc's type_id and accessor string
> > ("0:1:2:3") and parses it into more convenient bpf_core_spec
> > datastructure, which has calculated offset and high-level spec
> > "steps": either named field or array access.
> >
> > 2. bpf_core_find_cands()
> >
> > Given local type name, finds all possible target BTF types with same
> > name (modulo "flavor" differences, ___flavor suffix is just ignored).
> >
> > 3. bpf_core_fields_are_compat()
> >
> > Given local and target field match, checks that their types are
> > compatible (so that we don't accidentally match, e.g., int against
> > struct).
> >
> > 4. bpf_core_match_member()
> >
> > Given named local field, find corresponding field in target struct. To
> > understand why it's not trivial, here's an example:
> >
> > Local type:
> >
> > struct s___local {
> >  int a;
> > };
> >
> > Target type:
> >
> > struct s___target {
> >  struct {
> >    union {
> >      int a;
> >    };
> >  };
> > };
> >
> > For both cases you can access a as s.a, but in local case, field a is
> > immediately inside s___local, while for s___target, you have to
> > traverse two levels deeper into anonymous fields to get to an `a`
> > inside anonymous union.
> >
> > So this function find that `a` by doing exhaustive search across all
> > named field and anonymous struct/unions. But otherwise it's pretty
> > straightforward recursive function.
> >
> > bpf_core_spec_match()
> >
> > Just goes over high-level spec steps in local spec and tries to figure
> > out both high-level and low-level steps for targe type. Consider the
> > above example. For both structs accessing s.a is one high-level step,
> > but for s___local it's single low-level step (just another :0 in spec
> > string), while for s___target it's three low-level steps: ":0:0:0",
> > one step for each BTF type we need to traverse.
> >
> > Array access is simpler, it's always one high-level and one low-level step.
> >
> > bpf_core_reloc_insn()
> >
> > Once we match local and target specs and have local and target
> > offsets, do the relocations - check that instruction has expected
> > local offset and replace it with target offset.
> >
> > bpf_core_find_kernel_btf()
> >
> > This is the only function that can be moved into separate patch, but
> > it's also very simple. It just iterates over few known possible
> > locations for vmlinux image and once found, tries to parse .BTF out of
> > it, to be used as target BTF.
> >
> > bpf_core_reloc_offset()
> >
> > It combines all the above functions to perform single relocation.
> > Parse spec, get candidates, for each candidate try to find matching
> > target spec. All candidates that matched are cached for given local
> > root type.
>
> Thanks for these explanation. They are really helpful.
>
> I think some example explaining each step of bpf_core_reloc_offset()
> will be very helpful. Something like:
>
> Example:
>
> struct s {
>         int a;
>         struct {
>                 int b;
>                 bool c;
>         };
> };
>
> To get offset for c, we do:
>
> bpf_core_reloc_offset() {
>
>         /* input data: xxx */
>
>         /* first step: bpf_core_spec_parse() */
>
>         /* data after first step */
>
>         /* second step: bpf_core_find_cands() */
>
>         /* candidate A and B after second step */
>
>         ...
> }
>
> Well, it requires quite some work to document this way. Please let me
> know if you feel this is an overkill.

Yeah :) And it's not just work, but I think it's bad if comments
become too specific and document very low-level steps, because code
might evolve and comments can quickly get out of sync and just add to
confusion. Which is why I tried to document high-level ideas, leaving
it up to the source code to be the ultimate reference of minutia
details.

>
> >
> > bpf_core_reloc_offsets()
> >
> > High-level coordination. Iterate over all per-program .BTF.ext offset
> > reloc sections, each relocation within them. Find corresponding
> > program and try to apply relocations one by one.
> >
> >
> > I think the only non-obvious part here is to understand that
> > relocation records local raw spec with every single anonymous type
> > traversal, which is not that useful when we try to match it against
> > target type, which can have very different composition, but still the
> > same field access pattern, from C language standpoint (which hides all
> > those anonymous type traversals from programmer).
> >
> > But it should be pretty clear now, plus also check tests, they have
> > lots of cases showing what's compatible and what's not.
>
> I see. I will review the tests.
>
> >>>
> >>> static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> >>> -                                                  __u32 id)
> >>> +                                                  __u32 id,
> >>> +                                                  __u32 *res_id)
> >>> {
> >>>      const struct btf_type *t = btf__type_by_id(btf, id);
> >>
> >> Maybe have a local "__u32 rid;"
> >>
> >>>
> >>> +     if (res_id)
> >>> +             *res_id = id;
> >>> +
> >>
> >> and do "rid = id;" here
> >>
> >>>      while (true) {
> >>>              switch (BTF_INFO_KIND(t->info)) {
> >>>              case BTF_KIND_VOLATILE:
> >>>              case BTF_KIND_CONST:
> >>>              case BTF_KIND_RESTRICT:
> >>>              case BTF_KIND_TYPEDEF:
> >>> +                     if (res_id)
> >>> +                             *res_id = t->type;
> >> and here
> >>
> >>>                      t = btf__type_by_id(btf, t->type);
> >>>                      break;
> >>>              default:
> >> and "*res_id = rid;" right before return?
> >
> > Sure, but why?
>
> I think it is cleaner that way. But feel free to ignore if you
> think otherwise.
>
> >
> >>
> >>> @@ -1041,7 +1049,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> >>> static bool get_map_field_int(const char *map_name, const struct btf *btf,
> >>>                            const struct btf_type *def,
> >>>                            const struct btf_member *m, __u32 *res) {
> >
> > [...]
> >
> >>> +struct bpf_core_spec {
> >>> +     const struct btf *btf;
> >>> +     /* high-level spec: named fields and array indicies only */
> >>
> >> typo: indices
> >
> > thanks!
> >
> >>
> >>> +     struct bpf_core_accessor spec[BPF_CORE_SPEC_MAX_LEN];
> >>> +     /* high-level spec length */
> >>> +     int len;
> >>> +     /* raw, low-level spec: 1-to-1 with accessor spec string */
> >>> +     int raw_spec[BPF_CORE_SPEC_MAX_LEN];
> >>> +     /* raw spec length */
> >>> +     int raw_len;
> >>> +     /* field byte offset represented by spec */
> >>> +     __u32 offset;
> >>> +};
> >
> > [...]
> >
> >>> + *
> >>> + *   int x = &s->a[3]; // access string = '0:1:2:3'
> >>> + *
> >>> + * Low-level spec has 1:1 mapping with each element of access string (it's
> >>> + * just a parsed access string representation): [0, 1, 2, 3].
> >>> + *
> >>> + * High-level spec will capture only 3 points:
> >>> + *   - intial zero-index access by pointer (&s->... is the same as &s[0]...);
> >>> + *   - field 'a' access (corresponds to '2' in low-level spec);
> >>> + *   - array element #3 access (corresponds to '3' in low-level spec).
> >>> + *
> >>> + */
> >>
> >> IIUC, high-level points are subset of low-level points. How about we introduce
> >> "anonymous" high-level points, so that high-level points and low-level points
> >> are 1:1 mapping?
> >
> > No, that will just hurt and complicate things. See above explanation
> > about why we need high-level points (it's what you as C programmer try
> > to achieve vs low-level spec is what C-language does in reality, with
> > all the anonymous struct/union traversal).
> >
> > What's wrong with this separation? Think about it as recording
> > "intent" (high-level spec) vs "mechanics" (low-level spec, how exactly
> > to achieve that intent, in excruciating details).
>
> There is nothing wrong with separation. I just personally think it is
> cleaner the other way. That's why I raised the question.
>
> I will go with your assessment, as you looked into this much more than
> I did. :-)

For me it's a machine view of the problem (raw spec) vs human view of
the problem (high-level spec, which resembles how you think about this
in C code). I'll keep it separate unless it proves to be problematic
going forward.

>
> [...]
>
> >>> +
> >>> +     memset(spec, 0, sizeof(*spec));
> >>> +     spec->btf = btf;
> >>> +
> >>> +     /* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */
> >>> +     while (*spec_str) {
> >>> +             if (*spec_str == ':')
> >>> +                     ++spec_str;
> >>> +             if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
> >>> +                     return -EINVAL;
> >>> +             if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> >>> +                     return -E2BIG;
> >>> +             spec_str += parsed_len;
> >>> +             spec->raw_spec[spec->raw_len++] = access_idx;
> >>> +     }
> >>> +
> >>> +     if (spec->raw_len == 0)
> >>> +             return -EINVAL;
> >>> +
> >>> +     for (i = 0; i < spec->raw_len; i++) {
> >>> +             t = skip_mods_and_typedefs(btf, id, &id);
> >>> +             if (!t)
> >>> +                     return -EINVAL;
> >>> +
> >>> +             access_idx = spec->raw_spec[i];
> >>> +
> >>> +             if (i == 0) {
> >>> +                     /* first spec value is always reloc type array index */
> >>> +                     spec->spec[spec->len].type_id = id;
> >>> +                     spec->spec[spec->len].idx = access_idx;
> >>> +                     spec->len++;
> >>> +
> >>> +                     sz = btf__resolve_size(btf, id);
> >>> +                     if (sz < 0)
> >>> +                             return sz;
> >>> +                     spec->offset += access_idx * sz;
> >>          spec->offset = access_idx * sz;  should be enough
> >
> > No. spec->offset is carefully maintained across multiple low-level
> > steps, as we traverse down embedded structs/unions.
> >
> > Think about, e.g.:
> >
> > struct s {
> >    int a;
> >    struct {
> >        int b;
> >    };
> > };
> >
> > Imagine you are trying to match s.b access. With what you propose
> > you'll end up with offset 0, but it should be 4.
>
> Hmm... this is just for i == 0, right? Which line updated spec->offset
> after "memset(spec, 0, sizeof(*spec));"?

Ah, I missed that you are referring to the special i == 0 case. I can
do assignment, yes, you are right. I'll probably also extract it out
of the loop to make it less confusing.

>
> >
> >>
> >>> +                     continue;
> >>> +             }
> >>
> >> Maybe pull i == 0 case out of the for loop?
> >>
> >>> +
> >>> +             if (btf_is_composite(t)) {
> >
> > [...]
> >
> >>> +
> >>> +     if (spec->len == 0)
> >>> +             return -EINVAL;
> >>
> >> Can this ever happen?
> >
> > Not really, because I already check raw_len == 0 and exit with error.
> > I'll remove.
> >
> >>
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >
> > [...]
> >
> >>> +
> >>> +/*
> >>> + * Given single high-level accessor (either named field or array index) in
> >>> + * local type, find corresponding high-level accessor for a target type. Along
> >>> + * the way, maintain low-level spec for target as well. Also keep updating
> >>> + * target offset.
> >>> + */
> >>
> >> Please describe the recursive algorithm here. I am kinda lost.
> >
> > Explained above. I'll extend description a bit. But it's just
> > recursive exhaustive search:
> > 1. if struct field is anonymous and is struct/union, go one level
> > deeper and try to find field with given name inside those.
> > 2. if field has name and it matched what we are searching - check type
> > compatibility. It has to be compatible, so if it's not, then it's not
> > a match.
> >
> >> Also, please document the meaning of zero, positive, negative return values.
> >
> > Ok. It's standard <0 - error, 0 - false, 1 - true.
> >
> >>
> >>> +static int bpf_core_match_member(const struct btf *local_btf,
> >>> +                              const struct bpf_core_accessor *local_acc,
> >>> +                              const struct btf *targ_btf,
> >>> +                              __u32 targ_id,
> >>> +                              struct bpf_core_spec *spec,
> >>> +                              __u32 *next_targ_id)
> >>> +{
> >
> > [...]
> >
> >>> +             if (local_acc->name) {
> >>> +                     if (!btf_is_composite(targ_type))
> >>> +                             return 0;
> >>> +
> >>> +                     matched = bpf_core_match_member(local_spec->btf,
> >>> +                                                     local_acc,
> >>> +                                                     targ_btf, targ_id,
> >>> +                                                     targ_spec, &targ_id);
> >>> +                     if (matched <= 0)
> >>> +                             return matched;
> >>> +             } else {
> >>> +                     /* for i=0, targ_id is already treated as array element
> >>> +                      * type (because it's the original struct), for others
> >>> +                      * we should find array element type first
> >>> +                      */
> >>> +                     if (i > 0) {
> >>
> >> i == 0 case would go into "if (local_acc->name)" branch, no?
> >
> > No, i == 0 is always an array access. s->a.b.c is the same as
> > s[0].a.b.c, so relocation's first spec element is always either zero
> > for pointer access or any non-negative index for array access. But it
> > is always array access.
>
> I see. Thanks for the explanation.
>
> Song