On Mon, Apr 26, 2021 at 8:25 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Mon, Apr 26, 2021 at 03:22:36PM -0700, Andrii Nakryiko wrote: > > On Thu, Apr 22, 2021 at 5:27 PM Alexei Starovoitov > > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > > > From: Alexei Starovoitov <ast@xxxxxxxxxx> > > > > > > The BPF program loading process performed by libbpf is quite complex > > > and consists of the following steps: > > > "open" phase: > > > - parse elf file and remember relocations, sections > > > - collect externs and ksyms including their btf_ids in prog's BTF > > > - patch BTF datasec (since llvm couldn't do it) > > > - init maps (old style map_def, BTF based, global data map, kconfig map) > > > - collect relocations against progs and maps > > > "load" phase: > > > - probe kernel features > > > - load vmlinux BTF > > > - resolve externs (kconfig and ksym) > > > - load program BTF > > > - init struct_ops > > > - create maps > > > - apply CO-RE relocations > > > - patch ld_imm64 insns with src_reg=PSEUDO_MAP, PSEUDO_MAP_VALUE, PSEUDO_BTF_ID > > > - reposition subprograms and adjust call insns > > > - sanitize and load progs > > > > > > During this process libbpf does sys_bpf() calls to load BTF, create maps, > > > populate maps and finally load programs. > > > Instead of actually doing the syscalls generate a trace of what libbpf > > > would have done and represent it as the "loader program". > > > The "loader program" consists of single map with: > > > - union bpf_attr(s) > > > - BTF bytes > > > - map value bytes > > > - insns bytes > > > and single bpf program that passes bpf_attr(s) and data into bpf_sys_bpf() helper. > > > Executing such "loader program" via bpf_prog_test_run() command will > > > replay the sequence of syscalls that libbpf would have done which will result > > > the same maps created and programs loaded as specified in the elf file. > > > The "loader program" removes libelf and majority of libbpf dependency from > > > program loading process. > > > > > > kconfig, typeless ksym, struct_ops and CO-RE are not supported yet. > > > > > > The order of relocate_data and relocate_calls had to change, so that > > > bpf_gen__prog_load() can see all relocations for a given program with > > > correct insn_idx-es. > > > > > > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx> > > > --- > > > tools/lib/bpf/Build | 2 +- > > > tools/lib/bpf/bpf_gen_internal.h | 40 ++ > > > tools/lib/bpf/gen_loader.c | 615 +++++++++++++++++++++++++++++++ > > > tools/lib/bpf/libbpf.c | 204 ++++++++-- > > > tools/lib/bpf/libbpf.h | 12 + > > > tools/lib/bpf/libbpf.map | 1 + > > > tools/lib/bpf/libbpf_internal.h | 2 + > > > tools/lib/bpf/skel_internal.h | 105 ++++++ > > > 8 files changed, 948 insertions(+), 33 deletions(-) > > > create mode 100644 tools/lib/bpf/bpf_gen_internal.h > > > create mode 100644 tools/lib/bpf/gen_loader.c > > > create mode 100644 tools/lib/bpf/skel_internal.h > > > > > > diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build > > > index 9b057cc7650a..430f6874fa41 100644 > > > --- a/tools/lib/bpf/Build > > > +++ b/tools/lib/bpf/Build > > > @@ -1,3 +1,3 @@ > > > libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o \ > > > netlink.o bpf_prog_linfo.o libbpf_probes.o xsk.o hashmap.o \ > > > - btf_dump.o ringbuf.o strset.o linker.o > > > + btf_dump.o ringbuf.o strset.o linker.o gen_loader.o > > > diff --git a/tools/lib/bpf/bpf_gen_internal.h b/tools/lib/bpf/bpf_gen_internal.h > > > new file mode 100644 > > > index 000000000000..dc3e2cbf9ce3 > > > --- /dev/null > > > +++ b/tools/lib/bpf/bpf_gen_internal.h > > > @@ -0,0 +1,40 @@ > > > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > > > +/* Copyright (c) 2021 Facebook */ > > > +#ifndef __BPF_GEN_INTERNAL_H > > > +#define __BPF_GEN_INTERNAL_H > > > + > > > +struct relo_desc { > > > > there is very similarly named reloc_desc struct in libbpf.c, can you > > rename it to something like gen_btf_relo_desc? > > sure. > > > > + const char *name; > > > + int kind; > > > + int insn_idx; > > > +}; > > > + > > > > [...] > > > > > + > > > +static int bpf_gen__realloc_insn_buf(struct bpf_gen *gen, __u32 size) > > > +{ > > > + size_t off = gen->insn_cur - gen->insn_start; > > > + > > > + if (gen->error) > > > + return gen->error; > > > + if (size > INT32_MAX || off + size > INT32_MAX) { > > > + gen->error = -ERANGE; > > > + return -ERANGE; > > > + } > > > + gen->insn_start = realloc(gen->insn_start, off + size); > > > > leaking memory here: gen->insn_start will be NULL on failure > > ohh. good catch. > > > > + if (!gen->insn_start) { > > > + gen->error = -ENOMEM; > > > + return -ENOMEM; > > > + } > > > + gen->insn_cur = gen->insn_start + off; > > > + return 0; > > > +} > > > + > > > +static int bpf_gen__realloc_data_buf(struct bpf_gen *gen, __u32 size) > > > +{ > > > + size_t off = gen->data_cur - gen->data_start; > > > + > > > + if (gen->error) > > > + return gen->error; > > > + if (size > INT32_MAX || off + size > INT32_MAX) { > > > + gen->error = -ERANGE; > > > + return -ERANGE; > > > + } > > > + gen->data_start = realloc(gen->data_start, off + size); > > > > same as above > > > > > + if (!gen->data_start) { > > > + gen->error = -ENOMEM; > > > + return -ENOMEM; > > > + } > > > + gen->data_cur = gen->data_start + off; > > > + return 0; > > > +} > > > + > > > > [...] > > > > > + > > > +static void bpf_gen__emit_sys_bpf(struct bpf_gen *gen, int cmd, int attr, int attr_size) > > > +{ > > > + bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_1, cmd)); > > > + bpf_gen__emit2(gen, BPF_LD_IMM64_RAW_FULL(BPF_REG_2, BPF_PSEUDO_MAP_IDX_VALUE, 0, 0, 0, attr)); > > > > is attr an offset into a blob? if yes, attr_off? or attr_base_off, > > anything with _off > > yes. it's an offset into a blob, but I don't use _off anywhere > otherwise all variables through out would have to have _off which is too verbose. After reading bpf_gen__emit_rel_store() which uses "off" I assumed offset terminology will be used everywhere, but I got used to that by the end I finished reading, so I don't care anymore. > > > > + bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_3, attr_size)); > > > + bpf_gen__emit(gen, BPF_EMIT_CALL(BPF_FUNC_sys_bpf)); > > > + /* remember the result in R7 */ > > > + bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_7, BPF_REG_0)); > > > +} > > > + > > > +static void bpf_gen__emit_check_err(struct bpf_gen *gen) > > > +{ > > > + bpf_gen__emit(gen, BPF_JMP_IMM(BPF_JSGE, BPF_REG_7, 0, 2)); > > > + bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_0, BPF_REG_7)); > > > + bpf_gen__emit(gen, BPF_EXIT_INSN()); > > > +} > > > + > > > +static void __bpf_gen__debug(struct bpf_gen *gen, int reg1, int reg2, const char *fmt, va_list args) > > > > Can you please leave a comment on what reg1 and reg2 is, it's not very > > clear and the code clearly assumes that it can't be reg[1-4]. It's > > probably those special R7 and R9 (or -1, of course), but having a > > short comment makes sense to not jump around trying to figure out > > possible inputs. > > > > Oh, reading further, it can also be R0. > > good point. will add a comment. > > > > +{ > > > + char buf[1024]; > > > + int addr, len, ret; > > > + > > > + if (!gen->log_level) > > > + return; > > > + ret = vsnprintf(buf, sizeof(buf), fmt, args); > > > + if (ret < 1024 - 7 && reg1 >= 0 && reg2 < 0) > > > + /* The special case to accommodate common bpf_gen__debug_ret(): > > > + * to avoid specifying BPF_REG_7 and adding " r=%%d" to prints explicitly. > > > + */ > > > + strcat(buf, " r=%d"); > > > + len = strlen(buf) + 1; > > > + addr = bpf_gen__add_data(gen, buf, len); > > > > nit: offset, not address, right? > > it's actually an address. > From pov of the program and insn below: > > > > + bpf_gen__emit2(gen, BPF_LD_IMM64_RAW_FULL(BPF_REG_1, BPF_PSEUDO_MAP_IDX_VALUE, 0, 0, 0, addr)); > > I guess it's kinda both. It's an offset within global data, but no one calls > int var; > &var -> taking an offset here? no. It's taking an address of the glob var. > Our implementation of global data is via single map value element. > So global vars have offsets too. > imo addr is more accurate here and through out this file. alright > > > > + bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_2, len)); > > > + if (reg1 >= 0) > > > + bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_3, reg1)); > > > + if (reg2 >= 0) > > > + bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_4, reg2)); > > > + bpf_gen__emit(gen, BPF_EMIT_CALL(BPF_FUNC_trace_printk)); > > > +} > > > + > > > > [...] > > > > > +int bpf_gen__finish(struct bpf_gen *gen) > > > +{ > > > + int i; > > > + > > > + bpf_gen__emit_sys_close_stack(gen, stack_off(btf_fd)); > > > + for (i = 0; i < gen->nr_progs; i++) > > > + bpf_gen__move_stack2ctx(gen, > > > + sizeof(struct bpf_loader_ctx) + > > > + sizeof(struct bpf_map_desc) * gen->nr_maps + > > > + sizeof(struct bpf_prog_desc) * i + > > > + offsetof(struct bpf_prog_desc, prog_fd), 4, > > > + stack_off(prog_fd[i])); > > > + for (i = 0; i < gen->nr_maps; i++) > > > + bpf_gen__move_stack2ctx(gen, > > > + sizeof(struct bpf_loader_ctx) + > > > + sizeof(struct bpf_map_desc) * i + > > > + offsetof(struct bpf_map_desc, map_fd), 4, > > > + stack_off(map_fd[i])); > > > + bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_0, 0)); > > > + bpf_gen__emit(gen, BPF_EXIT_INSN()); > > > + pr_debug("bpf_gen__finish %d\n", gen->error); > > > > maybe prefix all those pr_debug()s with "gen: " to distinguish them > > from the rest of libbpf logging? > > sure. > > > > + if (!gen->error) { > > > + struct gen_loader_opts *opts = gen->opts; > > > + > > > + opts->insns = gen->insn_start; > > > + opts->insns_sz = gen->insn_cur - gen->insn_start; > > > + opts->data = gen->data_start; > > > + opts->data_sz = gen->data_cur - gen->data_start; > > > + } > > > + return gen->error; > > > +} > > > + > > > +void bpf_gen__free(struct bpf_gen *gen) > > > +{ > > > + if (!gen) > > > + return; > > > + free(gen->data_start); > > > + free(gen->insn_start); > > > + gen->data_start = NULL; > > > + gen->insn_start = NULL; > > > > what's the point of NULL'ing them out if you don't clear gen->data_cur > > and gen->insn_cur? > > To spot the bugs quicker if there are issues. > > > also should it free(gen) itself? > > ohh. bpf_object__close() should probably do it to stay symmetrical with calloc. in libbpf usually xxx__free() frees the object itself, so doing free(gen) here would be consistent with that, IMO > > > > +} > > > + > > > +void bpf_gen__load_btf(struct bpf_gen *gen, const void *btf_raw_data, __u32 btf_raw_size) > > > +{ > > > + union bpf_attr attr = {}; > > > > here and below: memset(0)? > > that's unnecessary. there is no backward/forward compat issue here. > and bpf_attr doesn't have gaps inside. systemd definitely had a problem with non-zero padding with such usage of bpf_attr recently, but I don't remember which command specifically. Is there any downside to making sure that this will keep working for later bpf_attr changes regardless of whether there are gaps or not? > > > > + int attr_size = offsetofend(union bpf_attr, btf_log_level); > > > + int btf_data, btf_load_attr; > > > + > > > + pr_debug("btf_load: size %d\n", btf_raw_size); > > > + btf_data = bpf_gen__add_data(gen, btf_raw_data, btf_raw_size); > > > + > > > > [...] > > > > > + map_create_attr = bpf_gen__add_data(gen, &attr, attr_size); > > > + if (attr.btf_value_type_id) > > > + /* populate union bpf_attr with btf_fd saved in the stack earlier */ > > > + bpf_gen__move_stack2blob(gen, map_create_attr + offsetof(union bpf_attr, btf_fd), 4, > > > + stack_off(btf_fd)); > > > + switch (attr.map_type) { > > > + case BPF_MAP_TYPE_ARRAY_OF_MAPS: > > > + case BPF_MAP_TYPE_HASH_OF_MAPS: > > > + bpf_gen__move_stack2blob(gen, map_create_attr + offsetof(union bpf_attr, inner_map_fd), > > > + 4, stack_off(inner_map_fd)); > > > + close_inner_map_fd = true; > > > + break; > > > + default:; > > > > default: > > break; > > why? because it's consistent with all the code in libbpf: $ rg --multiline 'default:\s*\n\s*break;' tools/lib/bpf returns quite a few cases, while $ rg --multiline 'default:\s*;' tools/lib/bpf returns none > > > > + } > > > + /* emit MAP_CREATE command */ > > > + bpf_gen__emit_sys_bpf(gen, BPF_MAP_CREATE, map_create_attr, attr_size); > > > + bpf_gen__debug_ret(gen, "map_create %s idx %d type %d value_size %d", > > > + attr.map_name, map_idx, map_attr->map_type, attr.value_size); > > > + bpf_gen__emit_check_err(gen); > > > > what will happen on error with inner_map_fd and all the other fds > > created by now? > > that's a todo item. I'll add a comment that error path is far from perfect. ok > > > > + /* remember map_fd in the stack, if successful */ > > > + if (map_idx < 0) { > > > + /* This bpf_gen__map_create() function is called with map_idx >= 0 for all maps > > > + * that libbpf loading logic tracks. > > > + * It's called with -1 to create an inner map. > > > + */ > > > + bpf_gen__emit(gen, BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_7, stack_off(inner_map_fd))); > > > + } else { > > > + if (map_idx != gen->nr_maps) { > > > > why would that happen? defensive programming? and even then `if () {} > > else if () {} else {}` structure is more appropriate > > sure. will use that style. > > > > + gen->error = -EDOM; /* internal bug */ > > > + return; > > > + } > > > + bpf_gen__emit(gen, BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_7, stack_off(map_fd[map_idx]))); > > > + gen->nr_maps++; > > > + } > > > + if (close_inner_map_fd) > > > + bpf_gen__emit_sys_close_stack(gen, stack_off(inner_map_fd)); > > > +} > > > + > > > > [...] > > > > > +static void bpf_gen__cleanup_relos(struct bpf_gen *gen, int insns) > > > +{ > > > + int i, insn; > > > + > > > + for (i = 0; i < gen->relo_cnt; i++) { > > > + if (gen->relos[i].kind != BTF_KIND_VAR) > > > + continue; > > > + /* close fd recorded in insn[insn_idx + 1].imm */ > > > + insn = insns + sizeof(struct bpf_insn) * (gen->relos[i].insn_idx + 1) > > > + + offsetof(struct bpf_insn, imm); > > > + bpf_gen__emit_sys_close_blob(gen, insn); > > > > wouldn't this close the same FD used across multiple "relos" multiple times? > > no. since every relo has its own fd. > Right now the loader gen is simple and doesn't do preprocessing of > all relos for all progs to avoid complicating the code. > It's good enough for now. I keep forgetting about this split of "this will happen much later in kernel" vs "this is what libbpf is normally doing". This is going to be fun to support. > > > > + } > > > + if (gen->relo_cnt) { > > > + free(gen->relos); > > > + gen->relo_cnt = 0; > > > + gen->relos = NULL; > > > + } > > > +} > > > + > > > > [...] > > > > > + struct bpf_gen *gen_loader; > > > + > > > /* > > > * Information when doing elf related work. Only valid if fd > > > * is valid. > > > @@ -2651,7 +2654,15 @@ static int bpf_object__sanitize_and_load_btf(struct bpf_object *obj) > > > bpf_object__sanitize_btf(obj, kern_btf); > > > } > > > > > > - err = btf__load(kern_btf); > > > + if (obj->gen_loader) { > > > + __u32 raw_size = 0; > > > + const void *raw_data = btf__get_raw_data(kern_btf, &raw_size); > > > > this can return NULL on ENOMEM > > good point. > > > > + > > > + bpf_gen__load_btf(obj->gen_loader, raw_data, raw_size); > > > + btf__set_fd(kern_btf, 0); > > > > why setting fd to 0 (stdin)? does gen depend on this somewhere? The > > problem is that it will eventually be closed on btf__free(), which > > will close stdin, causing a big surprise. What will happen if you > > leave it at -1? > > unfortunately there are various piece of the code that check <0 > means not init. > I can try to fix them all, but it felt unncessary complex vs screwing up stdin. > It's bpftool's stdin, but you're right that it's not pretty. > I can probably use special very large FD number and check for it. > Still cleaner than fixing all checks. Maybe after generation go over each prog/map/BTF and reset their FDs to -1 if gen_loader is used? Or I guess we can just sprinkle if (!obj->gen_loader) close(fd); in the right places. Not great from code readability, but at least won't have spurious close()s. > > > > > > + } else { > > > + err = btf__load(kern_btf); > > > + } > > > if (sanitize) { > > > if (!err) { > > > /* move fd to libbpf's BTF */ > > > @@ -4262,6 +4273,12 @@ static bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id f > > > struct kern_feature_desc *feat = &feature_probes[feat_id]; > > > int ret; > > > > > > + if (obj->gen_loader) > > > + /* To generate loader program assume the latest kernel > > > + * to avoid doing extra prog_load, map_create syscalls. > > > + */ > > > + return true; > > > + > > > if (READ_ONCE(feat->res) == FEAT_UNKNOWN) { > > > ret = feat->probe(); > > > if (ret > 0) { > > > @@ -4344,6 +4361,13 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map) > > > char *cp, errmsg[STRERR_BUFSIZE]; > > > int err, zero = 0; > > > > > > + if (obj->gen_loader) { > > > + bpf_gen__map_update_elem(obj->gen_loader, map - obj->maps, > > > > it would be great for bpf_gen__map_update_elem to reflect that it's > > not a generic map_update_elem() call, rather special internal map > > update (just use bpf_gen__populate_internal_map?) Whether to freeze or > > not could be just a flag to the same call, they always go together. > > It's actually generic map_update_elem. I haven't used it in inner map init yet. > Still on my todo list. Right now it assumes the key is zero and sizeof(int), I was wondering if you are going to capture generic key in the future. Ok. > > > > + map->mmaped, map->def.value_size); > > > + if (map_type == LIBBPF_MAP_RODATA || map_type == LIBBPF_MAP_KCONFIG) > > > + bpf_gen__map_freeze(obj->gen_loader, map - obj->maps); > > > + return 0; > > > + } > > > err = bpf_map_update_elem(map->fd, &zero, map->mmaped, 0); > > > if (err) { > > > err = -errno; [...] > > > @@ -9387,7 +9521,13 @@ static int libbpf_find_attach_btf_id(struct bpf_program *prog, int *btf_obj_fd, > > > } > > > > > > /* kernel/module BTF ID */ > > > - err = find_kernel_btf_id(prog->obj, attach_name, attach_type, btf_obj_fd, btf_type_id); > > > + if (prog->obj->gen_loader) { > > > + bpf_gen__record_attach_target(prog->obj->gen_loader, attach_name, attach_type); > > > + *btf_obj_fd = 0; > > > > this will leak kernel module BTF FDs > > I don't follow. > When gen_loader is happening the find_kernel_btf_id() is not called and modules BTFs are not loaded. Yeah, my bad again, I misread diff as if find_kernel_btf_id() was still called before. Keep forgetting about this "in the kernel, in the future" semantics. Never mind. > > > > + *btf_type_id = 1; > > > + } else { > > > + err = find_kernel_btf_id(prog->obj, attach_name, attach_type, btf_obj_fd, btf_type_id); > > > + } > > > if (err) { > > > pr_warn("failed to find kernel BTF type ID of '%s': %d\n", attach_name, err); > > > return err; > > > > [...] > > > > > +out: > > > + close(map_fd); > > > + close(prog_fd); > > > > this does close(-1), check >= 0 > > Right. I felt there is no risk in doing that. I guess extra check is fine too. We try to not do this even in selftests, so good to not have spurious close()s.