On Mon, Nov 29, 2021 at 05:03:37PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 23, 2021 at 10:02 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > Given BPF program's BTF root type name perform the following steps:
> > . search in vmlinux candidate cache.
> > . if (present in cache and candidate list >= 1) return candidate list.
> > . do a linear search through kernel BTFs for possible candidates.
> > . regardless of number of candidates found populate vmlinux cache.
> > . if (candidate list >= 1) return candidate list.
> > . search in module candidate cache.
> > . if (present in cache) return candidate list (even if list is empty).
> > . do a linear search through BTFs of all kernel modules
> >   collecting candidates from all of them.
> > . regardless of number of candidates found populate module cache.
> > . return candidate list.
> > Then wire the result into bpf_core_apply_relo_insn().
> >
> > When BPF program is trying to CO-RE relocate a type
> > that doesn't exist in either vmlinux BTF or in modules BTFs
> > these steps will perform 2 cache lookups when cache is hit.
> >
> > Note the cache doesn't prevent the abuse by the program that might
> > have lots of relocations that cannot be resolved. Hence cond_resched().
> >
> > CO-RE in the kernel requires CAP_BPF, since BTF loading requires it.
> >
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > ---
> >  kernel/bpf/btf.c          | 250 +++++++++++++++++++++++++++++++++++++-
> >  tools/lib/bpf/relo_core.h |   2 +
> >  2 files changed, 251 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index dbf1f389b1d3..cf971b8a0769 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -25,6 +25,7 @@
> >  #include <linux/kobject.h>
> >  #include <linux/sysfs.h>
> >  #include <net/sock.h>
> > +#include "../tools/lib/bpf/relo_core.h"
> >
>
> [...]
>
> > +static void populate_cand_cache(struct bpf_cand_cache *cands,
> > +                               struct bpf_cand_cache **cache,
> > +                               int cache_size)
> > +{
> > +       u32 hash = jhash_2words(cands->name_len,
> > +                               (((u32) cands->name[0]) << 8) | cands->name[1], 0);
>
> maybe add a helper func to calculate the hash given struct
> bpf_cand_cache to keep the logic always in sync?

I felt it's trivial enough, but sure, I can do that; something like the
sketch below this hunk.

> > +       struct bpf_cand_cache *cc = cache[hash % cache_size];
> > +
> > +       if (cc)
> > +               bpf_free_cands(cc);
> > +       cache[hash % cache_size] = cands;
> > +}
> > +
>
> [...]
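A factored-out helper could be as small as the sketch below (illustration
only, not the respin itself; the helper name is made up here, the body is
just the jhash_2words() expression from the quoted hunk):

static u32 bpf_cand_cache_hash(const struct bpf_cand_cache *cands)
{
        /* same key derivation as the open-coded version above:
         * essential name length plus the first two name bytes
         */
        return jhash_2words(cands->name_len,
                            (((u32)cands->name[0]) << 8) | cands->name[1], 0);
}

Both check_cand_cache() and populate_cand_cache() would then call
bpf_cand_cache_hash(cands) instead of open-coding the expression.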
>
> > +static struct bpf_cand_cache *
> > +bpf_core_find_cands(struct bpf_core_ctx *ctx, u32 local_type_id)
> > +{
> > +       const struct btf *local_btf = ctx->btf;
> > +       const struct btf_type *local_type;
> > +       const struct btf *main_btf;
> > +       size_t local_essent_len;
> > +       struct bpf_cand_cache *cands, *cc;
> > +       struct btf *mod_btf;
> > +       const char *name;
> > +       int id;
> > +
> > +       local_type = btf_type_by_id(local_btf, local_type_id);
> > +       if (!local_type)
> > +               return ERR_PTR(-EINVAL);
> > +
> > +       name = btf_name_by_offset(local_btf, local_type->name_off);
> > +       if (str_is_empty(name))
> > +               return ERR_PTR(-EINVAL);
> > +       local_essent_len = bpf_core_essential_name_len(name);
> > +
> > +       cands = kcalloc(1, sizeof(*cands), GFP_KERNEL);
> > +       if (!cands)
> > +               return ERR_PTR(-ENOMEM);
> > +       cands->name = kmemdup_nul(name, local_essent_len, GFP_KERNEL);
>
> it's pretty minor, but you don't really need kmemdup_nul() until
> populate_cand_cache(), you can use name as is until you really need to
> cache cands

I thought about it, but didn't do it, because it complicates the code:
bpf_core_add_cands() would somehow need to differentiate whether cands->name
was set up by the first call into bpf_core_add_cands() or by a subsequent
call. It could be a flag in struct bpf_cand_cache telling bpf_free_cands()
whether the name needs to be freed, but that feels unnecessarily complex.

> > +       if (!cands->name) {
> > +               kfree(cands);
> > +               return ERR_PTR(-ENOMEM);
> > +       }
> > +       cands->kind = btf_kind(local_type);
> > +       cands->name_len = local_essent_len;
> > +
> > +       cc = check_cand_cache(cands, vmlinux_cand_cache, VMLINUX_CAND_CACHE_SIZE);
> > +       if (cc) {
> > +               if (cc->cnt) {
> > +                       bpf_free_cands(cands);
> > +                       return cc;
> > +               }
> > +               goto check_modules;
> > +       }
> > +
> > +       /* Attempt to find target candidates in vmlinux BTF first */
> > +       main_btf = bpf_get_btf_vmlinux();
> > +       cands = bpf_core_add_cands(cands, main_btf, 1);
> > +       if (IS_ERR(cands))
> > +               return cands;
> > +
> > +       /* populate cache even when cands->cnt == 0 */
> > +       populate_cand_cache(cands, vmlinux_cand_cache, VMLINUX_CAND_CACHE_SIZE);
> > +
> > +       /* if vmlinux BTF has any candidate, don't go for module BTFs */
> > +       if (cands->cnt)
> > +               return cands;
> > +
> > +check_modules:
> > +       cc = check_cand_cache(cands, module_cand_cache, MODULE_CAND_CACHE_SIZE);
> > +       if (cc) {
> > +               bpf_free_cands(cands);
> > +               /* if cache has it return it even if cc->cnt == 0 */
> > +               return cc;
> > +       }
> > +
> > +       /* If candidate is not found in vmlinux's BTF then search in module's BTFs */
> > +       spin_lock_bh(&btf_idr_lock);
> > +       idr_for_each_entry(&btf_idr, mod_btf, id) {
> > +               if (!btf_is_module(mod_btf))
> > +                       continue;
> > +               /* linear search could be slow hence unlock/lock
> > +                * the IDR to avoiding holding it for too long
> > +                */
> > +               btf_get(mod_btf);
> > +               spin_unlock_bh(&btf_idr_lock);
> > +               cands = bpf_core_add_cands(cands, mod_btf, btf_nr_types(main_btf));
> > +               if (IS_ERR(cands)) {
> > +                       btf_put(mod_btf);
> > +                       return cands;
> > +               }
> > +               spin_lock_bh(&btf_idr_lock);
> > +               btf_put(mod_btf);
>
> either need to additionally btf_get(mod_btf) inside
> bpf_core_add_cands() not btf_put() it here if you added at least one
> candidate, as you are storing targ_btf inside bpf_core_add_cands() and
> dropping refcount might leave dangling pointer

The module will not go away while cands are being searched and the cache
ops are done: purge_cand_cache() is called from MODULE_STATE_GOING.
I've considered doing the purge from btf_put(), but we don't guarantee
the context in there, so taking a mutex_lock would complicate btf_put()
too much. It's simpler to do the purge from MODULE_STATE_GOING.
But more below...
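For readers following along: the lifetime argument above relies on purging
the module candidate cache from the module notifier. A minimal sketch of
that shape follows; everything except MODULE_STATE_GOING, NOTIFY_OK and
register_module_notifier() is an assumption (the purge is presumably wired
into the existing BTF module notifier in btf.c, and purge_cand_cache()'s
real signature may differ):

#include <linux/module.h>
#include <linux/notifier.h>

/* illustration only: invalidate cached candidate lists that may point
 * into the departing module's BTF before that BTF can be freed
 */
static int btf_cand_module_notify(struct notifier_block *nb,
                                  unsigned long op, void *module)
{
        if (op == MODULE_STATE_GOING)
                purge_cand_cache(module);       /* assumed signature */
        return NOTIFY_OK;
}

static struct notifier_block btf_cand_module_nb = {
        .notifier_call = btf_cand_module_notify,
};

/* registered once at init: register_module_notifier(&btf_cand_module_nb); */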
> > +       for (i = 0; i < cc->cnt; i++) {
> > +               bpf_log(ctx->log,
> > +                       "CO-RE relocating %s %s: found target candidate [%d]\n",
> > +                       btf_kind_str[cc->kind], cc->name, cc->cands[i].id);
> > +               cands.cands[i].btf = cc->cands[i].btf;
> > +               cands.cands[i].id = cc->cands[i].id;
> > +       }
> > +       cands.len = cc->cnt;
> > +       mutex_unlock(&cand_cache_mutex);
> > +       }
> > +
>
> cache is not locked at this point, so those cands.cands[i].btf objects
> might be freed now (if module got unloaded meanwhile), right?

Right. It looks easier to do btf_get() here while copying the pointer and
then do a loop of btf_put() after bpf_core_apply_relo_insn() (a sketch is
at the end of this mail).
Doing btf_get() inside bpf_core_add_cands() adds complexity to cache
replacement and purging. Right now __purge_cand_cache() is just a kfree
of the whole slot with multiple btf pointers in there, potentially from
different modules. With btf_get() in add_cands it would need to be a loop
and bpf_free_cands() would need a loop as well.
Btw the earlier versions of this patch set had the same issue, so thanks
for the good catch!

> This global sharing of that small cache seems to cause unnecessary
> headaches, tbh. It adds global mutex (which might also block for
> kcalloc). If you used that cache locally for processing single
> bpf_prog, you wouldn't need the locking. It can probably also simplify
> the refcounting, especially if you just btf_get(targ_btf) for each
> candidate and then btf_put() it after all relos are processed. You are
> also half-step away from removing the size restriction (just chain
> bpf_cand_caches together) and having a fixed bucket-size hash with
> non-fixed chain length (which probably would be totally fine for all
> practical purposes).

And that would be almost a complete hashtable? Why add that complexity?
The size restriction is necessary anyway for a global cache. Even for a
per-program hashtable some size restriction might be necessary.
CAP_BPF is a target for "researchers" to do weird things with the kernel.

>
> > +       err = bpf_core_apply_relo_insn((void *)ctx->log, insn, relo->insn_off / 8,
> > +                                      relo, relo_idx, ctx->btf, &cands);
> > +       kfree(cands.cands);
> > +       return err;
> >  }
> > diff --git a/tools/lib/bpf/relo_core.h b/tools/lib/bpf/relo_core.h
> > index f410691cc4e5..f7b0d698978c 100644
> > --- a/tools/lib/bpf/relo_core.h
> > +++ b/tools/lib/bpf/relo_core.h
> > @@ -8,8 +8,10 @@
> >
> >  struct bpf_core_cand {
> >         const struct btf *btf;
> > +#ifndef __KERNEL__
> >         const struct btf_type *t;
> >         const char *name;
> > +#endif
>
> why? doesn't seem to be used and both t and name could be derived from
> btf and id

Exactly. That's why CO-RE in the kernel doesn't use them, but libbpf uses
both for debugging and for passing information back and forth between
layers.
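To make the discussed fix concrete, here is roughly how the copy-out path
from the hunk above could pin and unpin the candidate BTFs (a sketch of the
idea only, not the actual respin; the casts that drop const are just for
illustration):

        /* take a reference on each candidate's BTF while still holding
         * cand_cache_mutex, so a concurrent module unload cannot free it
         */
        for (i = 0; i < cc->cnt; i++) {
                bpf_log(ctx->log,
                        "CO-RE relocating %s %s: found target candidate [%d]\n",
                        btf_kind_str[cc->kind], cc->name, cc->cands[i].id);
                cands.cands[i].btf = cc->cands[i].btf;
                cands.cands[i].id = cc->cands[i].id;
                btf_get((struct btf *)cc->cands[i].btf);
        }
        cands.len = cc->cnt;
        mutex_unlock(&cand_cache_mutex);

        err = bpf_core_apply_relo_insn((void *)ctx->log, insn, relo->insn_off / 8,
                                       relo, relo_idx, ctx->btf, &cands);

        /* drop the references only after the relocation is done */
        for (i = 0; i < cands.len; i++)
                btf_put((struct btf *)cands.cands[i].btf);
        kfree(cands.cands);
        return err;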