On Wed, Dec 06, 2023 at 07:37:13PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 06, 2023 at 05:38:14PM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 04, 2023 at 05:18:31PM -0800, Alexei Starovoitov wrote:
> >
> > > [   13.978497]  ? asm_exc_invalid_op+0x1a/0x20
> > > [   13.978798]  ? tcp_set_ca_state+0x51/0xd0
> > > [   13.979087]  tcp_v6_syn_recv_sock+0x45c/0x6c0
> > > [   13.979401]  tcp_check_req+0x497/0x590
> >
> > > The stack trace doesn't have any bpf, but it's a bpf issue too.
> > > Here tcp_set_ca_state() calls
> > > icsk->icsk_ca_ops->set_state(sk, ca_state);
> > > which calls bpf prog via bpf trampoline.
> >
> > Specifically, I think this is
> > tools/testing/selftests/bpf/progs/bpf_cubic.c, which has:
> >
> >   .set_state = (void *)bpf_cubic_state,
> >
> > which comes from:
> >
> >   BPF_STRUCT_OPS(bpf_cubic_state, struct sock *sk, __u8 *new_state)
> >
> > which then wraps:
> >
> >   BPF_PROG()
> >
> > which ends up generating:
> >
> >   static __always_inline void
> >   ____bpf_cubic_state(unsigned long long *ctx, struct sock *sk, __u8 *new_state)
> >   {
> >     ...
> >   }
> >
> >   void bpf_cubic_state(unsigned long long *ctx)
> >   {
> >     return ____bpf_cubic_state(ctx, ctx[0], ctx[1]);
> >   }

Yep. That's correct.

> > I think this then uses arch_prepare_bpf_trampoline(), but I'm entirely
> > lost how this all comes together, because the way I understand it the
> > whole bpf_trampoline is used to hook into an ftrace __fentry hook.
> >
> > And a __fentry hook is very much not a function pointer. Help!?!?
>
> kernel/bpf/bpf_struct_ops.c:bpf_struct_ops_prepare_trampoline()
>
> And yeah, it seems to use the ftrace trampoline for indirect calls here,
> *sigh*.

Not quite.
bpf_struct_ops_prepare_trampoline() prepares a trampoline that converts
the native calling convention to the bpf calling convention.
We could have optimized it for the case of x86-64 and num_args <= 5,
where it would be a nop trampoline, but so far it's generic and works on
x86-64, arm64, etc. There were patches posted to make it work on 32-bit
archs too (not landed yet).
All native args are stored one by one into u64 ctx[0], u64 ctx[1], ...
on the stack, and then the bpf prog is called with a single ctx pointer.

For example, for the case of
  struct tcp_congestion_ops {
      void (*set_state)(struct sock *sk, u8 new_state);
  };
the translation of 'sk' into ctx[0] is based on 'struct btf_func_model',
which is discovered from the 'set_state' func prototype as stored in BTF.
So the trampoline for set_state copies %rdi into ctx[0] and %rsi into
ctx[1] and _directly_ calls bpf_cubic_state().
Note arch_prepare_bpf_trampoline() emits ENDBR as the first insn, because
the pointer to this trampoline is directly stored in
'struct tcp_congestion_ops'.
Later, from the TCP stack's point of view, 'icsk_ca_ops' are exactly the
same for built-in cong controls (CCs), kernel modules' CCs and bpf-based
CCs. All calls to struct tcp_congestion_ops callbacks are normal indirect
calls. Different CCs have different struct tcp_congestion_ops with their
own pointers to functions, of course.
There is no ftrace here at all. No .text live patching either.

All is ok until kCFI comes into the picture.
Here we probably need to teach arch_prepare_bpf_trampoline() to emit a
different __kcfi_typeid depending on the kernel function proto, so that
the caller's hash checking logic won't be tripped.
I suspect that requires reverse engineering the algorithm clang uses to
compute the kcfi hash. Other ideas?
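Roughly, for the set_state example above, the trampoline that
arch_prepare_bpf_trampoline() builds looks something like this on x86-64.
This is a hand-written sketch; registers, offsets and labels are
illustrative rather than the exact generated code:

  # kCFI type hash would have to go here: the u32 the caller checks sits
  # right in front of the entry point (exact offset depends on the
  # kCFI/FineIBT configuration, cf. the get_cfi_offset() change below).
  set_state_trampoline:
	endbr64				# its address is stored in
					# tcp_congestion_ops, so it is an
					# indirect-call target
	push	%rbp
	mov	%rsp, %rbp
	sub	$16, %rsp		# room for u64 ctx[2] on the stack
	mov	%rdi, -16(%rbp)		# ctx[0] = sk        (1st native arg)
	mov	%rsi, -8(%rbp)		# ctx[1] = new_state (2nd native arg)
	lea	-16(%rbp), %rdi		# bpf sees a single arg: &ctx[0]
	call	bpf_cubic_state		# direct call into the JITed prog
	leave
	ret

So the open question for kCFI is only about the hash bytes in front of
the entry: they have to match the type the indirect caller
(tcp_set_ca_state() here) was compiled to expect.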
> > The other case:
> >
> > For tools/testing/selftests/bpf/progs/bloom_filter_bench.c we have:
> >
> >   bpf_for_each_map_elem(&array_map, bloom_callback, &data, 0);
> >
> > and here bloom_callback appears like a normal function:
> >
> >   static __u64
> >   bloom_callback(struct bpf_map *map, __u32 *key, void *val,
> >                  struct callback_ctx *data)
> >
> > But what do functions look like in the JIT? What's the actual address
> > that's then passed into the helper function? Given this seems to work
> > without kCFI, it should at least have an ENDBR, but there's only 3 of
> > those afaict:

Right. That is very different from struct_ops/trampoline.
There is no trampoline here at all.
A bpf prog is JITed as normal, but its prototype is technically
bpf_callback_t. We do the same JITing for all progs, both the main entry
prog and subprograms. They are all treated as accepting 5 u64 args and
returning a single u64.
For the main prog the prototype
  bpf_prog_runX(u64 *regs, const struct bpf_insn *insn)
is partially true, since the 2nd arg is never used and the 1st arg is
'ctx'. So from the x86-64 JIT's pov there is no difference whether it's a
single 8-byte arg or five 8-byte args.

In the case of bpf_for_each_map_elem() the 'bloom_callback' is a subprog
of bpf_callback_t type. So the kernel is doing:
  ret = callback_fn((u64)(long)map, (u64)(long)&key,
                    (u64)(long)val, (u64)(long)callback_ctx, 0);
and that works on all archs, including 32-bit. The kernel is doing the
conversion from the native calling convention to the bpf calling
convention, and for lucky archs like x86-64 the conversion is a true nop.
It's a plain indirect call to the JITed bpf prog.
Note there is no interpreter support here. This works on archs with JITs
only. No ftrace and no trampoline.

This case is easier to make work with kCFI.
The JIT will use:
  cfi_bpf_hash:
    .long __kcfi_typeid___bpf_prog_runX
like your patch already does, and will use
  extern u64 __bpf_callback_fn(u64, u64, u64, u64, u64);
  cfi_bpf_subprog_hash:
    .long __kcfi_typeid___bpf_callback_fn
to JIT all subprogs. See bpf_is_subprog() (a rough sketch of that hash
selection is at the end of this mail).

This shouldn't be too difficult to add. We'd need to tweak the verifier
to make sure bpf_for_each_map_elem() and friends never use the main prog
as a callback, which is technically the case today. Just need to add more
guardrails. I can work on this.

btw there are two patchsets in progress that will touch core bits of the
JITs. This one:
https://patchwork.kernel.org/project/netdevbpf/cover/20231201190654.1233153-1-song@xxxxxxxxxx/
and this one:
https://patchwork.kernel.org/project/netdevbpf/cover/20231011152725.95895-1-hffilwlqm@xxxxxxxxx/
So do you mind resending your current set with the get_cfi_offset()
change? I can land it into bpf-next, so we can fix one bug at a time,
build on top, and avoid conflicts. The more we dig, the more it looks
like the follow-up you planned to do on top of this set isn't going to
happen soon. So it should be ok to go through bpf-next, and then you can
follow up with the x86 things after the merge window?
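To spell out the bpf_is_subprog() idea mentioned above, a minimal sketch
of how the x86-64 JIT could pick which hash to emit in front of a JITed
image. cfi_prog_hash() is a made-up helper name, and it assumes
cfi_bpf_hash/cfi_bpf_subprog_hash end up visible to C as u32 symbols
holding the two typeids shown earlier:

/* Sketch only: assumes the two hash symbols above are exported as u32. */
extern u32 cfi_bpf_hash;		/* __kcfi_typeid of bpf_prog_runX()     */
extern u32 cfi_bpf_subprog_hash;	/* __kcfi_typeid of __bpf_callback_fn() */

static u32 cfi_prog_hash(const struct bpf_prog *prog)
{
	/*
	 * Subprogs are indirectly called as bpf_callback_t by helpers like
	 * bpf_for_each_map_elem(); the main prog is entered via the
	 * bpf_prog_runX() prototype.
	 */
	return bpf_is_subprog(prog) ? cfi_bpf_subprog_hash : cfi_bpf_hash;
}

The JIT would then place that value in the kCFI preamble ahead of the
image's ENDBR/prologue, so the indirect call sites in the helpers and in
bpf_prog_runX()-style dispatch both see the hash they expect.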