On Thu, Sep 21, 2023 at 7:21 AM Xu Kuohai <xukuohai@xxxxxxxxxxxxxxx> wrote: > On 9/21/2023 6:17 AM, Florent Revest wrote: > > On Sun, Sep 17, 2023 at 5:09 PM Xu Kuohai <xukuohai@xxxxxxxxxxxxxxx> wrote: > >> Due to bpf only supports function arguments up to 16 bytes, according to > >> AAPCS64, starting from the first argument, each argument is first > >> attempted to be loaded to 1 or 2 smallest registers from x0-x7, if there > >> are no enough registers to hold the entire argument, then all remaining > >> arguments starting from this one are pushed to the stack for passing. > > > > If I read the section 6.8.2 of the AAPCS64 correctly, there is a > > corner case which I believe isn't covered by this logic. > > > > void f(u128 a, u128 b, u128, c, u64 d, u128 e, u64 f) {} > > - a goes on x0 and x1 > > - b goes on x2 and x3 > > - c goes on x4 and x5 > > - d goes on x6 > > - e spills on the stack because it doesn't fit in the remaining regs > > - f goes on x7 > > > > I guess you might have overlooked rule C.13 in AAPCS64. Non-floating type arguments > are copied to stack under rule C.15/C.17. However, C.13 is applied before C.15/C.17, > which means that NGRN is set to 8 before the stack is used. That is, all 8 parameter > arguments are used up and any remaining arguments can only be passed by the stack. > > C.13 The NGRN is set to 8. > > C.14 The NSAA is rounded up to the larger of 8 or the Natural Alignment of the > argument’s type. > > C.15 If the argument is a composite type then the argument is copied to memory > at the adjusted NSAA. The NSAA is incremented by the size of the argument. > The argument has now been allocated. > > C.16 If the size of the argument is less than 8 bytes then the size of the argument > is set to 8 bytes. The effect is as if the argument was copied to the least > significant bits of a 64-bit register and the remaining bits filled with > unspecified values. > > C.17 The argument is copied to memory at the adjusted NSAA. The NSAA is incremented > by the size of the argument. The argument has now been allocated. > > > And the following assembly code also shows 'e' and 'f' are passed by stack. > > int func(__int128 a, __int128 b, __int128 c, int64_t d, __int128 e, int64_t f) > { > return e == 5 || f == 7; > } > > asseembly: > > func: > sub sp, sp, #64 > stp x0, x1, [sp, 48] > stp x2, x3, [sp, 32] > stp x4, x5, [sp, 16] > str x6, [sp, 8] > ldr x0, [sp, 64] // ** load the low 8 bytes of e from SP + 64 ** > cmp x0, 5 > bne .L27 > ldr x0, [sp, 72] // ** load the high 8 bytes of e from SP + 72 ** > cmp x0, 0 > beq .L22 > .L27: > ldr x0, [sp, 80] // ** load f from SP + 80 ** > cmp x0, 7 > bne .L24 > .L22: > mov w0, 1 > b .L26 > .L24: > mov w0, 0 > .L26: > add sp, sp, 64 > ret Ah, that's great! :) It keeps things easy then. Thank you for the explanation! > Although the above case is fine, the current patch does not handle rule C.14 correctly. > For example, for the following func, an 8-byte padding is inserted between f and g by > C.14 to align g to 16 bytes, but this patch mistakenly treats it as part of g. > > int func(__int128 a, __int128 b, __int128 c, int64_t d, __int128 e, int64_t f, __int128 g) > { > } > > Maybe we could fix it by adding argument alignment to struct btf_func_model, I'll > give it a try. Good catch! > > Maybe it would be good to add something pathological like this to the > > selftests ? > > > > OK, will do Just a thought on that topic, maybe it would be preferable to have a new separate test for this ? Currently we have a test for 128b long arguments and a test for many arguments, it's good to have these separate because they are two dimensions of bpf architectural support: if someone adds support for one xor the other feature to an arch, it's good to see a test go green. Since that new test would cover both long and many arguments, it should probably be a new test. > >> -static void save_args(struct jit_ctx *ctx, int args_off, int nregs) > >> +struct arg_aux { > >> + /* how many args are passed through registers, the rest args are > > > > the rest of the* args > > > >> + * passed through stack > >> + */ > >> + int args_in_reg; > > > > Maybe args_in_regs ? since args can go in multiple regs > > > >> + /* how many registers used for passing arguments */ > > > > are* used > > > >> + int regs_for_arg; > > > > And here regs_for_args ? Since It's the number of registers used for all args > > > >> + /* how many stack slots used for arguments, each slot is 8 bytes */ > > > > are* used > > > >> + int stack_slots_for_arg; > > > > And here stack_slots_for_args, for the same reason as above? > > > >> +}; > >> + > >> +static void calc_arg_aux(const struct btf_func_model *m, > >> + struct arg_aux *a) > >> { > >> int i; > >> + int nregs; > >> + int slots; > >> + int stack_slots; > >> + > >> + /* verifier ensures m->nr_args <= MAX_BPF_FUNC_ARGS */ > >> + for (i = 0, nregs = 0; i < m->nr_args; i++) { > >> + slots = (m->arg_size[i] + 7) / 8; > >> + if (nregs + slots <= 8) /* passed through register ? */ > >> + nregs += slots; > >> + else > >> + break; > >> + } > >> + > >> + a->args_in_reg = i; > >> + a->regs_for_arg = nregs; > >> > >> - for (i = 0; i < nregs; i++) { > >> - emit(A64_STR64I(i, A64_SP, args_off), ctx); > >> - args_off += 8; > >> + /* the rest arguments are passed through stack */ > >> + for (stack_slots = 0; i < m->nr_args; i++) > >> + stack_slots += (m->arg_size[i] + 7) / 8; > >> + > >> + a->stack_slots_for_arg = stack_slots; > >> +} > >> + > >> +static void clear_garbage(struct jit_ctx *ctx, int reg, int effective_bytes) > >> +{ > >> + if (effective_bytes) { > >> + int garbage_bits = 64 - 8 * effective_bytes; > >> +#ifdef CONFIG_CPU_BIG_ENDIAN > >> + /* garbage bits are at the right end */ > >> + emit(A64_LSR(1, reg, reg, garbage_bits), ctx); > >> + emit(A64_LSL(1, reg, reg, garbage_bits), ctx); > >> +#else > >> + /* garbage bits are at the left end */ > >> + emit(A64_LSL(1, reg, reg, garbage_bits), ctx); > >> + emit(A64_LSR(1, reg, reg, garbage_bits), ctx); > >> +#endif > >> } > >> } > >> > >> -static void restore_args(struct jit_ctx *ctx, int args_off, int nregs) > >> +static void save_args(struct jit_ctx *ctx, int bargs_off, int oargs_off, > >> + const struct btf_func_model *m, > >> + const struct arg_aux *a, > >> + bool for_call_origin) > >> { > >> int i; > >> + int reg; > >> + int doff; > >> + int soff; > >> + int slots; > >> + u8 tmp = bpf2a64[TMP_REG_1]; > >> + > >> + /* store argument registers to stack for call bpf, or restore argument > > > > to* call bpf or "for the bpf program" > > > > Sorry for these incorrect words :(, all will be fixed in the next version, thanks! No problem! > >> static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im, > >> struct bpf_tramp_links *tlinks, void *orig_call, > >> - int nregs, u32 flags) > >> + const struct btf_func_model *m, > >> + const struct arg_aux *a, > >> + u32 flags) > >> { > >> int i; > >> int stack_size; > >> int retaddr_off; > >> int regs_off; > >> int retval_off; > >> - int args_off; > >> + int bargs_off; > >> int nregs_off; > >> int ip_off; > >> int run_ctx_off; > >> + int oargs_off; > >> + int nregs; > >> struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY]; > >> struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT]; > >> struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN]; > >> @@ -1859,19 +1951,26 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im, > >> * > >> * SP + retval_off [ return value ] BPF_TRAMP_F_CALL_ORIG or > >> * BPF_TRAMP_F_RET_FENTRY_RET > >> - * > >> * [ arg reg N ] > >> * [ ... ] > >> - * SP + args_off [ arg reg 1 ] > >> + * SP + bargs_off [ arg reg 1 ] for bpf > >> * > >> * SP + nregs_off [ arg regs count ] > >> * > >> * SP + ip_off [ traced function ] BPF_TRAMP_F_IP_ARG flag > >> * > >> * SP + run_ctx_off [ bpf_tramp_run_ctx ] > >> + * > >> + * [ stack arg N ] > >> + * [ ... ] > >> + * SP + oargs_off [ stack arg 1 ] for original func > >> */ > >> > >> stack_size = 0; > >> + oargs_off = stack_size; > >> + if (flags & BPF_TRAMP_F_CALL_ORIG) > >> + stack_size += 8 * a->stack_slots_for_arg; > >> + > >> run_ctx_off = stack_size; > >> /* room for bpf_tramp_run_ctx */ > >> stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8); > >> @@ -1885,9 +1984,10 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im, > >> /* room for args count */ > >> stack_size += 8; > >> > >> - args_off = stack_size; > >> + bargs_off = stack_size; > >> /* room for args */ > >> - stack_size += nregs * 8; > >> + nregs = a->regs_for_arg + a->stack_slots_for_arg; > > > > Maybe this name no longer makes sense ? > > OK, I'll remove it Ack