Re: [PATCH bpf-next] bpf, arm64: Support up to 12 function arguments

Florent Revest <revest@xxxxxxxxxxxx> · Thu, 21 Sep 2023 12:09:14 +0200

On Thu, Sep 21, 2023 at 7:21 AM Xu Kuohai <xukuohai@xxxxxxxxxxxxxxx> wrote:
> On 9/21/2023 6:17 AM, Florent Revest wrote:
> > On Sun, Sep 17, 2023 at 5:09 PM Xu Kuohai <xukuohai@xxxxxxxxxxxxxxx> wrote:
> >> Due to bpf only supports function arguments up to 16 bytes, according to
> >> AAPCS64, starting from the first argument, each argument is first
> >> attempted to be loaded to 1 or 2 smallest registers from x0-x7, if there
> >> are no enough registers to hold the entire argument, then all remaining
> >> arguments starting from this one are pushed to the stack for passing.
> >
> > If I read the section 6.8.2 of the AAPCS64 correctly, there is a
> > corner case which I believe isn't covered by this logic.
> >
> > void f(u128 a, u128 b, u128, c, u64 d, u128 e, u64 f) {}
> > - a goes on x0 and x1
> > - b goes on x2 and x3
> > - c goes on x4 and x5
> > - d goes on x6
> > - e spills on the stack because it doesn't fit in the remaining regs
> > - f goes on x7
> >
>
> I guess you might have overlooked rule C.13 in AAPCS64. Non-floating type arguments
> are copied to stack under rule C.15/C.17. However, C.13 is applied before C.15/C.17,
> which means that NGRN is set to 8 before the stack is used. That is, all 8 parameter
> arguments are used up and any remaining arguments can only be passed by the stack.
>
> C.13    The NGRN is set to 8.
>
> C.14    The NSAA is rounded up to the larger of 8 or the Natural Alignment of the
>          argument’s type.
>
> C.15    If the argument is a composite type then the argument is copied to memory
>          at the adjusted NSAA. The NSAA is incremented by the size of the argument.
>          The argument has now been allocated.
>
> C.16    If the size of the argument is less than 8 bytes then the size of the argument
>          is set to 8 bytes. The effect is as if the argument was copied to the least
>          significant bits of a 64-bit register and the remaining bits filled with
>          unspecified values.
>
> C.17    The argument is copied to memory at the adjusted NSAA. The NSAA is incremented
>          by the size of the argument. The argument has now been allocated.
>
>
> And the following assembly code also shows 'e' and 'f' are passed by stack.
>
> int func(__int128 a, __int128 b, __int128 c, int64_t d, __int128 e, int64_t f)
> {
>          return e == 5 || f == 7;
> }
>
> asseembly:
>
> func:
>          sub     sp, sp, #64
>          stp     x0, x1, [sp, 48]
>          stp     x2, x3, [sp, 32]
>          stp     x4, x5, [sp, 16]
>          str     x6, [sp, 8]
>          ldr     x0, [sp, 64] // ** load the low 8 bytes of e from SP + 64 **
>          cmp     x0, 5
>          bne     .L27
>          ldr     x0, [sp, 72] // ** load the high 8 bytes of e from SP + 72 **
>          cmp     x0, 0
>          beq     .L22
> .L27:
>          ldr     x0, [sp, 80] // ** load f from SP + 80 **
>          cmp     x0, 7
>          bne     .L24
> .L22:
>          mov     w0, 1
>          b       .L26
> .L24:
>          mov     w0, 0
> .L26:
>          add     sp, sp, 64
>          ret

Ah, that's great! :) It keeps things easy then. Thank you for the explanation!

> Although the above case is fine, the current patch does not handle rule C.14 correctly.
> For example, for the following func, an 8-byte padding is inserted between f and g by
> C.14 to align g to 16 bytes, but this patch mistakenly treats it as part of g.
>
> int func(__int128 a, __int128 b, __int128 c, int64_t d, __int128 e, int64_t f, __int128 g)
> {
> }
>
> Maybe we could fix it by adding argument alignment to struct btf_func_model, I'll
> give it a try.

Good catch!

> > Maybe it would be good to add something pathological like this to the
> > selftests ?
> >
>
> OK, will do

Just a thought on that topic, maybe it would be preferable to have a
new separate test for this ? Currently we have a test for 128b long
arguments and a test for many arguments, it's good to have these
separate because they are two dimensions of bpf architectural support:
if someone adds support for one xor the other feature to an arch, it's
good to see a test go green. Since that new test would cover both long
and many arguments, it should probably be a new test.

> >> -static void save_args(struct jit_ctx *ctx, int args_off, int nregs)
> >> +struct arg_aux {
> >> +       /* how many args are passed through registers, the rest args are
> >
> > the rest of the* args
> >
> >> +        * passed through stack
> >> +        */
> >> +       int args_in_reg;
> >
> > Maybe args_in_regs ? since args can go in multiple regs
> >
> >> +       /* how many registers used for passing arguments */
> >
> > are* used
> >
> >> +       int regs_for_arg;
> >
> > And here regs_for_args ? Since It's the number of registers used for all args
> >
> >> +       /* how many stack slots used for arguments, each slot is 8 bytes */
> >
> > are* used
> >
> >> +       int stack_slots_for_arg;
> >
> > And here stack_slots_for_args, for the same reason as above?
> >
> >> +};
> >> +
> >> +static void calc_arg_aux(const struct btf_func_model *m,
> >> +                        struct arg_aux *a)
> >>   {
> >>          int i;
> >> +       int nregs;
> >> +       int slots;
> >> +       int stack_slots;
> >> +
> >> +       /* verifier ensures m->nr_args <= MAX_BPF_FUNC_ARGS */
> >> +       for (i = 0, nregs = 0; i < m->nr_args; i++) {
> >> +               slots = (m->arg_size[i] + 7) / 8;
> >> +               if (nregs + slots <= 8) /* passed through register ? */
> >> +                       nregs += slots;
> >> +               else
> >> +                       break;
> >> +       }
> >> +
> >> +       a->args_in_reg = i;
> >> +       a->regs_for_arg = nregs;
> >>
> >> -       for (i = 0; i < nregs; i++) {
> >> -               emit(A64_STR64I(i, A64_SP, args_off), ctx);
> >> -               args_off += 8;
> >> +       /* the rest arguments are passed through stack */
> >> +       for (stack_slots = 0; i < m->nr_args; i++)
> >> +               stack_slots += (m->arg_size[i] + 7) / 8;
> >> +
> >> +       a->stack_slots_for_arg = stack_slots;
> >> +}
> >> +
> >> +static void clear_garbage(struct jit_ctx *ctx, int reg, int effective_bytes)
> >> +{
> >> +       if (effective_bytes) {
> >> +               int garbage_bits = 64 - 8 * effective_bytes;
> >> +#ifdef CONFIG_CPU_BIG_ENDIAN
> >> +               /* garbage bits are at the right end */
> >> +               emit(A64_LSR(1, reg, reg, garbage_bits), ctx);
> >> +               emit(A64_LSL(1, reg, reg, garbage_bits), ctx);
> >> +#else
> >> +               /* garbage bits are at the left end */
> >> +               emit(A64_LSL(1, reg, reg, garbage_bits), ctx);
> >> +               emit(A64_LSR(1, reg, reg, garbage_bits), ctx);
> >> +#endif
> >>          }
> >>   }
> >>
> >> -static void restore_args(struct jit_ctx *ctx, int args_off, int nregs)
> >> +static void save_args(struct jit_ctx *ctx, int bargs_off, int oargs_off,
> >> +                     const struct btf_func_model *m,
> >> +                     const struct arg_aux *a,
> >> +                     bool for_call_origin)
> >>   {
> >>          int i;
> >> +       int reg;
> >> +       int doff;
> >> +       int soff;
> >> +       int slots;
> >> +       u8 tmp = bpf2a64[TMP_REG_1];
> >> +
> >> +       /* store argument registers to stack for call bpf, or restore argument
> >
> > to* call bpf or "for the bpf program"
> >
>
> Sorry for these incorrect words :(, all will be fixed in the next version, thanks!

No problem!

> >>   static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
> >>                                struct bpf_tramp_links *tlinks, void *orig_call,
> >> -                             int nregs, u32 flags)
> >> +                             const struct btf_func_model *m,
> >> +                             const struct arg_aux *a,
> >> +                             u32 flags)
> >>   {
> >>          int i;
> >>          int stack_size;
> >>          int retaddr_off;
> >>          int regs_off;
> >>          int retval_off;
> >> -       int args_off;
> >> +       int bargs_off;
> >>          int nregs_off;
> >>          int ip_off;
> >>          int run_ctx_off;
> >> +       int oargs_off;
> >> +       int nregs;
> >>          struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
> >>          struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
> >>          struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> >> @@ -1859,19 +1951,26 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
> >>           *
> >>           * SP + retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
> >>           *                                        BPF_TRAMP_F_RET_FENTRY_RET
> >> -        *
> >>           *                  [ arg reg N         ]
> >>           *                  [ ...               ]
> >> -        * SP + args_off    [ arg reg 1         ]
> >> +        * SP + bargs_off   [ arg reg 1         ] for bpf
> >>           *
> >>           * SP + nregs_off   [ arg regs count    ]
> >>           *
> >>           * SP + ip_off      [ traced function   ] BPF_TRAMP_F_IP_ARG flag
> >>           *
> >>           * SP + run_ctx_off [ bpf_tramp_run_ctx ]
> >> +        *
> >> +        *                  [ stack arg N       ]
> >> +        *                  [ ...               ]
> >> +        * SP + oargs_off   [ stack arg 1       ] for original func
> >>           */
> >>
> >>          stack_size = 0;
> >> +       oargs_off = stack_size;
> >> +       if (flags & BPF_TRAMP_F_CALL_ORIG)
> >> +               stack_size += 8 * a->stack_slots_for_arg;
> >> +
> >>          run_ctx_off = stack_size;
> >>          /* room for bpf_tramp_run_ctx */
> >>          stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
> >> @@ -1885,9 +1984,10 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
> >>          /* room for args count */
> >>          stack_size += 8;
> >>
> >> -       args_off = stack_size;
> >> +       bargs_off = stack_size;
> >>          /* room for args */
> >> -       stack_size += nregs * 8;
> >> +       nregs = a->regs_for_arg + a->stack_slots_for_arg;
> >
> > Maybe this name no longer makes sense ?
>
> OK, I'll remove it

Ack