On Mon, 3 Mar 2025 at 23:25, Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> On Mon, 3 Mar 2025 at 22:59, Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Sun, Mar 2, 2025 at 12:13 PM Kumar Kartikeya Dwivedi
> > <memxor@xxxxxxxxx> wrote:
> > >
> > > Implement support in the verifier for replacing the may_goto
> > > implementation, moving from a counter-based approach to one which
> > > samples time on the local CPU to allow a much bigger loop bound.
> > >
> > > We implement it by maintaining 16 bytes per stack frame: 8 bytes
> > > hold the count used to amortize time sampling, and 8 bytes hold the
> > > starting timestamp. To minimize overhead, we need to avoid spilling
> > > and filling of registers around this sequence, so we push this cost
> > > into the time sampling function 'arch_bpf_timed_may_goto'. This is
> > > a JIT-specific wrapper around bpf_check_timed_may_goto which
> > > returns the count to store into the stack through BPF_REG_AX. All
> > > caller-saved registers (r0-r5) are guaranteed to remain untouched.
> > >
> > > The loop is broken when the count is returned as 0; otherwise, we
> > > dispatch into the function when the count becomes 1, and the
> > > runtime chooses to either refresh it (by returning
> > > BPF_MAX_TIMED_LOOPS) or return 0 and abort the loop.
> > >
> > > Since the check for 0 is done right after loading the count from
> > > the stack, all subsequent cond_break sequences should immediately
> > > break as well.
> > >
> > > We pass the stack_depth of the count (and thus of the timestamp, by
> > > adding 8 to it) to the arch_bpf_timed_may_goto call so that it can
> > > be passed to bpf_check_timed_may_goto as an argument after r1 is
> > > saved, by adding the offset to r10/fp. This adjustment is
> > > arch-specific, and the next patch introduces support for x86.
> > >
> > > Note that depending on loop complexity, the time spent in the loop
> > > can exceed the current limit (250 ms), but imposing an upper bound
> > > on program runtime is an orthogonal problem which will be addressed
> > > when program cancellations are supported.
> > >
> > > The time currently afforded by cond_break may not be enough for
> > > cases where BPF programs want to implement locking algorithms
> > > inline, and use cond_break as a promise to the verifier that they
> > > will eventually terminate.
> > >
> > > Below are some benchmarking numbers for the time taken per
> > > iteration of an empty loop that counts the number of iterations
> > > until cond_break fires. For comparison, we also measure
> > > bpf_for/bpf_repeat, which is another way to achieve the same number
> > > of spins (BPF_MAX_LOOPS). The hardware used for benchmarking was a
> > > Sapphire Rapids Intel server with the performance governor enabled.
> > >
> > > +-----------------------------+--------------+--------------+------------------+
> > > | Loop type                   | Iterations   | Time (ms)    | Time/iter (ns)   |
> > > +-----------------------------+--------------+--------------+------------------+
> > > | may_goto                    | 8388608      | 3            | 0.36             |
> > > | timed_may_goto (count=65535)| 589674932    | 250          | 0.42             |
> > > | bpf_for                     | 8388608      | 10           | 1.19             |
> > > +-----------------------------+--------------+--------------+------------------+
> > >
> > > This gives a good approximation at low overhead while staying close
> > > to the current implementation.
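To make the inlined sequence easier to follow before diving into the
diff below: per cond_break check, it behaves roughly like this C sketch.
It only illustrates the semantics in the commit message; the slow-path
function is the real one from the patch, but the wrapper function name
and the explicit loads/stores stand in for what the emitted BPF
instructions do with BPF_REG_AX and the per-frame stack slot:

	/* Sketch: one inlined timed cond_break check. Returns true
	 * when the loop must be broken.
	 */
	static bool timed_cond_break_should_break(struct bpf_timed_may_goto *p)
	{
		u64 count = p->count;	/* slot at r10 - stack_depth - 16 */

		if (count == 0)		/* this frame's slice already expired */
			return true;
		count -= 1;
		if (count == 1)
			/* Slow path, hit once the amortization window is
			 * used up: returns BPF_MAX_TIMED_LOOPS to refresh,
			 * or 0 once the 250ms slice is exhausted. A stored
			 * 0 makes the *next* check break immediately.
			 */
			count = bpf_check_timed_may_goto(p);
		p->count = count;
		return false;		/* take another loop iteration */
	}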
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> > > ---
> > >  include/linux/bpf.h    |  1 +
> > >  include/linux/filter.h |  8 +++++++
> > >  kernel/bpf/core.c      | 31 +++++++++++++++++++++++++
> > >  kernel/bpf/verifier.c  | 52 +++++++++++++++++++++++++++++++++++-------
> > >  4 files changed, 84 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index aec102868b93..788f6ca374e9 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1986,6 +1986,7 @@ struct bpf_array {
> > >   */
> > >  enum {
> > >  	BPF_MAX_LOOPS = 8 * 1024 * 1024,
> > > +	BPF_MAX_TIMED_LOOPS = 0xffff,
> > >  };
> > >
> > >  #define BPF_F_ACCESS_MASK	(BPF_F_RDONLY |	\
> > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > index 3ed6eb9e7c73..02dda5c53d91 100644
> > > --- a/include/linux/filter.h
> > > +++ b/include/linux/filter.h
> > > @@ -669,6 +669,11 @@ struct bpf_prog_stats {
> > >  	struct u64_stats_sync syncp;
> > >  } __aligned(2 * sizeof(u64));
> > >
> > > +struct bpf_timed_may_goto {
> > > +	u64 count;
> > > +	u64 timestamp;
> > > +};
> > > +
> > >  struct sk_filter {
> > >  	refcount_t refcnt;
> > >  	struct rcu_head rcu;
> > > @@ -1130,8 +1135,11 @@ bool bpf_jit_supports_ptr_xchg(void);
> > >  bool bpf_jit_supports_arena(void);
> > >  bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
> > >  bool bpf_jit_supports_private_stack(void);
> > > +bool bpf_jit_supports_timed_may_goto(void);
> > >  u64 bpf_arch_uaddress_limit(void);
> > >  void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
> > > +u64 arch_bpf_timed_may_goto(void);
> > > +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *);
> > >  bool bpf_helper_changes_pkt_data(enum bpf_func_id func_id);
> > >
> > >  static inline bool bpf_dump_raw_ok(const struct cred *cred)
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index a0200fbbace9..b3f7c7bd08d3 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -3069,6 +3069,37 @@ void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp,
> > >  {
> > >  }
> > >
> > > +bool __weak bpf_jit_supports_timed_may_goto(void)
> > > +{
> > > +	return false;
> > > +}
> > > +
> > > +u64 __weak arch_bpf_timed_may_goto(void)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
> > > +{
> > > +	u64 time = ktime_get_mono_fast_ns();
> >
> > let's move the call after !p->count check to avoid unused work.
>
> Ok.
>
> > > +
> > > +	/* If the count is zero, we've already broken a prior loop in this stack
> > > +	 * frame, let's just exit quickly.
> > > +	 */
> >
> > Let's use normal kernel comment style in all new code.
> > I think even netdev folks allow both styles now.
>
> Ack.
>
> > > +	if (!p->count)
> > > +		return 0;
> > > +	/* Populate the timestamp for this stack frame. */
> > > +	if (!p->timestamp) {
> > > +		p->timestamp = time;
> > > +		return BPF_MAX_TIMED_LOOPS;
> > > +	}
> > > +	/* Check if we've exhausted our time slice. */
> > > +	if (time - p->timestamp >= (NSEC_PER_SEC / 4))
> > > +		return 0;
> > > +	/* Refresh the count for the stack frame. */
> > > +	return BPF_MAX_TIMED_LOOPS;
> > > +}
> > > +
> > >  /* for configs without MMU or 32-bit */
> > >  __weak const struct bpf_map_ops arena_map_ops;
> > >  __weak u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
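Folding both comments above into the function (sample the clock only
when it is actually needed, and use the normal kernel multi-line comment
style), it would look roughly like the sketch below. This is just where
the discussion stands at this point, not the final respin; as it turns
out further down, the !p->count check ends up being dropped entirely:

	u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
	{
		u64 time;

		/*
		 * If the count is zero, we've already broken a prior loop
		 * in this stack frame, so exit quickly.
		 */
		if (!p->count)
			return 0;
		time = ktime_get_mono_fast_ns();
		/* Populate the timestamp for this stack frame. */
		if (!p->timestamp) {
			p->timestamp = time;
			return BPF_MAX_TIMED_LOOPS;
		}
		/* Check if we've exhausted our time slice. */
		if (time - p->timestamp >= (NSEC_PER_SEC / 4))
			return 0;
		/* Refresh the count for the stack frame. */
		return BPF_MAX_TIMED_LOOPS;
	}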
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index dcd0da4e62fc..79bfb1932f40 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -21503,7 +21503,34 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > >  			goto next_insn;
> > >  		}
> > >
> > > -		if (is_may_goto_insn(insn)) {
> > > +		if (is_may_goto_insn(insn) && bpf_jit_supports_timed_may_goto()) {
> > > +			int stack_off_cnt = -stack_depth - 16;
> > > +
> > > +			/* Two 8-byte slots: depth-16 stores the count, and
> > > +			 * depth-8 stores the start timestamp of the loop.
> > > +			 */
> > > +			stack_depth_extra = 16;
> > > +			insn_buf[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_AX, BPF_REG_10, stack_off_cnt);
> > > +			if (insn->off >= 0)
> > > +				insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off + 5);
> > > +			else
> > > +				insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off - 1);
> > > +			insn_buf[2] = BPF_ALU64_IMM(BPF_SUB, BPF_REG_AX, 1);
> > > +			insn_buf[3] = BPF_JMP_IMM(BPF_JNE, BPF_REG_AX, 1, 2);
> >
> > Maybe != 0 instead?
> > Otherwise it's off by 1.
>
> We'll never do the sub with AX=0, we'll always break out in that case.
> It starts at 0xffff, so when it's 2 in the stack and becomes 1 after
> the subtraction, we will do the call.
> This resets it to 0xffff or 0.
>
> But it's late and I could be missing something.

Which means we will never see p->count as 0 in bpf_check_timed_may_goto,
since the first inlined condition (checking AX against 0) will already
have broken out of the loop, so I will just remove that check from the
function.

> > > +			insn_buf[4] = BPF_MOV64_IMM(BPF_REG_AX, stack_off_cnt);
> >
> > Please add a comment that BPF_REG_AX is used as an argument
> > register and contains the return value too.
>
> Ok, will do.
>
> > I looked at a couple other non-x86 JITs and I think this calling
> > convention should work for them too.
>
> > > +			insn_buf[5] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_CALL_IMM(arch_bpf_timed_may_goto));
> >
> > Use BPF_EMIT_CALL() instead?
>
> Ack.
>
> [...]
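To spell out the count transitions being discussed (an illustrative
trace, assuming a refresh value of 0xffff and a store of BPF_REG_AX back
to the count slot after the call, which is elided in the quoted hunk):

	/*
	 * in stack   after sub   action
	 * --------   ---------   ------
	 * 0xffff     0xfffe      store, keep looping
	 * ...        ...         store, keep looping
	 * 2          1           call arch_bpf_timed_may_goto(), store
	 *                        0xffff (refresh) or 0 (slice expired)
	 * 0          -           the JEQ-0 right after the load fires
	 *                        before the sub, so the loop breaks;
	 *                        the count is never decremented from 0
	 *                        and bpf_check_timed_may_goto never
	 *                        sees p->count == 0
	 */

And the BPF_EMIT_CALL() change would presumably just be the mechanical:

	insn_buf[5] = BPF_EMIT_CALL(arch_bpf_timed_may_goto);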