Re: [PATCH bpf-next 3/3] bpf: convert explored_states to hash table

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 21 May 2019 21:09:03 -0700

On Tue, May 21, 2019 at 7:17 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, May 21, 2019 at 05:55:06PM -0700, Andrii Nakryiko wrote:
> > On Tue, May 21, 2019 at 4:08 PM Alexei Starovoitov <ast@xxxxxxxxxx> wrote:
> > >
> > > All prune points inside a callee bpf function most likely will have
> > > different callsites. For example, if function foo() is called from
> > > two callsites the half of explored states in all prune points in foo()
> > > will be useless for subsequent walking of one of those callsites.
> > > Fortunately explored_states pruning heuristics keeps the number of states
> > > per prune point small, but walking these states is still a waste of cpu
> > > time when the callsite of the current state is different from the callsite
> > > of the explored state.
> > >
> > > To improve pruning logic convert explored_states into hash table and
> > > use simple insn_idx ^ callsite hash to select hash bucket.
> > > This optimization has no effect on programs without bpf2bpf calls
> > > and drastically improves programs with calls.
> > > In the later case it reduces total memory consumption in 1M scale tests
> > > by almost 3 times (peak_states drops from 5752 to 2016).
> > >
> > > Care should be taken when comparing the states for equivalency.
> > > Since the same hash bucket can now contain states with different indices
> > > the insn_idx has to be part of verifier_state and compared.
> > >
> > > Different hash table sizes and different hash functions were explored,
> > > but the results were not significantly better vs this patch.
> > > They can be improved in the future.
> > >
> > > Hit/miss heuristic is not counting index miscompare as a miss.
> > > Otherwise verifier stats become unstable when experimenting
> > > with different hash functions.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > ---
> > >  include/linux/bpf_verifier.h |  1 +
> > >  kernel/bpf/verifier.c        | 23 ++++++++++++++++++-----
> > >  2 files changed, 19 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > > index 02bba09a0ea1..405b502283c5 100644
> > > --- a/include/linux/bpf_verifier.h
> > > +++ b/include/linux/bpf_verifier.h
> > > @@ -187,6 +187,7 @@ struct bpf_func_state {
> > >  struct bpf_verifier_state {
> > >         /* call stack tracking */
> > >         struct bpf_func_state *frame[MAX_CALL_FRAMES];
> > > +       u32 insn_idx;
> > >         u32 curframe;
> > >         u32 active_spin_lock;
> > >         bool speculative;
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 89097a4b1bf3..082f6eefb1c4 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -5435,11 +5435,19 @@ enum {
> > >         BRANCH = 2,
> > >  };
> > >
> > > +static u32 state_htab_size(struct bpf_verifier_env *env)
> >
> > maybe mark it as inline function? it's called pretty heavily.
>
> The kernel convention is no 'inline' in .c

Ah, ok.

>
> > >  static struct bpf_verifier_state_list **explored_state(
> > >                                         struct bpf_verifier_env *env,
> > >                                         int idx)
> > >  {
> > > -       return &env->explored_states[idx];
> > > +       struct bpf_verifier_state *cur = env->cur_state;
> > > +       struct bpf_func_state *state = cur->frame[cur->curframe];
> > > +
> > > +       return &env->explored_states[(idx ^ state->callsite) % state_htab_size(env)];
> >
> > % is slow, see [1] for faster alternative.
> >
> > Alternatively, if you can make sure that hash size is power of two,
> > then multiplicative Fibonacci hashing is preferred ([2]).
> >
> > [1] https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
> > [2] https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/
>
> a % b -> ((u64) a * (u64) b) >> 32 transformation assumes good
> distribution of 'a'. Here it's clearly not the case.
> According to Jakub's analysis the verifier marks every 4th insn
> as prune_point, so this array is only quarter full.
> As an experiment I've tried to shrink the size by three times and
> didn't observe any significant slowdown in verification time,
> but decided to keep it as-is for simplicity.
> For the same reasons I avoided roundup_to_power2.
> I prefer readability vs microptimization.
> The cost of modulo vs multiple alu is a noise
> considering everything the verifier is doing.
>

Fair enough, it's totally a microoptimization.

> > >  }
> > >
> > >  static void init_explored_state(struct bpf_verifier_env *env, int idx)
> > > @@ -6018,7 +6026,8 @@ static void clean_live_states(struct bpf_verifier_env *env, int insn,
> > >
> > >         sl = *explored_state(env, insn);
> > >         while (sl) {
> > > -               if (sl->state.curframe != cur->curframe)
> > > +               if (sl->state.insn_idx != insn ||
> > > +                   sl->state.curframe != cur->curframe)
> > >                         goto next;
> > >                 for (i = 0; i <= cur->curframe; i++)
> > >                         if (sl->state.frame[i]->callsite != cur->frame[i]->callsite)
> > > @@ -6384,6 +6393,9 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
> > >         clean_live_states(env, insn_idx, cur);
> > >
> > >         while (sl) {
> > > +               states_cnt++;
> > > +               if (sl->state.insn_idx != insn_idx)
> > > +                       goto next;
> >
> > Shouldn't this be checked inside states_equal? Or you are trying to
> > avoid function call overhead? If the latter is the case, then you
> > should probably compare curframe as well here?
>
> It's not equivalent.
> Here is what commit log say:

Ah, my bad, forgot about that by the time I got to the code. Might be
a good idea to put this in comments in the code.

>  Hit/miss heuristic is not counting index miscompare as a miss.
>  Otherwise verifier stats become unstable when experimenting
>  with different hash functions.
>
> If insn comparison is done inside states_equal() then
> miss > hit * 3 + 3 heuristic affects 'collisions'.
> The cases where different indices fall into the same bucket.
> And verifier stats fluctuate when hash function or size changes.
>

Yeah, that make sense. I wonder if curframe comparison has similar
effect, states from different frames seem similar to hash collisions
between different instruction states in that regard. Or they are not?