On Wed, Oct 9, 2024 at 6:09 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Oct 9, 2024 at 12:41 PM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
> >
> > On Tue, 2024-10-08 at 19:12 -0700, Eduard Zingerman wrote:
> > > In [1] syzbot reported an interesting BPF program.
> > > Verification for this program takes a very long time.
> > >
> > > [1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@xxxxxxxxxx/
> > >
> > > The program could be simplified to the following snippet:
> > >
> > >      /* Program type is kprobe */
> > >      r7 = *(u16 *)(r1 +0);
> > >   1: r7 += 0x1ab064b9;
> > >      if r7 & 0x702000 goto 1b;
> > >      r7 &= 0x1ee60e;
> > >      r7 += r1;
> > >      if r7 s> 0x37d2 goto +0;
> > >      r0 = 0;
> > >      exit;
> >
> > Answering a few questions from the off-list discussion with Alexei.
> > The test is not specific to the jset instruction, e.g. the following
> > program exhibits similar behaviour:
> >
> >     SEC("kprobe")
> >     __failure __log_level(4)
> >     __msg("BPF program is too large.")
> >     __naked void short_loop1(void)
> >     {
> >         asm volatile (
> >         "   r7 = *(u16 *)(r1 +0);"
> >         "   r8 = *(u64 *)(r1 +16);"
> >         "1: r7 += 0x1ab064b9;"
> >         "   if r7 < r8 goto 1b;"
> >         "   r7 &= 0x1ee60e;"
> >         "   r7 += r1;"
> >         "   if r7 s> 0x37d2 goto +0;"
> >         "   r0 = 0;"
> >         "   exit;"
> >         ::: __clobber_all);
> >     }
> >
> > > The snippet exhibits the following behaviour depending on
> > > BPF_COMPLEXITY_LIMIT_INSNS:
> > > - at 1,000,000 verification does not finish in 15 minutes;
> > > - at 100,000 verification finishes in 15 seconds;
> > > - at 100 it is possible to get some verifier log.
> >
> > Still investigating why the change in running time is non-linear.
> >
> > [...]
> >
> > > This patch forcibly enables checkpoints for each loop back-edge.
> > > This helps with the programs in question, as verification of both
> > > the syzbot program and the reduced snippet finishes in ~2.5 sec.
> >
> > There is the following code in is_state_visited():
> >
> >     ...
> >     /* if the verifier is processing a loop, avoid adding new state
> >      * too often, since different loop iterations have distinct
> >      * states and may not help future pruning.
> >      * This threshold shouldn't be too low to make sure that
> >      * a loop with large bound will be rejected quickly.
> >      * The most abusive loop will be:
> >      *   r1 += 1
> >      *   if r1 < 1000000 goto pc-2
> >      * 1M insn_processed limit / 100 == 10k peak states.
> >      * This threshold shouldn't be too high either, since states
> >      * at the end of the loop are likely to be useful in pruning.
> >      */
> > skip_inf_loop_check:
> >     if (!env->test_state_freq &&
> >         env->jmps_processed - env->prev_jmps_processed < 20 &&
> >         env->insn_processed - env->prev_insn_processed < 100)
> >             add_new_state = false;
> >     goto miss;
> >     ...
> >
> > This runs into a direct contradiction with what I do in this patch,
> > so either I need to change the patch or this fragment needs adjustment.
>
> Yeah. It's deliberately not saving all states, exactly to avoid:
>
>   loop3.bpf.o while_true  +350359 (+106.88%)  +0 (+0.00%)  +101448 (+1049.86%)
>
> 100k states is a lot of memory. It might cause OOMs.
>
> The pyperf600 improvement is certainly nice though.
>
> It feels better to adjust the 20 && 100 thresholds in the above
> heuristic instead of forcing a saved state everywhere.
>
> Something should be done about:
>
>   71.25%  [k] __mark_chain_precision
>   24.81%  [k] bt_sync_linked_regs
>
> as well.
> The algorithm there needs some tweaks.

If we were to store a bpf_jmp_history_entry for each instruction (and
we can do that efficiently, memory-wise, I had the patch), and then
for each instruction we maintained a list of "input" regs/slots and
the corresponding "output" regs/slots as we simulate each instruction
forward, I think __mark_chain_precision would be much simpler and
thus faster.
We'd basically just walk backwards instruction by instruction, check
whether any of the output regs/slots need to be precise (a few bitmask
intersections), and if yes, set all the input regs/slots as "need
precision", and just continue. I think it's actually a simpler
approach and should be faster. Simpler because it's easy to tell
inputs/outputs while doing forward instruction processing. Faster
because __mark_chain_precision would only do very simple operations
without lots of branching and checks.

We are already halfway there, tbh, with linked registers and
insn_stack_access_flags() (i.e., information that we couldn't
re-derive in the backwards pass). So maybe we should bite the bullet
and go all in?

P.S. And just in general, there is a certain appeal to always having a
complete history up to any current point in time. Feels like this
would be useful in the future for some extra optimizations or checks.

>
> pw-bot: cr
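The proposed backward walk could be sketched roughly as below. This is
a toy model under my own assumptions, not verifier code: the struct
and function names are invented, registers are modeled as one bit each
in a u16, and real state would also have to cover stack slots, linked
registers, and jump targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: during forward simulation, each instruction records which
 * registers it read (inputs) and which it wrote (outputs) as bitmasks,
 * one bit per register (bit N stands for BPF register rN). */
struct toy_hist_entry {
	uint16_t input_regs;	/* regs read by this instruction */
	uint16_t output_regs;	/* regs written by this instruction */
};

/* Backward precision walk: start from a bitmask of registers that must
 * be precise at the end of the recorded history and propagate the
 * requirement to the registers they were computed from. */
static uint16_t toy_mark_precision(const struct toy_hist_entry *hist,
				   size_t n, uint16_t need_precise)
{
	for (size_t i = n; i-- > 0; ) {
		/* Does this insn define any register we need precise? */
		if (hist[i].output_regs & need_precise) {
			/* Its outputs are now explained by its inputs:
			 * drop the outputs, require the inputs instead. */
			need_precise &= ~hist[i].output_regs;
			need_precise |= hist[i].input_regs;
		}
	}
	return need_precise;
}
```

For a sequence like `r2 = r3 + r4; r0 = r2`, asking for r0 to be
precise propagates the requirement back to r3 and r4 in two
constant-time bitmask steps, with no re-decoding of the instructions
during the backward pass.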