On Wed, Oct 9, 2024 at 6:09 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Oct 9, 2024 at 12:41 PM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
> >
> > On Tue, 2024-10-08 at 19:12 -0700, Eduard Zingerman wrote:
> > > In [1] syzbot reported an interesting BPF program.
> > > Verification for this program takes a very long time.
> > >
> > > [1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@xxxxxxxxxx/
> > >
> > > The program could be simplified to the following snippet:
> > >
> > >      /* Program type is kprobe */
> > >      r7 = *(u16 *)(r1 +0);
> > >   1: r7 += 0x1ab064b9;
> > >      if r7 & 0x702000 goto 1b;
> > >      r7 &= 0x1ee60e;
> > >      r7 += r1;
> > >      if r7 s> 0x37d2 goto +0;
> > >      r0 = 0;
> > >      exit;
> >
> > Answering a few questions from the off-list discussion with Alexei.
> > The test is not specific to the jset instruction, e.g. the following
> > program exhibits similar behaviour:
> >
> >     SEC("kprobe")
> >     __failure __log_level(4)
> >     __msg("BPF program is too large.")
> >     __naked void short_loop1(void)
> >     {
> >         asm volatile (
> >         "   r7 = *(u16 *)(r1 +0);"
> >         "   r8 = *(u64 *)(r1 +16);"
> >         "1: r7 += 0x1ab064b9;"
> >         "   if r7 < r8 goto 1b;"
> >         "   r7 &= 0x1ee60e;"
> >         "   r7 += r1;"
> >         "   if r7 s> 0x37d2 goto +0;"
> >         "   r0 = 0;"
> >         "   exit;"
> >         ::: __clobber_all);
> >     }
> >
> > > The snippet exhibits the following behaviour depending on
> > > BPF_COMPLEXITY_LIMIT_INSNS:
> > > - at 1,000,000 verification does not finish in 15 minutes;
> > > - at 100,000 verification finishes in 15 seconds;
> > > - at 100 it is possible to get some verifier log.
> >
> > Still investigating why the change in running time is non-linear.
> >
> > [...]
> >
> > > This patch forcibly enables checkpoints for each loop back-edge.
> > > This helps with the programs in question, as verification of both
> > > the syzbot program and the reduced snippet finishes in ~2.5 sec.
> >
> > There is the following code in is_state_visited():
> >
> >     ...
> >     /* if the verifier is processing a loop, avoid adding new state
> >      * too often, since different loop iterations have distinct
> >      * states and may not help future pruning.
> >      * This threshold shouldn't be too low to make sure that
> >      * a loop with large bound will be rejected quickly.
> >      * The most abusive loop will be:
> >      *   r1 += 1
> >      *   if r1 < 1000000 goto pc-2
> >      * 1M insn_processed limit / 100 == 10k peak states.
> >      * This threshold shouldn't be too high either, since states
> >      * at the end of the loop are likely to be useful in pruning.
> >      */
> > skip_inf_loop_check:
> >     if (!env->test_state_freq &&
> >         env->jmps_processed - env->prev_jmps_processed < 20 &&
> >         env->insn_processed - env->prev_insn_processed < 100)
> >             add_new_state = false;
> >     goto miss;
> >     ...
> >
> > This runs into a direct contradiction with what I do in this patch,
> > so either I need to change the patch or this fragment needs adjustment.
>
> Yeah. It's deliberately not saving all states, exactly to avoid:
>
>   loop3.bpf.o while_true  +350359 (+106.88%)  +0 (+0.00%)  +101448 (+1049.86%)
>
> 100k states is a lot of memory. It might cause OOMs.
>
> The pyperf600 improvement is certainly nice though.
>
> It feels better to adjust the 20 && 100 thresholds in the above
> heuristic instead of forcing a saved state everywhere.
>
> Something should be done about:
>
>   71.25%  [k] __mark_chain_precision
>   24.81%  [k] bt_sync_linked_regs
>
> as well.
> The algorithm there needs some tweaks.

If we were to store a bpf_jmp_history_entry for each instruction (and
we can do that efficiently, memory-wise, I had the patch), and then
for each instruction we maintained a list of "input" regs/slots and
the corresponding "output" regs/slots as we simulate each instruction
forward, I think __mark_chain_precision would be much simpler and
thus faster.
We'd basically just walk backwards instruction by instruction, check
whether any of the output regs/slots need to be precise (a few bitmask
intersections), and if yes, set all the input regs/slots as "need
precision", and just continue. I think it's actually a simpler
approach and should be faster. Simpler because it's easy to tell
inputs/outputs while doing forward instruction processing. Faster
because __mark_chain_precision would only do very simple operations
without lots of branching and checks.

We are already halfway there, tbh, with linked registers and
insn_stack_access_flags() (i.e., information that we couldn't
re-derive in the backwards pass). So maybe we should bite the bullet
and go all in?

P.S. And just in general, there is a certain appeal to always having a
complete history up to any current point in time. Feels like this
would be useful in the future for some extra optimizations or checks.

>
> pw-bot: cr
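The proposed backward walk could be sketched roughly as below. This is
a toy model under my own assumptions, not verifier code: the struct
and function names are invented, registers are modeled as one bit each
in a u16, and real state would also have to cover stack slots, linked
registers, and jump targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: during forward simulation, each instruction records which
 * registers it read (inputs) and which it wrote (outputs) as bitmasks,
 * one bit per register (bit N stands for BPF register rN). */
struct toy_hist_entry {
	uint16_t input_regs;	/* regs read by this instruction */
	uint16_t output_regs;	/* regs written by this instruction */
};

/* Backward precision walk: start from a bitmask of registers that must
 * be precise at the end of the recorded history and propagate the
 * requirement to the registers they were computed from. */
static uint16_t toy_mark_precision(const struct toy_hist_entry *hist,
				   size_t n, uint16_t need_precise)
{
	for (size_t i = n; i-- > 0; ) {
		/* Does this insn define any register we need precise? */
		if (hist[i].output_regs & need_precise) {
			/* Its outputs are now explained by its inputs:
			 * drop the outputs, require the inputs instead. */
			need_precise &= ~hist[i].output_regs;
			need_precise |= hist[i].input_regs;
		}
	}
	return need_precise;
}
```

For a sequence like `r2 = r3 + r4; r0 = r2`, asking for r0 to be
precise propagates the requirement back to r3 and r4 in two
constant-time bitmask steps, with no re-decoding of the instructions
during the backward pass.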