On Thu, 2024-10-10 at 16:23 -0700, Alexei Starovoitov wrote:
> On Thu, Oct 10, 2024 at 3:52 PM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
> >
> > On Thu, 2024-10-10 at 15:40 -0700, Eduard Zingerman wrote:
> >
> > > I think this would bring significant speedup.
> > > Not sure it would completely fix the issue at hand,
> > > as mark_chain_precision() walks like 100 instructions back on each
> > > iteration of the loop, but it might be a step in the right direction.
> >
> > In theory, we can do mark_chain_precision() lazily:
> > - mark that certain registers need precision for an instruction;
> > - wait till next checkpoint is created;
> > - do the walk for marked registers.
> >
> > This should be simpler to implement on top of what Andrii suggests.
>
> Indeed it can wait until the next new_state, since
> when the next iteration of the loop starts the sl->state
> are guaranteed to be some older states. Though the verifier may have done
> multiple loop iteration traversals, it's still in the 'cur' state,
> so states_equal(env, &sl->state, cur...) should still be correct.

After spending an absolutely unreasonable amount of time on this I
finally identified the missing piece of the puzzle.

The example program takes so much time to verify because it accumulates
a huge jmp_history array (262145 entries) within a single state and
constantly executes mark_chain_precision() against it.

The reason for such a huge jmp_history is an interplay between two
conditions:

    if (env->jmps_processed - env->prev_jmps_processed >= 2 &&
        env->insn_processed - env->prev_insn_processed >= 8)
            add_new_state = true;

and

    if (!force_new_state &&
        env->jmps_processed - env->prev_jmps_processed < 20 &&
        env->insn_processed - env->prev_insn_processed < 100)
            add_new_state = false;

I try to explain it in the commit message for [1].
If one waits long enough, the verifier eventually finishes with
-ENOMEM, as it fails to allocate a big enough jmp_history array (one
has to wait 15 minutes on master, and about a minute with my patch
[2]; the patch has not changed since our discussion on Wednesday, and
the error message is the same as we discussed then).

The discussed change to lazy precision marks propagation turned out to
be a dud:
- it does not solve the issue at hand, the verifier still exits with
  -ENOMEM, just faster;
- the overall verification speedup for regular programs is not big:

$ ./veristat -e file,prog,duration -f 'duration>100000' -C master-baseline.log current.log
File                      Program           Duration (us) (A)  Duration (us) (B)  Duration (us) (DIFF)
------------------------  ----------------  -----------------  -----------------  --------------------
loop1.bpf.o               nested_loops                 124334             117382        -6952 (-5.59%)
loop3.bpf.o               while_true                   318165             253264      -64901 (-20.40%)
pyperf100.bpf.o           on_event                     704852             692300       -12552 (-1.78%)
pyperf180.bpf.o           on_event                    2304115            2251502       -52613 (-2.28%)
pyperf50.bpf.o            on_event                     183971             184124         +153 (+0.08%)
pyperf600.bpf.o           on_event                    2076073            2051733       -24340 (-1.17%)
pyperf600_nounroll.bpf.o  on_event                     394093             402552        +8459 (+2.15%)
strobemeta.bpf.o          on_event                     169218             169835         +617 (+0.36%)
test_verif_scale1.bpf.o   balancer_ingress             144779             146486        +1707 (+1.18%)
test_verif_scale2.bpf.o   balancer_ingress             147078             151902        +4824 (+3.28%)
test_verif_scale3.bpf.o   balancer_ingress             184959             188723        +3764 (+2.04%)

Here only loop3 demonstrates a significant speedup (I re-ran loop3
several times), but in my opinion that does not warrant the code
change.

[1] https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@xxxxxxxxx/
[2] https://github.com/eddyz87/bpf/tree/lazy-mark-chain-precision