Re: 32-bit zext time complexity (Was Re: [PATCH bpf-next] selftests/bpf: two scale tests)

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Thu, 25 Apr 2019 15:10:24 -0700

On Thu, Apr 25, 2019 at 08:25:44AM +0100, Jiong Wang wrote:
> 
> Alexei Starovoitov writes:
> 
> > On Thu, Apr 25, 2019 at 12:07:06AM +0100, Jiong Wang wrote:
> >> 
> >> Alexei Starovoitov writes:
> >> 
> >> > Add two tests to check that sequence of 1024 jumps is verifiable.
> >> >
> >> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> >> > ---
> >> >  tools/testing/selftests/bpf/test_verifier.c  | 70 ++++++++++++++++++++
> >> >  tools/testing/selftests/bpf/verifier/scale.c | 18 +++++
> >> 
> >> I am rebasing 32-bit opt pass on top of latest bpf-next and found these new
> >> tests take more than 20 minutes to run and had not finished after that.
> >> 
> >> The reason the following insn filling insde bpf_fill_scale1 is generating
> >> nearly 1M insn whose results are recognized as safe to be poisoned.
> >> 
> >>   bpf_fill_scale1:
> >>     while (i < MAX_TEST_INSNS - 1025)
> >>       insn[i++] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 42);
> >> 
> >> For each hi32 poisoning, there will be one call to "bpf_patch_insn_data"
> >> which actually is not cheap (adjust jump insns, insn aux info etc). Now,
> >> 1M call to it has exhausted server resources as described, 20minutes running
> >> still not finished.
> >> 
> >> For real world applications, we don't do hi32 poisoning, and there isn't much
> >> lo32 zext. Benchmarking those bpf programs inside Cilium shows the final
> >> zext pass adds about 8% ~ 15% verification time.
> >> 
> >> The zext pass based on top of "bpf_patch_insn_data" looks more and more is
> >> not the best approach to utilize the read32 analysis results.
> >> 
> >> Previously, in v1 cover letter, I listed some of my other thoughts on how to
> >> utilize the liveness analysis results:
> >> 
> >>    1 Minor change on back-end JIT hook, also pass aux_insn information to
> >>      back-ends so they could have per insn information and they could do
> >>      zero extension for the marked insn themselves using the most
> >>      efficient native insn.
> >> 
> >>    2 Introduce zero extension insn for eBPF. Then verifier could insert
> >>      the new zext insn instead of lshift + rshift. zext could be JITed
> >>      more efficiently.
> >> 
> >>    3 Otherwise JIT back-ends need to do peephole to catch lshift + rshift
> >>      and turn them into native zext.
> >
> > all options sounds like hacks to workaround inefficient bpf_patch_insn_data().
> > Especially option 2 will work only because single insn is replaced
> > with another insn ?
> 
> Option 1 should be a generic solution. It is passing verifier analysis
> results generated by insn walk down to JIT back-ends. The information
> passed down could be any analysis result useful for JIT code-gen.
> 
> > Let's fix the algo of bpf_patch_insn_data() instead, so that 1 insn -> 2+ insn
> > is also fast.
> 
> The issue with 1 insn -> 2+ insn should be calling of bpf_adj_branches
> which is doing another for_each_insn_in_prog traversal, so the zext
> insertion becomes something like:
> 
>   for_each_insn_in_prog
>   ...
>      if (zext)
>      ...
>        for_each_insn_in_prog
> 
> which is quadratic. One solution is we chain all branch insns during
> previous insn traversal in for example cfg check, and keep the information
> somewhere info bpf_prog (env->insn_aux_data is a good place to keep such
> information, but insn patch helpers are supposed to work with bpf_prog)
> then bpf_adj_branches could traversal this chain instead of iterating
> through all insns.

I don't think it will make it much faster. There could be just as many
jumps as there are instructions.
Note that bpf_patch_insn_single() is calling bpf_adj_branches() twice too.
And dead_code + convert_ctx + fixup_bpf_calls are calling
bpf_patch_insn_single() a lot.
How about before dead_code pass we convert the program into basic-block
format, patch it all, and then convert from bb back to offsets.
Patching will become very cheap, since no loop over program will be
necessary. A jump from bb-N to bb-M will stay as-is regardless
of amount of patching was done inside each bb.
The loops inside these patching passes will be converted from:
for (i = 0; i < insn_cnt; i++, insn++)
into:
for each bb
  for each insn in bb

As far as option 1 "also pass aux_insn information to JITs"...
in theory it's fine, but looks like big refactoring to existing code.
So if you want to make this bb conversion as step 2 and unblock the
current patch set faster I suggest to go with option 2 "Introduce zero extension insn".