Alexei Starovoitov writes: > On Thu, Apr 25, 2019 at 08:25:44AM +0100, Jiong Wang wrote: >> >> Alexei Starovoitov writes: >> >> > On Thu, Apr 25, 2019 at 12:07:06AM +0100, Jiong Wang wrote: >> >> >> >> Alexei Starovoitov writes: >> >> >> >> > Add two tests to check that sequence of 1024 jumps is verifiable. >> >> > >> >> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx> >> >> > --- >> >> > tools/testing/selftests/bpf/test_verifier.c | 70 ++++++++++++++++++++ >> >> > tools/testing/selftests/bpf/verifier/scale.c | 18 +++++ >> >> >> >> I am rebasing 32-bit opt pass on top of latest bpf-next and found these new >> >> tests take more than 20 minutes to run and had not finished after that. >> >> >> >> The reason the following insn filling insde bpf_fill_scale1 is generating >> >> nearly 1M insn whose results are recognized as safe to be poisoned. >> >> >> >> bpf_fill_scale1: >> >> while (i < MAX_TEST_INSNS - 1025) >> >> insn[i++] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 42); >> >> >> >> For each hi32 poisoning, there will be one call to "bpf_patch_insn_data" >> >> which actually is not cheap (adjust jump insns, insn aux info etc). Now, >> >> 1M call to it has exhausted server resources as described, 20minutes running >> >> still not finished. >> >> >> >> For real world applications, we don't do hi32 poisoning, and there isn't much >> >> lo32 zext. Benchmarking those bpf programs inside Cilium shows the final >> >> zext pass adds about 8% ~ 15% verification time. >> >> >> >> The zext pass based on top of "bpf_patch_insn_data" looks more and more is >> >> not the best approach to utilize the read32 analysis results. >> >> >> >> Previously, in v1 cover letter, I listed some of my other thoughts on how to >> >> utilize the liveness analysis results: >> >> >> >> 1 Minor change on back-end JIT hook, also pass aux_insn information to >> >> back-ends so they could have per insn information and they could do >> >> zero extension for the marked insn themselves using the most >> >> efficient native insn. >> >> >> >> 2 Introduce zero extension insn for eBPF. Then verifier could insert >> >> the new zext insn instead of lshift + rshift. zext could be JITed >> >> more efficiently. >> >> >> >> 3 Otherwise JIT back-ends need to do peephole to catch lshift + rshift >> >> and turn them into native zext. >> > >> > all options sounds like hacks to workaround inefficient bpf_patch_insn_data(). >> > Especially option 2 will work only because single insn is replaced >> > with another insn ? >> >> Option 1 should be a generic solution. It is passing verifier analysis >> results generated by insn walk down to JIT back-ends. The information >> passed down could be any analysis result useful for JIT code-gen. >> >> > Let's fix the algo of bpf_patch_insn_data() instead, so that 1 insn -> 2+ insn >> > is also fast. >> >> The issue with 1 insn -> 2+ insn should be calling of bpf_adj_branches >> which is doing another for_each_insn_in_prog traversal, so the zext >> insertion becomes something like: >> >> for_each_insn_in_prog >> ... >> if (zext) >> ... >> for_each_insn_in_prog >> >> which is quadratic. One solution is we chain all branch insns during >> previous insn traversal in for example cfg check, and keep the information >> somewhere info bpf_prog (env->insn_aux_data is a good place to keep such >> information, but insn patch helpers are supposed to work with bpf_prog) >> then bpf_adj_branches could traversal this chain instead of iterating >> through all insns. Thanks very much for the feedbacks. > I don't think it will make it much faster. There could be just as many > jumps as there are instructions. Benchmarked a basic implemention on a couple of bpf programs in Cilium repo (the impl is a relocation list kept in a array built as by-product of check_subprogs. during patch, walk this relocation list instead of all insns. Then calculate time by 10 times run and take average) - The time spent on zero-extension pass is ~30% less - The time spent on convert_ctx_accesses + fixup_bpf_call + fixup_call_args is ~15% less - The time spent on dead code elim pass is not changed which makes sense as dead code is rare so patch infra is not triggered much. So, looks like could help a little bit on daily program, but agree no fundamental improvements, it is still N(insn_needs_patch) * N(branch), and both N could go very big. > Note that bpf_patch_insn_single() is calling bpf_adj_branches() twice too. > And dead_code + convert_ctx + fixup_bpf_calls are calling > bpf_patch_insn_single() a lot. > How about before dead_code pass we convert the program into basic-block > format, patch it all, and then convert from bb back to offsets. > Patching will become very cheap, since no loop over program will be > necessary. A jump from bb-N to bb-M will stay as-is regardless > of amount of patching was done inside each bb. > The loops inside these patching passes will be converted from: > for (i = 0; i < insn_cnt; i++, insn++) > into: > for each bb > for each insn in bb Interesting. If I am understanding correctly, BB then needs to support dynamic insn buffer resize. And after all insn patching finished, all BBs are finalized, we then linearized BBs (in a best order) to generate the final bpf image. > As far as option 1 "also pass aux_insn information to JITs"... > in theory it's fine, but looks like big refactoring to existing code. Will do quick explore, might turn out to be small change. > So if you want to make this bb conversion as step 2 and unblock the > current patch set faster I suggest to go with option 2 "Introduce zero > extension insn". A second think, even zero extension insn introduced, it is inserted after the sub-register write insn, so we are still doing insert *not* replacement, insn_delta inside bpf_patch_insn_single will be 1, so the slow path will always be taken (memmove + bpf_prog_realloc + 2 x bpf_adj_branches). For the 1M-scale test, bpf_patch_insn_single is triggered because of hi32 poisoning, not lo32 zext. So we just need to change MOV32 to MOV64 in the testcase which doesn't break the initial testing purpose of this testcase from my understanding. This then could avoid 1M call to bpf_patch_insn_single and pass the test after 32-bit opt patch set applied. Without this change and with hi32 randomization enabled, scale tests will still hang before insn patch infra improved. @@ -228,7 +228,7 @@ static void bpf_fill_scale1(struct bpf_test *self) - insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42); + insn[i++] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 42); This is change is not to paperover the underlying issue. We now know the existing insn patch infra doesn't scale to million level, so could work on improving it in the next step. At the same time the issue exposed from hi32 poisoning does raise alarm that there could be the same issue for lo32 zext, therefore this patch set doesn't scale if there are lots of insns to be zero extended, even though this may not happen in real world bpf prog. IMHO, avoid using insn patching when possible might always be better. So, if option 1 turns out to also generate clean patch set and introduce small changes, I am going to follow it in the update version. Please let me know if you have different opinion. Regards, Jiong