On Fri, Dec 04, 2020 at 11:42:13AM +0800, Gary Lin wrote:
> On Thu, Dec 03, 2020 at 10:14:31AM -0800, Alexei Starovoitov wrote:
> > On Thu, Dec 03, 2020 at 12:20:38PM +0100, Eric Dumazet wrote:
> > >
> > > On 12/3/20 10:12 AM, Gary Lin wrote:
> > > > The x64 bpf jit expects bpf images to converge within the given number
> > > > of passes, but it can fail to do so in some corner cases. For example:
> > > >
> > > >   l0:	ldh [4]
> > > >   l1:	jeq #0x537d, l2, l40
> > > >   l2:	ld [0]
> > > >   l3:	jeq #0xfa163e0d, l4, l40
> > > >   l4:	ldh [12]
> > > >   l5:	ldx #0xe
> > > >   l6:	jeq #0x86dd, l41, l7
> > > >   l8:	ld [x+16]
> > > >   l9:	ja 41
> > > >
> > > >   [... repeated ja 41 ]
> > > >
> > > >   l40:	ja 41
> > > >   l41:	ret #0
> > > >   l42:	ld #len
> > > >   l43:	ret a
> > > >
> > > > This bpf program contains 32 "ja 41" instructions, which are effectively
> > > > NOPs and designed to be replaced with valid code dynamically. Ideally,
> > > > the bpf jit should optimize those "ja 41" instructions out when
> > > > translating the bpf instructions into x86_64 machine code. However,
> > > > do_jit() can only remove one "ja 41" for offset==0 on each pass, so it
> > > > requires at least 32 passes to eliminate those JMPs, which exceeds the
> > > > current limit of 20 passes. In the end, the program gets rejected when
> > > > BPF_JIT_ALWAYS_ON is set, even though it is a legitimate classic socket
> > > > filter.
> > > >
> > > > Since these kinds of programs are usually handcrafted rather than
> > > > generated by LLVM, they tend to be small. To avoid increasing the
> > > > complexity of the BPF JIT, this commit just bumps the number of passes
> > > > to 64, as suggested by Daniel, to make such failures less likely.
> > >
> > > Another idea would be to stop trying to reduce the size of the generated
> > > code after a given number of passes have been attempted, because even a
> > > limit of 64 won't ensure that all 'valid' programs can be JITed.
> >
> > +1.
> > Bumping the limit is not solving anything. It only allows bad actors to
> > force the kernel to spend more time in the JIT. If we're holding locks,
> > the longer looping may cause issues. I think the JIT is parallel enough,
> > but it's still a concern.
> >
> > I wonder how assemblers deal with it? They probably face the same issue.
> >
> > Instead of going back to 32-bit jumps and suddenly increasing the image
> > size, I think we can do NOP padding instead. After a few passes, every
> > insn is more or less optimal. I think the fix could be something like:
> >
> > if (is_imm8(jmp_offset)) {
> >         EMIT2(jmp_cond, jmp_offset);
> >         if (loop_cnt > 5) {
> >                 EMIT N nops
> >                 where N = addrs[i] - addrs[i - 1]; // not sure about this math.
> >                 N can be 0 or 4 here.
> >                 // or maybe NOPs should be emitted before EMIT2.
> >                 // need to think it through
> >         }
> > }
>
> This looks promising. Once we switch to NOP padding, the image is likely
> to converge soon. Maybe we can postpone the padding to the last 5 passes
> so that do_jit() can still optimize the image a bit more before that.
>
> > Will something like this work?
>
> I think that's what you're suggesting, right?
>
> Besides NOP padding, the optimization for 0-offset jumps also has to be
> disabled, since it's actually the one causing the image to shrink in my
> case.
>
> Gary Lin
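To see why the unpadded image needs roughly one pass per "ja 41", here is
a small userspace simulation of the sizing loop. This is only a sketch of
the mechanism, not kernel code: it models every insn as one of the 32
jumps, decides convergence per insn rather than by total image length, and
pads a shrunken jump to exactly its previous size, whereas the testing
patch below always emits four NOPs.

#include <stdio.h>
#include <string.h>

#define NJMP	32	/* the 32 "ja 41" instructions, all jumping to the end */

/*
 * Size of one unconditional jump: 0 bytes if the zero-offset jump is
 * optimized out, 2 bytes (EB rel8) if the offset fits in imm8, else
 * 5 bytes (E9 rel32). With padding on, never emit less than last pass.
 */
static int jmp_size(int off, int padding, int prev_size)
{
	int sz;

	if (off == 0 && !padding)
		sz = 0;
	else if (off >= -128 && off <= 127)
		sz = 2;
	else
		sz = 5;
	if (padding && sz < prev_size)
		sz = prev_size;	/* simplification: pad to exactly the old size */
	return sz;
}

/* Re-run the sizing pass until no instruction changes size. */
static int passes_to_converge(int first_pad_pass)
{
	int size[NJMP], addr[NJMP], pass, i;

	for (i = 0; i < NJMP; i++)
		size[i] = 64;	/* initial per-insn estimate, as in the kernel */

	for (pass = 1; ; pass++) {
		int padding = pass >= first_pad_pass;
		int new_size[NJMP], off, changed = 0;

		for (i = 0; i < NJMP; i++)	/* addr[i] = end offset of insn i */
			addr[i] = (i ? addr[i - 1] : 0) + size[i];
		for (i = 0; i < NJMP; i++) {
			off = addr[NJMP - 1] - addr[i];	/* distance to the target */
			new_size[i] = jmp_size(off, padding, size[i]);
			changed |= new_size[i] != size[i];
		}
		memcpy(size, new_size, sizeof(size));
		if (!changed)
			return pass;
	}
}

int main(void)
{
	printf("passes without padding:       %d\n", passes_to_converge(1000));
	printf("passes, padding from pass 15: %d\n", passes_to_converge(15));
	return 0;
}

In this toy model the unpadded variant needs 33 passes, since only the
last remaining zero-offset jump gets dropped each time, so it blows
through the 20-pass limit, while the padded variant settles one pass
after padding kicks in.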
Here is my testing patch. My sample program got accepted with this patch.
I haven't done further testing yet, though.

---
 arch/x86/net/bpf_jit_comp.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 796506dcfc42..6a39c5ba6383 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -790,7 +790,7 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
 }
 
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
-		  int oldproglen, struct jit_context *ctx)
+		  int oldproglen, struct jit_context *ctx, bool ja_padding)
 {
 	bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
 	struct bpf_insn *insn = bpf_prog->insnsi;
@@ -800,6 +800,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
 	int i, cnt = 0, excnt = 0;
+	int p;
 	int proglen = 0;
 	u8 *prog = temp;
 
@@ -1410,6 +1411,11 @@ xadd:			if (is_imm8(insn->off))
 			jmp_offset = addrs[i + insn->off] - addrs[i];
 		if (is_imm8(jmp_offset)) {
 			EMIT2(jmp_cond, jmp_offset);
+			ilen = prog - temp;
+			if (ja_padding && (addrs[i] - addrs[i-1]) > ilen) {
+				for (p = 0; p < 4; p++)
+					EMIT1(0x90);
+			}
 		} else if (is_simm32(jmp_offset)) {
 			EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
 		} else {
@@ -1431,12 +1437,17 @@ xadd:			if (is_imm8(insn->off))
 		else
 			jmp_offset = addrs[i + insn->off] - addrs[i];
 
-		if (!jmp_offset)
+		if (!jmp_offset && !ja_padding)
 			/* Optimize out nop jumps */
 			break;
 emit_jmp:
 		if (is_imm8(jmp_offset)) {
 			EMIT2(0xEB, jmp_offset);
+			ilen = prog - temp;
+			if (ja_padding && (addrs[i] - addrs[i-1]) > ilen) {
+				for (p = 0; p < 4; p++)
+					EMIT1(0x90);
+			}
 		} else if (is_simm32(jmp_offset)) {
 			EMIT1_off32(0xE9, jmp_offset);
 		} else {
@@ -1972,6 +1983,9 @@ struct x64_jit_data {
 	struct jit_context ctx;
 };
 
+#define MAX_PASSES 20
+#define PADDING_PASSES (MAX_PASSES - 5)
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	struct bpf_binary_header *header = NULL;
@@ -2043,7 +2057,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	 * pass to emit the final image.
 	 */
 	for (pass = 0; pass < 20 || image; pass++) {
-		proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
+		if (pass < PADDING_PASSES)
+			proglen = do_jit(prog, addrs, image, oldproglen, &ctx, false);
+		else
+			proglen = do_jit(prog, addrs, image, oldproglen, &ctx, true);
 		if (proglen <= 0) {
 out_image:
 			image = NULL;
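One note on the fixed count of four NOPs, going by the x86 encodings this
code emits (my reading, not something the patch itself enforces): a
conditional jump shrinks from 6 bytes (0F 8x + rel32, via EMIT2_off32) to
2 bytes (7x + rel8), leaving a 4-byte gap, but an unconditional jump
shrinks from 5 bytes (E9 + rel32) to 2 bytes (EB + rel8), leaving only 3.
Alexei's "N can be 0 or 4" matches the conditional case; the unconditional
case would need 3, so always padding four bytes can overshoot the previous
size there. Below is a sketch of computing the pad exactly instead,
assuming old_len is the size this insn had in the previous pass (the
addrs[] gap) and new_len is what the current pass just emitted;
emit_nops() is a hypothetical helper, not kernel code:

/*
 * Pad with 1-byte NOPs until the insn occupies exactly the size it had
 * in the previous pass, so addrs[] stops moving and the image converges.
 * Plain C sketch; the real JIT would go through its EMIT macros.
 */
static unsigned char *emit_nops(unsigned char *prog, int old_len, int new_len)
{
	while (new_len++ < old_len)
		*prog++ = 0x90;		/* NOP */
	return prog;
}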