On Fri, Dec 04, 2020 at 11:42:13AM +0800, Gary Lin wrote:
> On Thu, Dec 03, 2020 at 10:14:31AM -0800, Alexei Starovoitov wrote:
> > On Thu, Dec 03, 2020 at 12:20:38PM +0100, Eric Dumazet wrote:
> > >
> > > On 12/3/20 10:12 AM, Gary Lin wrote:
> > > > The x64 bpf jit expects bpf images to converge within the given number
> > > > of passes, but it can fail to do so in some corner cases. For example:
> > > >
> > > >   l0:	ldh [4]
> > > >   l1:	jeq #0x537d, l2, l40
> > > >   l2:	ld [0]
> > > >   l3:	jeq #0xfa163e0d, l4, l40
> > > >   l4:	ldh [12]
> > > >   l5:	ldx #0xe
> > > >   l6:	jeq #0x86dd, l41, l7
> > > >   l8:	ld [x+16]
> > > >   l9:	ja 41
> > > >
> > > >   [... repeated ja 41 ]
> > > >
> > > >   l40:	ja 41
> > > >   l41:	ret #0
> > > >   l42:	ld #len
> > > >   l43:	ret a
> > > >
> > > > This bpf program contains 32 "ja 41" instructions, which are effectively
> > > > NOPs and designed to be replaced with valid code dynamically. Ideally,
> > > > the bpf jit should optimize those "ja 41" instructions out when
> > > > translating the bpf instructions into x86_64 machine code. However,
> > > > do_jit() can only remove one "ja 41" for offset==0 on each pass, so it
> > > > requires at least 32 passes to eliminate those JMPs, which exceeds the
> > > > current limit of 20 passes. In the end, the program gets rejected when
> > > > BPF_JIT_ALWAYS_ON is set, even though it is a legitimate classic socket
> > > > filter.
> > > >
> > > > Since these kinds of programs are usually handcrafted rather than
> > > > generated by LLVM, they tend to be small. To avoid increasing the
> > > > complexity of the BPF JIT, this commit just bumps the number of passes
> > > > to 64, as suggested by Daniel, to make such failures less likely.
> > >
> > > Another idea would be to stop trying to reduce the size of the generated
> > > code after a given number of passes have been attempted, because even a
> > > limit of 64 won't ensure that all 'valid' programs can be JITed.
> >
> > +1.
> > Bumping the limit is not solving anything. It only allows bad actors to
> > force the kernel to spend more time in the JIT. If we're holding locks,
> > the longer looping may cause issues. I think the JIT is parallel enough,
> > but it's still a concern.
> >
> > I wonder how assemblers deal with it? They probably face the same issue.
> >
> > Instead of going back to 32-bit jumps and suddenly increasing the image
> > size, I think we can do NOP padding instead. After a few passes, every
> > insn is more or less optimal. I think the fix could be something like:
> >
> > if (is_imm8(jmp_offset)) {
> >         EMIT2(jmp_cond, jmp_offset);
> >         if (loop_cnt > 5) {
> >                 EMIT N nops
> >                 where N = addrs[i] - addrs[i - 1]; // not sure about this math.
> >                 N can be 0 or 4 here.
> >                 // or maybe NOPs should be emitted before EMIT2.
> >                 // need to think it through
> >         }
> > }
>
> This looks promising. Once we switch to NOP padding, the image is likely
> to converge soon. Maybe we can postpone the padding to the last 5 passes
> so that do_jit() can still optimize the image a bit more before that.
>
> > Will something like this work?
>
> I think that's what you're suggesting, right?
>
> Besides NOP padding, the optimization for 0-offset jumps also has to be
> disabled, since it's actually the one causing the image to shrink in my
> case.
>
> Gary Lin
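To see why the unpadded image needs roughly one pass per "ja 41", here is
a small userspace simulation of the sizing loop. This is only a sketch of
the mechanism, not kernel code: it models every insn as one of the 32
jumps, decides convergence per insn rather than by total image length, and
pads a shrunken jump to exactly its previous size, whereas the testing
patch below always emits four NOPs.

#include <stdio.h>
#include <string.h>

#define NJMP	32	/* the 32 "ja 41" instructions, all jumping to the end */

/*
 * Size of one unconditional jump: 0 bytes if the zero-offset jump is
 * optimized out, 2 bytes (EB rel8) if the offset fits in imm8, else
 * 5 bytes (E9 rel32). With padding on, never emit less than last pass.
 */
static int jmp_size(int off, int padding, int prev_size)
{
	int sz;

	if (off == 0 && !padding)
		sz = 0;
	else if (off >= -128 && off <= 127)
		sz = 2;
	else
		sz = 5;
	if (padding && sz < prev_size)
		sz = prev_size;	/* simplification: pad to exactly the old size */
	return sz;
}

/* Re-run the sizing pass until no instruction changes size. */
static int passes_to_converge(int first_pad_pass)
{
	int size[NJMP], addr[NJMP], pass, i;

	for (i = 0; i < NJMP; i++)
		size[i] = 64;	/* initial per-insn estimate, as in the kernel */

	for (pass = 1; ; pass++) {
		int padding = pass >= first_pad_pass;
		int new_size[NJMP], off, changed = 0;

		for (i = 0; i < NJMP; i++)	/* addr[i] = end offset of insn i */
			addr[i] = (i ? addr[i - 1] : 0) + size[i];
		for (i = 0; i < NJMP; i++) {
			off = addr[NJMP - 1] - addr[i];	/* distance to the target */
			new_size[i] = jmp_size(off, padding, size[i]);
			changed |= new_size[i] != size[i];
		}
		memcpy(size, new_size, sizeof(size));
		if (!changed)
			return pass;
	}
}

int main(void)
{
	printf("passes without padding:       %d\n", passes_to_converge(1000));
	printf("passes, padding from pass 15: %d\n", passes_to_converge(15));
	return 0;
}

In this toy model the unpadded variant needs 33 passes, since only the
last remaining zero-offset jump gets dropped each time, so it blows
through the 20-pass limit, while the padded variant settles one pass
after padding kicks in.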
Here is my testing patch. My sample program got accepted with this patch.
I haven't done further testing yet, though.

---
 arch/x86/net/bpf_jit_comp.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 796506dcfc42..6a39c5ba6383 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -790,7 +790,7 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
 }
 
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
-		  int oldproglen, struct jit_context *ctx)
+		  int oldproglen, struct jit_context *ctx, bool ja_padding)
 {
 	bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
 	struct bpf_insn *insn = bpf_prog->insnsi;
@@ -800,6 +800,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
 	int i, cnt = 0, excnt = 0;
+	int p;
 	int proglen = 0;
 	u8 *prog = temp;
 
@@ -1410,6 +1411,11 @@ xadd:			if (is_imm8(insn->off))
 			jmp_offset = addrs[i + insn->off] - addrs[i];
 		if (is_imm8(jmp_offset)) {
 			EMIT2(jmp_cond, jmp_offset);
+			ilen = prog - temp;
+			if (ja_padding && (addrs[i] - addrs[i-1]) > ilen) {
+				for (p = 0; p < 4; p++)
+					EMIT1(0x90);
+			}
 		} else if (is_simm32(jmp_offset)) {
 			EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
 		} else {
@@ -1431,12 +1437,17 @@ xadd:			if (is_imm8(insn->off))
 		else
 			jmp_offset = addrs[i + insn->off] - addrs[i];
 
-		if (!jmp_offset)
+		if (!jmp_offset && !ja_padding)
 			/* Optimize out nop jumps */
 			break;
 emit_jmp:
 		if (is_imm8(jmp_offset)) {
 			EMIT2(0xEB, jmp_offset);
+			ilen = prog - temp;
+			if (ja_padding && (addrs[i] - addrs[i-1]) > ilen) {
+				for (p = 0; p < 4; p++)
+					EMIT1(0x90);
+			}
 		} else if (is_simm32(jmp_offset)) {
 			EMIT1_off32(0xE9, jmp_offset);
 		} else {
@@ -1972,6 +1983,9 @@ struct x64_jit_data {
 	struct jit_context ctx;
 };
 
+#define MAX_PASSES 20
+#define PADDING_PASSES (MAX_PASSES - 5)
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	struct bpf_binary_header *header = NULL;
@@ -2043,7 +2057,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	 * pass to emit the final image.
 	 */
 	for (pass = 0; pass < 20 || image; pass++) {
-		proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
+		if (pass < PADDING_PASSES)
+			proglen = do_jit(prog, addrs, image, oldproglen, &ctx, false);
+		else
+			proglen = do_jit(prog, addrs, image, oldproglen, &ctx, true);
 		if (proglen <= 0) {
 out_image:
 			image = NULL;
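One note on the fixed count of four NOPs, going by the x86 encodings this
code emits (my reading, not something the patch itself enforces): a
conditional jump shrinks from 6 bytes (0F 8x + rel32, via EMIT2_off32) to
2 bytes (7x + rel8), leaving a 4-byte gap, but an unconditional jump
shrinks from 5 bytes (E9 + rel32) to 2 bytes (EB + rel8), leaving only 3.
Alexei's "N can be 0 or 4" matches the conditional case; the unconditional
case would need 3, so always padding four bytes can overshoot the previous
size there. Below is a sketch of computing the pad exactly instead,
assuming old_len is the size this insn had in the previous pass (the
addrs[] gap) and new_len is what the current pass just emitted;
emit_nops() is a hypothetical helper, not kernel code:

/*
 * Pad with 1-byte NOPs until the insn occupies exactly the size it had
 * in the previous pass, so addrs[] stops moving and the image converges.
 * Plain C sketch; the real JIT would go through its EMIT macros.
 */
static unsigned char *emit_nops(unsigned char *prog, int old_len, int new_len)
{
	while (new_len++ < old_len)
		*prog++ = 0x90;		/* NOP */
	return prog;
}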