On Tue, 2022-03-08 at 16:01 +0100, Jakub Sitnicki wrote:
> On Tue, Feb 22, 2022 at 07:25 PM +01, Ilya Leoshkevich wrote:
> > The verifier treats bpf_sk_lookup.remote_port as a 32-bit field for
> > backward compatibility, regardless of what the uapi headers say.
> > This field is mapped onto the 16-bit bpf_sk_lookup_kern.sport field.
> > Therefore, accessing the most significant 16 bits of
> > bpf_sk_lookup.remote_port must produce 0, which is currently not the
> > case.
> >
> > The problem is that narrow loads with offset - commit 46f53a65d2de
> > ("bpf: Allow narrow loads with offset > 0") - don't play nicely with
> > the masking optimization - commit 239946314e57 ("bpf: possibly avoid
> > extra masking for narrower load in verifier"). In particular, when we
> > suppress extra masking, we suppress shifting as well, which is not
> > correct.
> >
> > Fix by moving the masking suppression check to BPF_AND generation.
> >
> > Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0")
> > Signed-off-by: Ilya Leoshkevich <iii@xxxxxxxxxxxxx>
> > ---
> >  kernel/bpf/verifier.c | 14 +++++++++-----
> >  1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index d7473fee247c..195f2e9b5a47 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -12848,7 +12848,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >  			return -EINVAL;
> >  		}
> >
> > -		if (is_narrower_load && size < target_size) {
> > +		if (is_narrower_load) {
> >  			u8 shift = bpf_ctx_narrow_access_offset(
> >  				off, size, size_default) * 8;
> >  			if (shift && cnt + 1 >= ARRAY_SIZE(insn_buf)) {
> > @@ -12860,15 +12860,19 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >  					insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
> >  									insn->dst_reg,
> >  									shift);
> > -				insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
> > -								(1 << size * 8) - 1);
> > +				if (size < target_size)
> > +					insn_buf[cnt++] = BPF_ALU32_IMM(
> > +						BPF_AND, insn->dst_reg,
> > +						(1 << size * 8) - 1);
> >  			} else {
> >  				if (shift)
> >  					insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
> >  									insn->dst_reg,
> >  									shift);
> > -				insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
> > -								(1ULL << size * 8) - 1);
> > +				if (size < target_size)
> > +					insn_buf[cnt++] = BPF_ALU64_IMM(
> > +						BPF_AND, insn->dst_reg,
> > +						(1ULL << size * 8) - 1);
> >  			}
> >  		}
>
> Thanks for patience. I'm coming back to this.
>
> This fix affects the 2-byte load from bpf_sk_lookup.remote_port.
> Dumping the xlated BPF code confirms it.
>
> On LE (x86-64) things look good.
> Before this patch:
>
> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>
>   0: (69) r2 = *(u16 *)(r1 +4)
>   1: (b7) r0 = 0
>   2: (95) exit
>
> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>
>   0: (69) r2 = *(u16 *)(r1 +4)
>   1: (b7) r0 = 0
>   2: (95) exit
>
> After this patch:
>
> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>
>   0: (69) r2 = *(u16 *)(r1 +4)
>   1: (b7) r0 = 0
>   2: (95) exit
>
> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>
>   0: (69) r2 = *(u16 *)(r1 +4)
>   1: (74) w2 >>= 16
>   2: (b7) r0 = 0
>   3: (95) exit
>
> Which works great, because the JIT generates a zero-extending load,
> movzwq:
>
> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>
>   bpf_prog_5e4fe3dbdcb18fd3:
>      0:	nopl   0x0(%rax,%rax,1)
>      5:	xchg   %ax,%ax
>      7:	push   %rbp
>      8:	mov    %rsp,%rbp
>      b:	movzwq 0x4(%rdi),%rsi
>     10:	xor    %eax,%eax
>     12:	leave
>     13:	ret
>
> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>
>   bpf_prog_4a6336c64a340b96:
>      0:	nopl   0x0(%rax,%rax,1)
>      5:	xchg   %ax,%ax
>      7:	push   %rbp
>      8:	mov    %rsp,%rbp
>      b:	movzwq 0x4(%rdi),%rsi
>     10:	shr    $0x10,%esi
>     13:	xor    %eax,%eax
>     15:	leave
>     16:	ret
>
> Runtime checks for the bpf_sk_lookup.remote_port load and the 2 bytes
> of zero padding following it, like below, pass with flying colors:
>
> 	ok = ctx->remote_port == bpf_htons(8008);
> 	if (!ok)
> 		return SK_DROP;
> 	ok = *((__u16 *)&ctx->remote_port + 1) == 0;
> 	if (!ok)
> 		return SK_DROP;
>
> (The above checks compile to half-word (2-byte) loads.)
>
> On BE (s390x) things look different.
>
> Before the patch:
>
> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>
>   0: (69) r2 = *(u16 *)(r1 +4)
>   1: (bc) w2 = w2
>   2: (b7) r0 = 0
>   3: (95) exit
>
> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>
>   0: (69) r2 = *(u16 *)(r1 +4)
>   1: (bc) w2 = w2
>   2: (b7) r0 = 0
>   3: (95) exit
>
> After the patch:
>
> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>
>   0: (69) r2 = *(u16 *)(r1 +4)
>   1: (bc) w2 = w2
>   2: (74) w2 >>= 16
>   3: (bc) w2 = w2
>   4: (b7) r0 = 0
>   5: (95) exit
>
> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>
>   0: (69) r2 = *(u16 *)(r1 +4)
>   1: (bc) w2 = w2
>   2: (b7) r0 = 0
>   3: (95) exit
>
> These compile to:
>
> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>
>   bpf_prog_fdd58b8caca29f00:
>      0:	j	0x0000000000000006
>      4:	nopr
>      6:	stmg	%r11,%r15,112(%r15)
>      c:	la	%r13,64(%r15)
>     10:	aghi	%r15,-96
>     14:	llgh	%r3,4(%r2,%r0)
>     1a:	srl	%r3,16
>     1e:	llgfr	%r3,%r3
>     22:	lgfi	%r14,0
>     28:	lgr	%r2,%r14
>     2c:	lmg	%r11,%r15,208(%r15)
>     32:	br	%r14
>
> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>
>   bpf_prog_5e3d8e92223c6841:
>      0:	j	0x0000000000000006
>      4:	nopr
>      6:	stmg	%r11,%r15,112(%r15)
>      c:	la	%r13,64(%r15)
>     10:	aghi	%r15,-96
>     14:	llgh	%r3,4(%r2,%r0)
>     1a:	lgfi	%r14,0
>     20:	lgr	%r2,%r14
>     24:	lmg	%r11,%r15,208(%r15)
>     2a:	br	%r14
>
> Now, we right shift the value when loading
>
>   *(u16 *)(r1 +36)
>
> which in C BPF is equivalent to
>
>   *((__u16 *)&ctx->remote_port + 0)
>
> due to how the shift is calculated by bpf_ctx_narrow_access_offset().

Right, that's exactly the intention here. The way I see the situation
is: the ABI forces us to treat remote_port as a 32-bit field, even
though the updated header now says otherwise. So this:

	unsigned int remote_port;
	unsigned short result = *(unsigned short *)&remote_port;

should be the same as:

	unsigned short result = remote_port >> 16;

on big-endian. Note that this is inherently non-portable.
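To make that concrete, here is a minimal user-space sketch (plain C,
nothing BPF-specific; memcpy stands in for the 2-byte load so the
aliasing stays well-defined, and the initializer merely models port
8008 sitting in the 32-bit field):

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		uint32_t remote_port = 0x00001f48;	/* example: port 8008 */
		uint16_t half;

		/* A 2-byte load from the start of the field, i.e. the
		 * *(unsigned short *)&remote_port access from above.
		 */
		memcpy(&half, &remote_port, sizeof(half));

		/* Big-endian: prints 0x0 (== remote_port >> 16).
		 * Little-endian: prints 0x1f48 (== remote_port & 0xffff).
		 */
		printf("0x%x\n", half);

		return 0;
	}

The same source-level access reads different halves of the field
depending on the host byte order - which is exactly the
non-portability in question.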
> This makes the expected typical use-case
>
>   ctx->remote_port == bpf_htons(8008)
>
> fail on s390x, because llgh (Load Logical Halfword (64<-16)) seems to
> lay out the data in the destination register so that it holds
> 0x0000_0000_0000_1f48.
>
> I don't know if that was the intention here, as it makes the BPF C
> code non-portable.
>
> WDYT?

This depends on how we define the remote_port field. I would argue
that the definition from patch 2 - even though it is ugly - is the
correct one. It is consistent with both the little-endian
(1f 48 00 00) and the big-endian (00 00 1f 48) ABIs.

I don't think the current definition is correct, because it expects
1f 48 00 00 on big-endian, and this is not the case. We can verify
this by taking 9a69e2^ and applying:

--- a/tools/testing/selftests/bpf/progs/test_sk_lookup.c
+++ b/tools/testing/selftests/bpf/progs/test_sk_lookup.c
@@ -417,6 +417,8 @@ int ctx_narrow_access(struct bpf_sk_lookup *ctx)
 		return SK_DROP;
 	if (LSW(ctx->remote_port, 0) != SRC_PORT)
 		return SK_DROP;
+	if (ctx->remote_port != SRC_PORT)
+		return SK_DROP;
 
 	/* Narrow loads from local_port field. Expect DST_PORT. */
 	if (LSB(ctx->local_port, 0) != ((DST_PORT >> 0) & 0xff) ||

Therefore, the fact that

	ctx->remote_port == bpf_htons(8008)

fails without patch 2 is expected.

> BTW. Out of curiosity, how does a Logical Load Halfword (llgh) differ
> from a non-logical Load Halfword (lgh) on s390x? Compiler Explorer
> generates a non-logical load for similar C code.

The logical one does zero extension, and the regular one does sign
extension. The following

	unsigned long foo(unsigned short *bar)
	{
		return *bar;
	}

is compiled to

	foo(unsigned short*):
		llgh	%r2,0(%r2)
		br	%r14

with -O3. Without -O3, zeroing out the upper bits is done using sllg
and srlg (logical shift left and logical shift right, respectively).
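For comparison, here is a sketch of the signed counterpart -
presumably close to what Compiler Explorer generated, given a plain
(signed) short:

	long foo(short *bar)
	{
		return *bar;
	}

which -O3 should compile to something like:

	foo(short*):
		lgh	%r2,0(%r2)
		br	%r14

Here the upper bits of %r2 become copies of the sign bit rather than
zeroes, so the signedness of the C type is what picks the instruction.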