On Wed, Mar 09, 2022 at 12:58 AM +01, Ilya Leoshkevich wrote: > On Tue, 2022-03-08 at 16:01 +0100, Jakub Sitnicki wrote: >> On Tue, Feb 22, 2022 at 07:25 PM +01, Ilya Leoshkevich wrote: >> > Verifier treats bpf_sk_lookup.remote_port as a 32-bit field for >> > backward compatibility, regardless of what the uapi headers say. >> > This field is mapped onto the 16-bit bpf_sk_lookup_kern.sport >> > field. >> > Therefore, accessing the most significant 16 bits of >> > bpf_sk_lookup.remote_port must produce 0, which is currently not >> > the case. >> > >> > The problem is that narrow loads with offset - commit 46f53a65d2de >> > ("bpf: Allow narrow loads with offset > 0"), don't play nicely with >> > the masking optimization - commit 239946314e57 ("bpf: possibly >> > avoid >> > extra masking for narrower load in verifier"). In particular, when >> > we >> > suppress extra masking, we suppress shifting as well, which is not >> > correct. >> > >> > Fix by moving the masking suppression check to BPF_AND generation. >> > >> > Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0") >> > Signed-off-by: Ilya Leoshkevich <iii@xxxxxxxxxxxxx> >> > --- >> > kernel/bpf/verifier.c | 14 +++++++++----- >> > 1 file changed, 9 insertions(+), 5 deletions(-) >> > >> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >> > index d7473fee247c..195f2e9b5a47 100644 >> > --- a/kernel/bpf/verifier.c >> > +++ b/kernel/bpf/verifier.c >> > @@ -12848,7 +12848,7 @@ static int convert_ctx_accesses(struct >> > bpf_verifier_env *env) >> > return -EINVAL; >> > } >> > >> > - if (is_narrower_load && size < target_size) { >> > + if (is_narrower_load) { >> > u8 shift = bpf_ctx_narrow_access_offset( >> > off, size, size_default) * 8; >> > if (shift && cnt + 1 >= >> > ARRAY_SIZE(insn_buf)) { >> > @@ -12860,15 +12860,19 @@ static int convert_ctx_accesses(struct >> > bpf_verifier_env *env) >> > insn_buf[cnt++] = >> > BPF_ALU32_IMM(BPF_RSH, >> > >> > insn->dst_reg, >> > >> > shift); >> > - insn_buf[cnt++] = >> > BPF_ALU32_IMM(BPF_AND, insn->dst_reg, >> > - (1 >> > << size * 8) - 1); >> > + if (size < target_size) >> > + insn_buf[cnt++] = >> > BPF_ALU32_IMM( >> > + BPF_AND, insn- >> > >dst_reg, >> > + (1 << size * 8) - >> > 1); >> > } else { >> > if (shift) >> > insn_buf[cnt++] = >> > BPF_ALU64_IMM(BPF_RSH, >> > >> > insn->dst_reg, >> > >> > shift); >> > - insn_buf[cnt++] = >> > BPF_ALU64_IMM(BPF_AND, insn->dst_reg, >> > - >> > (1ULL >> > << size * 8) - 1); >> > + if (size < target_size) >> > + insn_buf[cnt++] = >> > BPF_ALU64_IMM( >> > + BPF_AND, insn- >> > >dst_reg, >> > + (1ULL << size * 8) >> > - 1); >> > } >> > } >> >> Thanks for patience. I'm coming back to this. >> >> This fix affects the 2-byte load from bpf_sk_lookup.remote_port. >> Dumping the xlated BPF code confirms it. >> >> On LE (x86-64) things look well. >> >> Before this patch: >> >> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36) >> 0: (69) r2 = *(u16 *)(r1 +4) >> 1: (b7) r0 = 0 >> 2: (95) exit >> >> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38) >> 0: (69) r2 = *(u16 *)(r1 +4) >> 1: (b7) r0 = 0 >> 2: (95) exit >> >> After this patch: >> >> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36) >> 0: (69) r2 = *(u16 *)(r1 +4) >> 1: (b7) r0 = 0 >> 2: (95) exit >> >> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38) >> 0: (69) r2 = *(u16 *)(r1 +4) >> 1: (74) w2 >>= 16 >> 2: (b7) r0 = 0 >> 3: (95) exit >> >> Which works great because the JIT generates a zero-extended load >> movzwq: >> >> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36) >> bpf_prog_5e4fe3dbdcb18fd3: >> 0: nopl 0x0(%rax,%rax,1) >> 5: xchg %ax,%ax >> 7: push %rbp >> 8: mov %rsp,%rbp >> b: movzwq 0x4(%rdi),%rsi >> 10: xor %eax,%eax >> 12: leave >> 13: ret >> >> >> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38) >> bpf_prog_4a6336c64a340b96: >> 0: nopl 0x0(%rax,%rax,1) >> 5: xchg %ax,%ax >> 7: push %rbp >> 8: mov %rsp,%rbp >> b: movzwq 0x4(%rdi),%rsi >> 10: shr $0x10,%esi >> 13: xor %eax,%eax >> 15: leave >> 16: ret >> >> Runtime checks for bpf_sk_lookup.remote_port load and the 2-bytes of >> zero padding following it, like below, pass with flying colors: >> >> ok = ctx->remote_port == bpf_htons(8008); >> if (!ok) >> return SK_DROP; >> ok = *((__u16 *)&ctx->remote_port + 1) == 0; >> if (!ok) >> return SK_DROP; >> >> (The above checks compile to half-word (2-byte) loads.) >> >> >> On BE (s390x) things look different: >> >> Before the patch: >> >> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36) >> 0: (69) r2 = *(u16 *)(r1 +4) >> 1: (bc) w2 = w2 >> 2: (b7) r0 = 0 >> 3: (95) exit >> >> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38) >> 0: (69) r2 = *(u16 *)(r1 +4) >> 1: (bc) w2 = w2 >> 2: (b7) r0 = 0 >> 3: (95) exit >> >> After the patch: >> >> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36) >> 0: (69) r2 = *(u16 *)(r1 +4) >> 1: (bc) w2 = w2 >> 2: (74) w2 >>= 16 >> 3: (bc) w2 = w2 >> 4: (b7) r0 = 0 >> 5: (95) exit >> >> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38) >> 0: (69) r2 = *(u16 *)(r1 +4) >> 1: (bc) w2 = w2 >> 2: (b7) r0 = 0 >> 3: (95) exit >> >> These compile to: >> >> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36) >> bpf_prog_fdd58b8caca29f00: >> 0: j 0x0000000000000006 >> 4: nopr >> 6: stmg %r11,%r15,112(%r15) >> c: la %r13,64(%r15) >> 10: aghi %r15,-96 >> 14: llgh %r3,4(%r2,%r0) >> 1a: srl %r3,16 >> 1e: llgfr %r3,%r3 >> 22: lgfi %r14,0 >> 28: lgr %r2,%r14 >> 2c: lmg %r11,%r15,208(%r15) >> 32: br %r14 >> >> >> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38) >> bpf_prog_5e3d8e92223c6841: >> 0: j 0x0000000000000006 >> 4: nopr >> 6: stmg %r11,%r15,112(%r15) >> c: la %r13,64(%r15) >> 10: aghi %r15,-96 >> 14: llgh %r3,4(%r2,%r0) >> 1a: lgfi %r14,0 >> 20: lgr %r2,%r14 >> 24: lmg %r11,%r15,208(%r15) >> 2a: br %r14 >> >> Now, we right shift the value when loading >> >> *(u16 *)(r1 +36) >> >> which in C BPF is equivalent to >> >> *((__u16 *)&ctx->remote_port + 0) >> >> due to how the shift is calculated by bpf_ctx_narrow_access_offset(). > > Right, that's exactly the intention here. > The way I see the situation is: the ABI forces us to treat remote_port > as a 32-bit field, even though the updated header now says otherwise. > And this: > > unsigned int remote_port; > unsigned short result = *(unsigned short *)remote_port; > > should be the same as: > > unsigned short result = remote_port >> 16; > > on big-endian. Note that this is inherently non-portable. > >> This makes the expected typical use-case >> >> ctx->remote_port == bpf_htons(8008) >> >> fail on s390x because llgh (Load Logical Halfword (64<-16)) seems to >> lay >> out the data in the destination register so that it holds >> 0x0000_0000_0000_1f48. >> >> I don't know that was the intention here, as it makes the BPF C code >> non-portable. >> >> WDYT? > > This depends on how we define the remote_port field. I would argue that > the definition from patch 2 - even though ugly - is the correct one. > It is consistent with both the little-endian (1f 48 00 00) and > big-endian (00 00 1f 48) ABIs. > > I don't think the current definition is correct, because it expects > 1f 48 00 00 on big-endian, and this is not the case. We can verify this > by taking 9a69e2^ and applying > > --- a/tools/testing/selftests/bpf/progs/test_sk_lookup.c > +++ b/tools/testing/selftests/bpf/progs/test_sk_lookup.c > @@ -417,6 +417,8 @@ int ctx_narrow_access(struct bpf_sk_lookup *ctx) > return SK_DROP; > if (LSW(ctx->remote_port, 0) != SRC_PORT) > return SK_DROP; > + if (ctx->remote_port != SRC_PORT) > + return SK_DROP; > > /* Narrow loads from local_port field. Expect DST_PORT. */ > if (LSB(ctx->local_port, 0) != ((DST_PORT >> 0) & 0xff) || > > Therefore that > > ctx->remote_port == bpf_htons(8008) > > fails without patch 2 is as expected. > Consider this - today the below is true on both LE and BE, right? *(u32 *)&ctx->remote_port == *(u16 *)&ctx->remote_port because the loads get converted to: *(u16 *)&ctx_kern->sport == *(u16 *)&ctx_kern->sport IOW, today, because of the bug that you are fixing here, the data layout changes from the PoV of the BPF program depending on the load size. With 2-byte loads, without this patch, the data layout appears as: struct bpf_sk_lookup { ... __be16 remote_port; __be16 remote_port; ... } While for 4-byte loads, it appears as in your 2nd patch: struct bpf_sk_lookup { ... #if little-endian __be16 remote_port; __u16 :16; /* zero padding */ #elif big-endian __u16 :16; /* zero padding */ __be16 remote_port; #endif ... } Because of that I don't see how we could keep complete ABI compatiblity, and have just one definition of struct bpf_sk_lookup that reflects it. These are conflicting requirements. I'd bite the bullet for 4-byte loads, for the sake of having an endian-agnostic struct bpf_sk_lookup and struct bpf_sock definition in the uAPI header. The sacrifice here is that the access converter will have to keep rewriting 4-byte access to bpf_sk_lookup.remote_port and bpf_sock.dst_port in this unexpected, quirky manner. The expectation is that with time users will recompile their BPF progs against the updated bpf.h, and switch to 2-byte loads. That will make the quirk in the access converter dead code in time. I don't have any better ideas. Sorry. [...]