On Tue, Feb 22, 2022 at 07:25 PM +01, Ilya Leoshkevich wrote:
> Verifier treats bpf_sk_lookup.remote_port as a 32-bit field for
> backward compatibility, regardless of what the uapi headers say.
> This field is mapped onto the 16-bit bpf_sk_lookup_kern.sport field.
> Therefore, accessing the most significant 16 bits of
> bpf_sk_lookup.remote_port must produce 0, which is currently not
> the case.
>
> The problem is that narrow loads with offset - commit 46f53a65d2de
> ("bpf: Allow narrow loads with offset > 0"), don't play nicely with
> the masking optimization - commit 239946314e57 ("bpf: possibly avoid
> extra masking for narrower load in verifier"). In particular, when we
> suppress extra masking, we suppress shifting as well, which is not
> correct.
>
> Fix by moving the masking suppression check to BPF_AND generation.
>
> Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0")
> Signed-off-by: Ilya Leoshkevich <iii@xxxxxxxxxxxxx>
> ---
>  kernel/bpf/verifier.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d7473fee247c..195f2e9b5a47 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -12848,7 +12848,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>  			return -EINVAL;
>  		}
>  
> -		if (is_narrower_load && size < target_size) {
> +		if (is_narrower_load) {
>  			u8 shift = bpf_ctx_narrow_access_offset(
>  				off, size, size_default) * 8;
>  			if (shift && cnt + 1 >= ARRAY_SIZE(insn_buf)) {
> @@ -12860,15 +12860,19 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>  				insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
>  								insn->dst_reg,
>  								shift);
> -			insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
> -							(1 << size * 8) - 1);
> +			if (size < target_size)
> +				insn_buf[cnt++] = BPF_ALU32_IMM(
> +					BPF_AND, insn->dst_reg,
> +					(1 << size * 8) - 1);
>  		} else {
>  			if (shift)
>  				insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
>  								insn->dst_reg,
>  								shift);
> -			insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
> -							(1ULL << size * 8) - 1);
> +			if (size < target_size)
> +				insn_buf[cnt++] = BPF_ALU64_IMM(
> +					BPF_AND, insn->dst_reg,
> +					(1ULL << size * 8) - 1);
>  		}
>  	}

Thanks for your patience. I'm coming back to this.

This fix affects the 2-byte load from bpf_sk_lookup.remote_port. Dumping
the xlated BPF code confirms it. On LE (x86-64) things look good.
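As a reminder - it will matter for the BE case below - the shift is
computed by bpf_ctx_narrow_access_offset(). Quoting it from memory, so
please double-check me against include/linux/filter.h, it is roughly:

  static inline u8
  bpf_ctx_narrow_access_offset(u32 off, u32 size, u32 size_default)
  {
          u8 access_off = off & (size_default - 1);

  #ifdef __LITTLE_ENDIAN
          return access_off;
  #else
          return size_default - (access_off + size);
  #endif
  }

For a 4-byte context field (size_default = 4) and a 2-byte load, that
yields a shift of 0/16 bits for offsets 0/2 on LE, and the mirror image,
16/0, on BE - which matches the dumps below.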
Before this patch:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)

  0: (69) r2 = *(u16 *)(r1 +4)
  1: (b7) r0 = 0
  2: (95) exit

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)

  0: (69) r2 = *(u16 *)(r1 +4)
  1: (b7) r0 = 0
  2: (95) exit

After this patch:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)

  0: (69) r2 = *(u16 *)(r1 +4)
  1: (b7) r0 = 0
  2: (95) exit

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)

  0: (69) r2 = *(u16 *)(r1 +4)
  1: (74) w2 >>= 16
  2: (b7) r0 = 0
  3: (95) exit

Which works great, because the JIT generates a zero-extending load,
movzwq:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)

  bpf_prog_5e4fe3dbdcb18fd3:
     0:  nopl   0x0(%rax,%rax,1)
     5:  xchg   %ax,%ax
     7:  push   %rbp
     8:  mov    %rsp,%rbp
     b:  movzwq 0x4(%rdi),%rsi
    10:  xor    %eax,%eax
    12:  leave
    13:  ret

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)

  bpf_prog_4a6336c64a340b96:
     0:  nopl   0x0(%rax,%rax,1)
     5:  xchg   %ax,%ax
     7:  push   %rbp
     8:  mov    %rsp,%rbp
     b:  movzwq 0x4(%rdi),%rsi
    10:  shr    $0x10,%esi
    13:  xor    %eax,%eax
    15:  leave
    16:  ret

Runtime checks for the bpf_sk_lookup.remote_port load and the 2 bytes
of zero padding following it, like the ones below, pass with flying
colors:

  ok = ctx->remote_port == bpf_htons(8008);
  if (!ok)
          return SK_DROP;
  ok = *((__u16 *)&ctx->remote_port + 1) == 0;
  if (!ok)
          return SK_DROP;

(The above checks compile to half-word (2-byte) loads.)

On BE (s390x) things look different:

Before the patch:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)

  0: (69) r2 = *(u16 *)(r1 +4)
  1: (bc) w2 = w2
  2: (b7) r0 = 0
  3: (95) exit

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)

  0: (69) r2 = *(u16 *)(r1 +4)
  1: (bc) w2 = w2
  2: (b7) r0 = 0
  3: (95) exit

After the patch:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)

  0: (69) r2 = *(u16 *)(r1 +4)
  1: (bc) w2 = w2
  2: (74) w2 >>= 16
  3: (bc) w2 = w2
  4: (b7) r0 = 0
  5: (95) exit

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)

  0: (69) r2 = *(u16 *)(r1 +4)
  1: (bc) w2 = w2
  2: (b7) r0 = 0
  3: (95) exit

These compile to:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)

  bpf_prog_fdd58b8caca29f00:
     0:  j      0x0000000000000006
     4:  nopr
     6:  stmg   %r11,%r15,112(%r15)
     c:  la     %r13,64(%r15)
    10:  aghi   %r15,-96
    14:  llgh   %r3,4(%r2,%r0)
    1a:  srl    %r3,16
    1e:  llgfr  %r3,%r3
    22:  lgfi   %r14,0
    28:  lgr    %r2,%r14
    2c:  lmg    %r11,%r15,208(%r15)
    32:  br     %r14

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)

  bpf_prog_5e3d8e92223c6841:
     0:  j      0x0000000000000006
     4:  nopr
     6:  stmg   %r11,%r15,112(%r15)
     c:  la     %r13,64(%r15)
    10:  aghi   %r15,-96
    14:  llgh   %r3,4(%r2,%r0)
    1a:  lgfi   %r14,0
    20:  lgr    %r2,%r14
    24:  lmg    %r11,%r15,208(%r15)
    2a:  br     %r14

Now, we right shift the value when loading *(u16 *)(r1 +36), which in
BPF C is equivalent to *((__u16 *)&ctx->remote_port + 0), due to how
the shift is calculated by bpf_ctx_narrow_access_offset(). This makes
the expected typical use-case, ctx->remote_port == bpf_htons(8008),
fail on s390x, because llgh (Load Logical Halfword (64<-16)) seems to
lay out the data in the destination register so that it holds
0x0000_0000_0000_1f48. (There is a short worked example of this at the
end of this mail.)

I don't know if that was the intention here, as it makes the BPF C code
non-portable. WDYT?

BTW. Out of curiosity, how does a Logical Load Halfword (llgh) differ
from a non-logical Load Halfword (lgh) on s390x? Compiler Explorer
generates a non-logical load for similar C code.
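P.S. Here is the worked example promised above - a minimal user-space
sketch (my own reconstruction of the register contents, not the actual
test) of what the post-patch xlated sequence for size=2, offset=0 does
to the loaded port value on s390x:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          /* bpf_sk_lookup_kern.sport holds the port in network byte
           * order; on big-endian s390x that is also host byte order. */
          uint16_t sport = 0x1f48;        /* bpf_htons(8008) */

          /* 0: r2 = *(u16 *)(r1 +4) - llgh zero-extends to 64 bits */
          uint64_t r2 = sport;            /* 0x0000_0000_0000_1f48 */

          /* 1: (bc) w2 = w2 - 32-bit mov, zero-extends; a no-op here */
          r2 = (uint32_t)r2;

          /* 2: (74) w2 >>= 16 - the shift emitted for offset=0 on BE */
          r2 = (uint32_t)r2 >> 16;        /* 0 - the port value is gone */

          /* 3: (bc) w2 = w2 - again a no-op */
          r2 = (uint32_t)r2;

          /* ctx->remote_port == bpf_htons(8008) now compares 0 with
           * 0x1f48 and fails, so the check returns SK_DROP. */
          printf("r2 = %#llx\n", (unsigned long long)r2);
          return 0;
  }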