Re: [PATCH RFC bpf-next 1/3] bpf: Fix certain narrow loads with offsets

Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> · Wed, 09 Mar 2022 09:36:33 +0100

On Wed, Mar 09, 2022 at 12:58 AM +01, Ilya Leoshkevich wrote:
> On Tue, 2022-03-08 at 16:01 +0100, Jakub Sitnicki wrote:
>> On Tue, Feb 22, 2022 at 07:25 PM +01, Ilya Leoshkevich wrote:
>> > Verifier treats bpf_sk_lookup.remote_port as a 32-bit field for
>> > backward compatibility, regardless of what the uapi headers say.
>> > This field is mapped onto the 16-bit bpf_sk_lookup_kern.sport
>> > field.
>> > Therefore, accessing the most significant 16 bits of
>> > bpf_sk_lookup.remote_port must produce 0, which is currently not
>> > the case.
>> > 
>> > The problem is that narrow loads with offset - commit 46f53a65d2de
>> > ("bpf: Allow narrow loads with offset > 0"), don't play nicely with
>> > the masking optimization - commit 239946314e57 ("bpf: possibly
>> > avoid
>> > extra masking for narrower load in verifier"). In particular, when
>> > we
>> > suppress extra masking, we suppress shifting as well, which is not
>> > correct.
>> > 
>> > Fix by moving the masking suppression check to BPF_AND generation.
>> > 
>> > Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0")
>> > Signed-off-by: Ilya Leoshkevich <iii@xxxxxxxxxxxxx>
>> > ---
>> >  kernel/bpf/verifier.c | 14 +++++++++-----
>> >  1 file changed, 9 insertions(+), 5 deletions(-)
>> > 
>> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> > index d7473fee247c..195f2e9b5a47 100644
>> > --- a/kernel/bpf/verifier.c
>> > +++ b/kernel/bpf/verifier.c
>> > @@ -12848,7 +12848,7 @@ static int convert_ctx_accesses(struct
>> > bpf_verifier_env *env)
>> >                         return -EINVAL;
>> >                 }
>> >  
>> > -               if (is_narrower_load && size < target_size) {
>> > +               if (is_narrower_load) {
>> >                         u8 shift = bpf_ctx_narrow_access_offset(
>> >                                 off, size, size_default) * 8;
>> >                         if (shift && cnt + 1 >=
>> > ARRAY_SIZE(insn_buf)) {
>> > @@ -12860,15 +12860,19 @@ static int convert_ctx_accesses(struct
>> > bpf_verifier_env *env)
>> >                                         insn_buf[cnt++] =
>> > BPF_ALU32_IMM(BPF_RSH,
>> >                                                                    
>> >      insn->dst_reg,
>> >                                                                    
>> >      shift);
>> > -                               insn_buf[cnt++] =
>> > BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
>> > -                                                               (1
>> > << size * 8) - 1);
>> > +                               if (size < target_size)
>> > +                                       insn_buf[cnt++] =
>> > BPF_ALU32_IMM(
>> > +                                               BPF_AND, insn-
>> > >dst_reg,
>> > +                                               (1 << size * 8) -
>> > 1);
>> >                         } else {
>> >                                 if (shift)
>> >                                         insn_buf[cnt++] =
>> > BPF_ALU64_IMM(BPF_RSH,
>> >                                                                    
>> >      insn->dst_reg,
>> >                                                                    
>> >      shift);
>> > -                               insn_buf[cnt++] =
>> > BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
>> > -
>> >                                                                (1ULL
>> >  << size * 8) - 1);
>> > +                               if (size < target_size)
>> > +                                       insn_buf[cnt++] =
>> > BPF_ALU64_IMM(
>> > +                                               BPF_AND, insn-
>> > >dst_reg,
>> > +                                               (1ULL << size * 8)
>> > - 1);
>> >                         }
>> >                 }
>> 
>> Thanks for patience. I'm coming back to this.
>> 
>> This fix affects the 2-byte load from bpf_sk_lookup.remote_port.
>> Dumping the xlated BPF code confirms it.
>> 
>> On LE (x86-64) things look well.
>> 
>> Before this patch:
>> 
>> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>>    0: (69) r2 = *(u16 *)(r1 +4)
>>    1: (b7) r0 = 0
>>    2: (95) exit
>> 
>> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>>    0: (69) r2 = *(u16 *)(r1 +4)
>>    1: (b7) r0 = 0
>>    2: (95) exit
>> 
>> After this patch:
>> 
>> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>>    0: (69) r2 = *(u16 *)(r1 +4)
>>    1: (b7) r0 = 0
>>    2: (95) exit
>> 
>> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>>    0: (69) r2 = *(u16 *)(r1 +4)
>>    1: (74) w2 >>= 16
>>    2: (b7) r0 = 0
>>    3: (95) exit
>> 
>> Which works great because the JIT generates a zero-extended load
>> movzwq:
>> 
>> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>> bpf_prog_5e4fe3dbdcb18fd3:
>>    0:   nopl   0x0(%rax,%rax,1)
>>    5:   xchg   %ax,%ax
>>    7:   push   %rbp
>>    8:   mov    %rsp,%rbp
>>    b:   movzwq 0x4(%rdi),%rsi
>>   10:   xor    %eax,%eax
>>   12:   leave
>>   13:   ret
>> 
>> 
>> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>> bpf_prog_4a6336c64a340b96:
>>    0:   nopl   0x0(%rax,%rax,1)
>>    5:   xchg   %ax,%ax
>>    7:   push   %rbp
>>    8:   mov    %rsp,%rbp
>>    b:   movzwq 0x4(%rdi),%rsi
>>   10:   shr    $0x10,%esi
>>   13:   xor    %eax,%eax
>>   15:   leave
>>   16:   ret
>> 
>> Runtime checks for bpf_sk_lookup.remote_port load and the 2-bytes of
>> zero padding following it, like below, pass with flying colors:
>> 
>>         ok = ctx->remote_port == bpf_htons(8008);
>>         if (!ok)
>>                 return SK_DROP;
>>         ok = *((__u16 *)&ctx->remote_port + 1) == 0;
>>         if (!ok)
>>                 return SK_DROP;
>> 
>> (The above checks compile to half-word (2-byte) loads.)
>> 
>> 
>> On BE (s390x) things look different:
>> 
>> Before the patch:
>> 
>> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>>    0: (69) r2 = *(u16 *)(r1 +4)
>>    1: (bc) w2 = w2
>>    2: (b7) r0 = 0
>>    3: (95) exit
>> 
>> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>>    0: (69) r2 = *(u16 *)(r1 +4)
>>    1: (bc) w2 = w2
>>    2: (b7) r0 = 0
>>    3: (95) exit
>> 
>> After the patch:
>> 
>> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>>    0: (69) r2 = *(u16 *)(r1 +4)
>>    1: (bc) w2 = w2
>>    2: (74) w2 >>= 16
>>    3: (bc) w2 = w2
>>    4: (b7) r0 = 0
>>    5: (95) exit
>> 
>> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>>    0: (69) r2 = *(u16 *)(r1 +4)
>>    1: (bc) w2 = w2
>>    2: (b7) r0 = 0
>>    3: (95) exit
>> 
>> These compile to:
>> 
>> * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
>> bpf_prog_fdd58b8caca29f00:
>>    0:   j       0x0000000000000006
>>    4:   nopr
>>    6:   stmg    %r11,%r15,112(%r15)
>>    c:   la      %r13,64(%r15)
>>   10:   aghi    %r15,-96
>>   14:   llgh    %r3,4(%r2,%r0)
>>   1a:   srl     %r3,16
>>   1e:   llgfr   %r3,%r3
>>   22:   lgfi    %r14,0
>>   28:   lgr     %r2,%r14
>>   2c:   lmg     %r11,%r15,208(%r15)
>>   32:   br      %r14
>> 
>> 
>> * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
>> bpf_prog_5e3d8e92223c6841:
>>    0:   j       0x0000000000000006
>>    4:   nopr
>>    6:   stmg    %r11,%r15,112(%r15)
>>    c:   la      %r13,64(%r15)
>>   10:   aghi    %r15,-96
>>   14:   llgh    %r3,4(%r2,%r0)
>>   1a:   lgfi    %r14,0
>>   20:   lgr     %r2,%r14
>>   24:   lmg     %r11,%r15,208(%r15)
>>   2a:   br      %r14
>> 
>> Now, we right shift the value when loading
>> 
>>   *(u16 *)(r1 +36)
>> 
>> which in C BPF is equivalent to
>> 
>>   *((__u16 *)&ctx->remote_port + 0)
>> 
>> due to how the shift is calculated by bpf_ctx_narrow_access_offset().
>
> Right, that's exactly the intention here.
> The way I see the situation is: the ABI forces us to treat remote_port
> as a 32-bit field, even though the updated header now says otherwise.
> And this:
>
>     unsigned int remote_port;
>     unsigned short result = *(unsigned short *)remote_port;
>
> should be the same as:
>
>     unsigned short result = remote_port >> 16;
>
> on big-endian. Note that this is inherently non-portable.

>
>> This makes the expected typical use-case
>> 
>>   ctx->remote_port == bpf_htons(8008)
>> 
>> fail on s390x because llgh (Load Logical Halfword (64<-16)) seems to
>> lay
>> out the data in the destination register so that it holds
>> 0x0000_0000_0000_1f48.
>> 
>> I don't know that was the intention here, as it makes the BPF C code
>> non-portable.
>> 
>> WDYT?
>
> This depends on how we define the remote_port field. I would argue that
> the definition from patch 2 - even though ugly - is the correct one.
> It is consistent with both the little-endian (1f 48 00 00) and
> big-endian (00 00 1f 48) ABIs.
>
> I don't think the current definition is correct, because it expects
> 1f 48 00 00 on big-endian, and this is not the case. We can verify this
> by taking 9a69e2^ and applying
>
> --- a/tools/testing/selftests/bpf/progs/test_sk_lookup.c
> +++ b/tools/testing/selftests/bpf/progs/test_sk_lookup.c
> @@ -417,6 +417,8 @@ int ctx_narrow_access(struct bpf_sk_lookup *ctx)
>                 return SK_DROP;
>         if (LSW(ctx->remote_port, 0) != SRC_PORT)
>                 return SK_DROP;
> +       if (ctx->remote_port != SRC_PORT)
> +               return SK_DROP;
>  
>         /* Narrow loads from local_port field. Expect DST_PORT. */
>         if (LSB(ctx->local_port, 0) != ((DST_PORT >> 0) & 0xff) ||
>
> Therefore that
>
>   ctx->remote_port == bpf_htons(8008)
>
> fails without patch 2 is as expected.
>

Consider this - today the below is true on both LE and BE, right?

  *(u32 *)&ctx->remote_port == *(u16 *)&ctx->remote_port

because the loads get converted to:

  *(u16 *)&ctx_kern->sport == *(u16 *)&ctx_kern->sport

IOW, today, because of the bug that you are fixing here, the data layout
changes from the PoV of the BPF program depending on the load size.

With 2-byte loads, without this patch, the data layout appears as:

  struct bpf_sk_lookup {
    ...
    __be16 remote_port;
    __be16 remote_port;
    ...
  }

While for 4-byte loads, it appears as in your 2nd patch:

  struct bpf_sk_lookup {
    ...
    #if little-endian
    __be16 remote_port;
    __u16  :16; /* zero padding */
    #elif big-endian
    __u16  :16; /* zero padding */
    __be16 remote_port;
    #endif
    ...
  }

Because of that I don't see how we could keep complete ABI compatiblity,
and have just one definition of struct bpf_sk_lookup that reflects
it. These are conflicting requirements.

I'd bite the bullet for 4-byte loads, for the sake of having an
endian-agnostic struct bpf_sk_lookup and struct bpf_sock definition in
the uAPI header.

The sacrifice here is that the access converter will have to keep
rewriting 4-byte access to bpf_sk_lookup.remote_port and
bpf_sock.dst_port in this unexpected, quirky manner.

The expectation is that with time users will recompile their BPF progs
against the updated bpf.h, and switch to 2-byte loads. That will make
the quirk in the access converter dead code in time.

I don't have any better ideas. Sorry.

[...]