On Wed, 2022-03-09 at 09:36 +0100, Jakub Sitnicki wrote:
> On Wed, Mar 09, 2022 at 12:58 AM +01, Ilya Leoshkevich wrote:
> > On Tue, 2022-03-08 at 16:01 +0100, Jakub Sitnicki wrote:
> > > On Tue, Feb 22, 2022 at 07:25 PM +01, Ilya Leoshkevich wrote:
> > > > Verifier treats bpf_sk_lookup.remote_port as a 32-bit field for
> > > > backward compatibility, regardless of what the uapi headers say.
> > > > This field is mapped onto the 16-bit bpf_sk_lookup_kern.sport field.
> > > > Therefore, accessing the most significant 16 bits of
> > > > bpf_sk_lookup.remote_port must produce 0, which is currently not
> > > > the case.
> > > >
> > > > The problem is that narrow loads with offset - commit 46f53a65d2de
> > > > ("bpf: Allow narrow loads with offset > 0"), don't play nicely with
> > > > the masking optimization - commit 239946314e57 ("bpf: possibly avoid
> > > > extra masking for narrower load in verifier"). In particular, when
> > > > we suppress extra masking, we suppress shifting as well, which is
> > > > not correct.
> > > >
> > > > Fix by moving the masking suppression check to BPF_AND generation.
> > > >
> > > > Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0")
> > > > Signed-off-by: Ilya Leoshkevich <iii@xxxxxxxxxxxxx>
> > > > ---
> > > >  kernel/bpf/verifier.c | 14 +++++++++-----
> > > >  1 file changed, 9 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index d7473fee247c..195f2e9b5a47 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -12848,7 +12848,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> > > >              return -EINVAL;
> > > >          }
> > > >
> > > > -        if (is_narrower_load && size < target_size) {
> > > > +        if (is_narrower_load) {
> > > >              u8 shift = bpf_ctx_narrow_access_offset(
> > > >                  off, size, size_default) * 8;
> > > >              if (shift && cnt + 1 >= ARRAY_SIZE(insn_buf)) {
> > > > @@ -12860,15 +12860,19 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> > > >                  insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
> > > >                                                  insn->dst_reg,
> > > >                                                  shift);
> > > > -                insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
> > > > -                                                (1 << size * 8) - 1);
> > > > +                if (size < target_size)
> > > > +                    insn_buf[cnt++] = BPF_ALU32_IMM(
> > > > +                        BPF_AND, insn->dst_reg,
> > > > +                        (1 << size * 8) - 1);
> > > >              } else {
> > > >                  if (shift)
> > > >                      insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
> > > >                                                      insn->dst_reg,
> > > >                                                      shift);
> > > > -                insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
> > > > -                                                (1ULL << size * 8) - 1);
> > > > +                if (size < target_size)
> > > > +                    insn_buf[cnt++] = BPF_ALU64_IMM(
> > > > +                        BPF_AND, insn->dst_reg,
> > > > +                        (1ULL << size * 8) - 1);
> > > >              }
> > > >          }
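For readers following along, the helper the patch leans on decides which
byte offset within the full-sized field a narrow load actually refers to.
Paraphrased from memory of include/linux/filter.h (so treat the exact
spelling as approximate), it looks like this:

    static inline u8
    bpf_ctx_narrow_access_offset(u32 off, u32 size, u32 size_default)
    {
        u8 access_off = off & (size_default - 1);

    #ifdef __LITTLE_ENDIAN
        return access_off;
    #else
        return size_default - (access_off + size);
    #endif
    }

The shift in the patch above is this return value times 8, which is why
on big-endian it is the low-offset half of a 32-bit-exposed field that
needs the right shift.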
> > > Thanks for patience. I'm coming back to this.
> > >
> > > This fix affects the 2-byte load from bpf_sk_lookup.remote_port.
> > > Dumping the xlated BPF code confirms it.
> > >
> > > On LE (x86-64) things look well.
> > >
> > > Before this patch:
> > >
> > > * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
> > >
> > >   0: (69) r2 = *(u16 *)(r1 +4)
> > >   1: (b7) r0 = 0
> > >   2: (95) exit
> > >
> > > * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
> > >
> > >   0: (69) r2 = *(u16 *)(r1 +4)
> > >   1: (b7) r0 = 0
> > >   2: (95) exit
> > >
> > > After this patch:
> > >
> > > * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
> > >
> > >   0: (69) r2 = *(u16 *)(r1 +4)
> > >   1: (b7) r0 = 0
> > >   2: (95) exit
> > >
> > > * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
> > >
> > >   0: (69) r2 = *(u16 *)(r1 +4)
> > >   1: (74) w2 >>= 16
> > >   2: (b7) r0 = 0
> > >   3: (95) exit
> > >
> > > Which works great because the JIT generates a zero-extended load
> > > movzwq:
> > >
> > > * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
> > >
> > >   bpf_prog_5e4fe3dbdcb18fd3:
> > >    0: nopl   0x0(%rax,%rax,1)
> > >    5: xchg   %ax,%ax
> > >    7: push   %rbp
> > >    8: mov    %rsp,%rbp
> > >    b: movzwq 0x4(%rdi),%rsi
> > >   10: xor    %eax,%eax
> > >   12: leave
> > >   13: ret
> > >
> > > * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
> > >
> > >   bpf_prog_4a6336c64a340b96:
> > >    0: nopl   0x0(%rax,%rax,1)
> > >    5: xchg   %ax,%ax
> > >    7: push   %rbp
> > >    8: mov    %rsp,%rbp
> > >    b: movzwq 0x4(%rdi),%rsi
> > >   10: shr    $0x10,%esi
> > >   13: xor    %eax,%eax
> > >   15: leave
> > >   16: ret
> > >
> > > Runtime checks for bpf_sk_lookup.remote_port load and the 2-bytes of
> > > zero padding following it, like below, pass with flying colors:
> > >
> > >   ok = ctx->remote_port == bpf_htons(8008);
> > >   if (!ok)
> > >           return SK_DROP;
> > >   ok = *((__u16 *)&ctx->remote_port + 1) == 0;
> > >   if (!ok)
> > >           return SK_DROP;
> > >
> > > (The above checks compile to half-word (2-byte) loads.)
> > >
> > > On BE (s390x) things look different:
> > >
> > > Before the patch:
> > >
> > > * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
> > >
> > >   0: (69) r2 = *(u16 *)(r1 +4)
> > >   1: (bc) w2 = w2
> > >   2: (b7) r0 = 0
> > >   3: (95) exit
> > >
> > > * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
> > >
> > >   0: (69) r2 = *(u16 *)(r1 +4)
> > >   1: (bc) w2 = w2
> > >   2: (b7) r0 = 0
> > >   3: (95) exit
> > >
> > > After the patch:
> > >
> > > * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
> > >
> > >   0: (69) r2 = *(u16 *)(r1 +4)
> > >   1: (bc) w2 = w2
> > >   2: (74) w2 >>= 16
> > >   3: (bc) w2 = w2
> > >   4: (b7) r0 = 0
> > >   5: (95) exit
> > >
> > > * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
> > >
> > >   0: (69) r2 = *(u16 *)(r1 +4)
> > >   1: (bc) w2 = w2
> > >   2: (b7) r0 = 0
> > >   3: (95) exit
> > >
> > > These compile to:
> > >
> > > * size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
> > >
> > >   bpf_prog_fdd58b8caca29f00:
> > >    0: j      0x0000000000000006
> > >    4: nopr
> > >    6: stmg   %r11,%r15,112(%r15)
> > >    c: la     %r13,64(%r15)
> > >   10: aghi   %r15,-96
> > >   14: llgh   %r3,4(%r2,%r0)
> > >   1a: srl    %r3,16
> > >   1e: llgfr  %r3,%r3
> > >   22: lgfi   %r14,0
> > >   28: lgr    %r2,%r14
> > >   2c: lmg    %r11,%r15,208(%r15)
> > >   32: br     %r14
> > >
> > > * size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
> > >
> > >   bpf_prog_5e3d8e92223c6841:
> > >    0: j      0x0000000000000006
> > >    4: nopr
> > >    6: stmg   %r11,%r15,112(%r15)
> > >    c: la     %r13,64(%r15)
> > >   10: aghi   %r15,-96
> > >   14: llgh   %r3,4(%r2,%r0)
> > >   1a: lgfi   %r14,0
> > >   20: lgr    %r2,%r14
> > >   24: lmg    %r11,%r15,208(%r15)
> > >   2a: br     %r14
> > >
> > > Now, we right shift the value when loading
> > >
> > >   *(u16 *)(r1 +36)
> > >
> > > which in C BPF is equivalent to
> > >
> > >   *((__u16 *)&ctx->remote_port + 0)
> > >
> > > due to how the shift is calculated by bpf_ctx_narrow_access_offset().
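Spelling that calculation out for remote_port (size=2, size_default=4,
offsets relative to the start of the 4-byte field, using the helper
sketched earlier):

    LE, offset 0: access_off = 0     -> shift = 0 * 8 = 0   (load as-is)
    LE, offset 2: access_off = 2     -> shift = 2 * 8 = 16  (w2 >>= 16)
    BE, offset 0: 4 - (0 + 2) = 2    -> shift = 2 * 8 = 16  (w2 >>= 16)
    BE, offset 2: 4 - (2 + 2) = 0    -> shift = 0 * 8 = 0   (load as-is)

This matches the xlated dumps above: on x86-64 the offset=2 load gets the
shift, on s390x it is the offset=0 load.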
> > Right, that's exactly the intention here. The way I see the situation
> > is: the ABI forces us to treat remote_port as a 32-bit field, even
> > though the updated header now says otherwise. And this:
> >
> >   unsigned int remote_port;
> >   unsigned short result = *(unsigned short *)&remote_port;
> >
> > should be the same as:
> >
> >   unsigned short result = remote_port >> 16;
> >
> > on big-endian. Note that this is inherently non-portable.
> >
> > > This makes the expected typical use-case
> > >
> > >   ctx->remote_port == bpf_htons(8008)
> > >
> > > fail on s390x because llgh (Load Logical Halfword (64<-16)) seems to
> > > lay out the data in the destination register so that it holds
> > > 0x0000_0000_0000_1f48.
> > >
> > > I don't know if that was the intention here, as it makes the BPF C
> > > code non-portable.
> > >
> > > WDYT?
> >
> > This depends on how we define the remote_port field. I would argue
> > that the definition from patch 2 - even though ugly - is the correct
> > one. It is consistent with both the little-endian (1f 48 00 00) and
> > big-endian (00 00 1f 48) ABIs.
> >
> > I don't think the current definition is correct, because it expects
> > 1f 48 00 00 on big-endian, and this is not the case. We can verify
> > this by taking 9a69e2^ and applying:
> >
> > --- a/tools/testing/selftests/bpf/progs/test_sk_lookup.c
> > +++ b/tools/testing/selftests/bpf/progs/test_sk_lookup.c
> > @@ -417,6 +417,8 @@ int ctx_narrow_access(struct bpf_sk_lookup *ctx)
> >          return SK_DROP;
> >      if (LSW(ctx->remote_port, 0) != SRC_PORT)
> >          return SK_DROP;
> > +    if (ctx->remote_port != SRC_PORT)
> > +        return SK_DROP;
> >
> >      /* Narrow loads from local_port field. Expect DST_PORT. */
> >      if (LSB(ctx->local_port, 0) != ((DST_PORT >> 0) & 0xff) ||
> >
> > Therefore, that
> >
> >   ctx->remote_port == bpf_htons(8008)
> >
> > fails without patch 2 is as expected.
>
> Consider this - today the below is true on both LE and BE, right?
>
>   *(u32 *)&ctx->remote_port == *(u16 *)&ctx->remote_port
>
> because the loads get converted to:
>
>   *(u16 *)&ctx_kern->sport == *(u16 *)&ctx_kern->sport
>
> IOW, today, because of the bug that you are fixing here, the data
> layout changes from the PoV of the BPF program depending on the load
> size.
>
> With 2-byte loads, without this patch, the data layout appears as:
>
>   struct bpf_sk_lookup {
>       ...
>       __be16 remote_port;
>       __be16 remote_port;
>       ...
>   }

I see, one can indeed argue that this is also a part of the ABI now.
So we're stuck between a rock and a hard place.

> While for 4-byte loads, it appears as in your 2nd patch:
>
>   struct bpf_sk_lookup {
>       ...
>   #if little-endian
>       __be16 remote_port;
>       __u16 :16; /* zero padding */
>   #elif big-endian
>       __u16 :16; /* zero padding */
>       __be16 remote_port;
>   #endif
>       ...
>   }
>
> Because of that I don't see how we could keep complete ABI
> compatibility and have just one definition of struct bpf_sk_lookup
> that reflects it. These are conflicting requirements.
>
> I'd bite the bullet for 4-byte loads, for the sake of having an
> endian-agnostic struct bpf_sk_lookup and struct bpf_sock definition
> in the uAPI header.
>
> The sacrifice here is that the access converter will have to keep
> rewriting 4-byte accesses to bpf_sk_lookup.remote_port and
> bpf_sock.dst_port in this unexpected, quirky manner.
>
> The expectation is that with time users will recompile their BPF
> progs against the updated bpf.h, and switch to 2-byte loads. That
> will make the quirk in the access converter dead code in time.
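For reference, the endian-agnostic definition being kept here is roughly
what the uapi header already has after 9a69e2 (abridged, neighbouring
fields omitted, comments approximate):

    struct bpf_sk_lookup {
        ...
        __be16 remote_port; /* network byte order */
        __u16 :16;          /* zero padding */
        ...
    };

With that definition the natural, portable access is a plain 2-byte load,
for example:

    if (ctx->remote_port != bpf_htons(8008))
        return SK_DROP;

which is exactly what the runtime checks earlier in this thread compile
to.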
> I don't have any better ideas. Sorry.
>
> [...]

I agree, let's go ahead with this solution. The only remaining problem
that I see is that the bug is in the common code, so it will also affect
fields that we add in the future. Can we either document this state of
things in a comment, or fix the bug and emulate the old behavior for the
affected fields?
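To make the first option concrete, the comment could say something along
these lines - purely illustrative wording and placement (e.g. next to the
narrow-load handling in convert_ctx_accesses() or next to the field in
the uapi header), not actual kernel text:

    /* bpf_sk_lookup.remote_port (and bpf_sock.dst_port) used to be
     * exposed as __u32 even though the backing kernel field is 16-bit.
     * For ABI compatibility, narrow loads must keep behaving as if the
     * field were still 32 bits wide: on big-endian the offset-2 half
     * holds the port and the offset-0 half reads as zero, on
     * little-endian the other way around. New fields must not rely on
     * this quirk; expose them with their real width and explicit zero
     * padding instead.
     */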