Re: [PATCH v3 bpf-next 2/3] selftests/bpf: add variable-length data concatenation pattern test

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 23 Jun 2020 17:53:29 -0700

On Tue, Jun 23, 2020 at 5:25 PM Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 6/23/20 4:25 PM, Alexei Starovoitov wrote:
> > On Tue, Jun 23, 2020 at 11:15:58PM +0200, Daniel Borkmann wrote:
> >> On 6/23/20 10:52 PM, Andrii Nakryiko wrote:
> >>> On Tue, Jun 23, 2020 at 1:39 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>>> On 6/23/20 5:22 AM, Andrii Nakryiko wrote:
> >>>>> Add selftest that validates variable-length data reading and concatentation
> >>>>> with one big shared data array. This is a common pattern in production use for
> >>>>> monitoring and tracing applications, that potentially can read a lot of data,
> >>>>> but overall read much less. Such pattern allows to determine precisely what
> >>>>> amount of data needs to be sent over perfbuf/ringbuf and maximize efficiency.
> >>>>>
> >>>>> Signed-off-by: Andrii Nakryiko <andriin@xxxxxx>
> >>>>
> >>>> Currently getting the below errors on these tests. My last clang/llvm git build
> >>>> is on 4676cf444ea2 ("[Clang] Skip adding begin source location for PragmaLoopHint'd
> >>>> loop when[...]"):
> >>>
> >>> Yeah, you need 02553b91da5d ("bpf: bpf_probe_read_kernel_str() has to
> >>> return amount of data read on success") from bpf tree.
> >>
> >> Fair point, it's in net- but not yet in net-next tree, so bpf-next sync needs
> >> to wait.
> >>
> >>> I'm eagerly awaiting bpf being merged into bpf-next :)
> >>
> >> I'll cherry-pick 02553b91da5d locally for testing and if it passes I'll push
> >> these out.
> >
> > I've merged the bpf_probe_read_kernel_str() fix into bpf-next and 3 extra commits
> > prior to that one so that sha of the bpf_probe_read_kernel_str() fix (02553b91da5de)
> > is exactly the same in bpf/net/linus/bpf-next. I think that shouldn't cause
> > issue during bpf-next pull into net-next and later merge with Linus's tree.
> > Crossing fingers, since we're doing this experiment for the first time.
> >
> > Daniel pushed these 3 commits as well.
> > Now varlen and kernel_reloc tests are good, but we have a different issue :(
> > ./test_progs-no_alu32 -t get_stack_raw_tp
> > is now failing, but for a different reason.
> >
> > 52: (85) call bpf_get_stack#67
> > 53: (bf) r8 = r0
> > 54: (bf) r1 = r8
> > 55: (67) r1 <<= 32
> > 56: (c7) r1 s>>= 32
> > ; if (usize < 0)
> > 57: (c5) if r1 s< 0x0 goto pc+26
> >   R0=inv(id=0,smax_value=800) R1_w=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8_w=inv(id=0,smax_value=800) R9=inv800 R10=fp0 fp-8=mmmm????
> > ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> > 58: (1f) r9 -= r8
> > ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> > 59: (bf) r2 = r7
> > 60: (0f) r2 += r1
> > regs=1 stack=0 before 52: (85) call bpf_get_stack#67
> > ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> > 61: (bf) r1 = r6
> > 62: (bf) r3 = r9
> > 63: (b7) r4 = 0
> > 64: (85) call bpf_get_stack#67
> >   R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0) R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=800,var_off=(0x0; 0x3ff),s32_max_value=1023,u32_max_value=1023) R3_w=inv(id=0,umax_value=9223372036854776608) R4_w=inv0 R6=ctx(id=0?
> > R3 unbounded memory access, use 'var &= const' or 'if (var < const)'
> >
> > In the C code it was this:
> >          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> >          if (usize < 0)
> >                  return 0;
> >
> >          ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> >          if (ksize < 0)
> >                  return 0;
> >
> > We used to have problem with pointer arith in R2.
> > Now it's a problem with two integers in R3.
> > 'if (usize < 0)' is comparing R1 and makes it [0,800], but R8 stays [-inf,800].
> > Both registers represent the same 'usize' variable.
> > Then R9 -= R8 is doing 800 - [-inf, 800]
> > so the result of "max_len - usize" looks unbounded to the verifier while
> > it's obvious in C code that "max_len - usize" should be [0, 800].
> >
> > The following diff 'fixes' the issue for no_alu32:
> > diff --git a/tools/testing/selftests/bpf/progs/test_get_stack_rawtp.c b/tools/testing/selftests/bpf/progs/test_get_stack_rawtp.c
> > index 29817a703984..93058136d608 100644
> > --- a/tools/testing/selftests/bpf/progs/test_get_stack_rawtp.c
> > +++ b/tools/testing/selftests/bpf/progs/test_get_stack_rawtp.c
> > @@ -2,6 +2,7 @@
> >
> >   #include <linux/bpf.h>
> >   #include <bpf/bpf_helpers.h>
> > +#define var_barrier(a) asm volatile ("" : "=r"(a) : "0"(a))
> >
> >   /* Permit pretty deep stack traces */
> >   #define MAX_STACK_RAWTP 100
> > @@ -84,10 +85,12 @@ int bpf_prog1(void *ctx)
> >                  return 0;
> >
> >          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> > +       var_barrier(usize);
> >          if (usize < 0)
> >                  return 0;
> >
> >          ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> > +       var_barrier(ksize);
> >          if (ksize < 0)
> >                  return 0;
> >
> > But it breaks alu32 case.
> >
> > I'm using llvm 11 fwiw.
> >
> > Long term Yonghong is working on llvm support to emit this kind
> > of workarounds automatically.
> > I'm still thinking what to do next. Ideas?
>

Funny enough, Alexei's fix didn't fix even no_alu32 case for me. Also
have one of the latest clang 11...

> The following source change will make both alu32 and non-alu32 happy:
>
>   SEC("raw_tracepoint/sys_enter")
>   int bpf_prog1(void *ctx)
>   {
> -       int max_len, max_buildid_len, usize, ksize, total_size;
> +       int max_len, max_buildid_len, total_size;
> +       long usize, ksize;

This does fix it, both alu32 and no-alu32 pass.

>          struct stack_trace_t *data;
>          void *raw_data;
>          __u32 key = 0;
>
> I have not checked the reason why it works. Mostly this confirms to
> the function signature so compiler generates more friendly code.

Yes, it's due to the compiler not doing all the casting/bit shifting.
Just straightforward use of a single register consistently across
conditional jump and offset calculations.