Re: BPF CO-RE and array fields in context struct

On 11/23/21 8:15 AM, YiFei Zhu wrote:
On Mon, Nov 22, 2021 at 4:24 PM Yonghong Song <yhs@xxxxxx> wrote:
On 11/22/21 12:44 PM, YiFei Zhu wrote:
On Mon, Nov 22, 2021 at 8:19 AM YiFei Zhu <zhuyifei@xxxxxxxxxx> wrote:

Hi

I've been investigating the use of BPF CO-RE. I discovered that if I
include vmlinux.h and have all structures annotated with
__attribute__((preserve_access_index)), including the context struct,
then a prog that accesses an array field in the context struct in a
particular way cannot pass the verifier.
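
For context: vmlinux.h as generated by bpftool applies this attribute to
every record type through a clang pragma rather than per-struct
annotations; from memory, its preamble looks roughly like this:

    /* rough sketch of the bpftool-generated vmlinux.h preamble; every
     * struct/union defined between push and pop gets
     * preserve_access_index applied */
    #ifndef BPF_NO_PRESERVE_ACCESS_INDEX
    #pragma clang attribute push (__attribute__((preserve_access_index)), apply_to = record)
    #endif
    /* ... kernel type definitions ... */
    #ifndef BPF_NO_PRESERVE_ACCESS_INDEX
    #pragma clang attribute pop
    #endif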

A bunch of manual reduction plus creduce gives me this output:

    struct bpf_sock_ops {
      int family;
      int remote_ip6[];
    } __attribute__((preserve_access_index));
    __attribute__((section("sockops"))) int b(struct bpf_sock_ops *d) {
      int a = d->family;
      int *c = d->remote_ip6;
      c[2] = a;
      return 0;
    }
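
For reproduction, the object was built and dumped with the usual
commands (file name made up; -g is needed so clang emits the BTF that
the CO-RE relocations rely on):

    clang -O2 -g -target bpf -c reduced.c -o reduced.o
    llvm-objdump -d reduced.o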

With Debian clang version 11.1.0-4+build1, this compiles to:

    0000000000000000 <b>:
           0: b7 02 00 00 04 00 00 00 r2 = 4
           1: bf 13 00 00 00 00 00 00 r3 = r1
           2: 0f 23 00 00 00 00 00 00 r3 += r2
           3: 61 11 00 00 00 00 00 00 r1 = *(u32 *)(r1 + 0)
           4: 63 13 08 00 00 00 00 00 *(u32 *)(r3 + 8) = r1
           5: b7 00 00 00 00 00 00 00 r0 = 0
           6: 95 00 00 00 00 00 00 00 exit

And the prog will be rejected with this verifier log:

    ; __attribute__((section("sockops"))) int b(struct bpf_sock_ops *d) {
    0: (b7) r2 = 32
    1: (bf) r3 = r1
    2: (0f) r3 += r2
    last_idx 2 first_idx 0
    regs=4 stack=0 before 1: (bf) r3 = r1
    regs=4 stack=0 before 0: (b7) r2 = 32
    ; int a = d->family;
    3: (61) r1 = *(u32 *)(r1 +20)
    ; c[2] = a;
    4: (63) *(u32 *)(r3 +8) = r1
    dereference of modified ctx ptr R3 off=32 disallowed
    processed 5 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

Looking at check_ctx_reg() and its callsite at check_mem_access() in
verifier.c, it seems that the verifier really does not like when the
context pointer has an offset, in this case the field offset of
d->remote_ip6.
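
Paraphrasing the check that fires here (a sketch of check_ctx_reg() in
kernel/bpf/verifier.c of this era, not an exact quote): any ctx pointer
that has picked up an offset is refused outright, because ctx accesses
are rewritten by the verifier and must sit at known, fixed offsets:

    /* paraphrased sketch; the real function also rejects variable
     * offsets tracked in the register's tnum state */
    static int check_ctx_reg(struct bpf_verifier_env *env,
                             const struct bpf_reg_state *reg, int regno)
    {
        if (reg->off) {
            verbose(env, "dereference of modified ctx ptr R%d off=%d disallowed\n",
                    regno, reg->off);
            return -EACCES;
        }
        return 0;
    }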

I thought this was just an issue with array fields, i.e. that field offset
relocations may have trouble expressing two accesses in one expression
(one struct member, one array element). However, further testing reveals
that this is not the case, because if I eliminate the local variables, the
error is gone:

    struct bpf_sock_ops {
      int family;
      int remote_ip6[];
    } __attribute__((preserve_access_index));
    __attribute__((section("sockops"))) int b(struct bpf_sock_ops *d) {
      d->remote_ip6[2] = d->family;
      return 0;
    }

is compiled to:

    0000000000000000 <b>:
           0: 61 12 00 00 00 00 00 00 r2 = *(u32 *)(r1 + 0)
           1: 63 21 0c 00 00 00 00 00 *(u32 *)(r1 + 12) = r2
           2: b7 00 00 00 00 00 00 00 r0 = 0
           3: 95 00 00 00 00 00 00 00 exit

and is loaded as:

    ; d->remote_ip6[2] = d->family;
    0: (61) r2 = *(u32 *)(r1 +20)
    ; d->remote_ip6[2] = d->family;
    1: (63) *(u32 *)(r1 +40) = r2
    invalid bpf_context access off=40 size=4

I believe this error is because d->remote_ip6 is read-only, so this
write is likely more a product of creduce. Still, we can see
that the CO-RE-adjusted offset of the array element from the context
pointer is correct: 32 to remote_ip6 plus 8 for the array index, so the
total offset is 40.

Also note that removal of __attribute__((preserve_access_index)) from
the first (miscompiled) program produces exactly the same bytecode as
this new program (with no locals).

What is going on here? Why does the access of an array in context in
this particular way cause it to generate code that would not pass the
verifier? Is it a bug in Clang/LLVM, or is it the verifier being too
strict?

Additionally, testing the latest LLVM main branch, this test case,
which does not touch array fields at all, fails, though it succeeded with
clang version 11.1.0:

    struct bpf_sock_ops {
      int op;
      int bpf_sock_ops_cb_flags;
    } __attribute__((preserve_access_index));
    enum { a, b } static (*c)() = (void *)9;
    enum d { e } f;
    enum d g;
    __attribute__((section("sockops"))) int h(struct bpf_sock_ops *i) {
      switch (i->op) {
      case a:
        f = g = c(i, i->bpf_sock_ops_cb_flags);
        break;
      case b:
        f = g = c(i, i->bpf_sock_ops_cb_flags);
      }
      return 0;
    }

This is another issue which has actually come up multiple times,
including on the bpf mailing list.

The following change should fix the issue:

   $ diff t1.c t1-good.c
--- t1.c        2021-11-22 16:00:13.915921544 -0800
+++ t1-good.c   2021-11-22 16:12:32.823710102 -0800
@@ -5,13 +5,14 @@
     enum { a, b } static (*c)() = (void *)9;
     enum d { e } f;
     enum d g;
+  #define __barrier asm volatile("" ::: "memory")
     __attribute__((section("sockops"))) int h(struct bpf_sock_ops *i) {
       switch (i->op) {
       case a:
-      f = g = c(i, i->bpf_sock_ops_cb_flags);
+      f = g = c(i, i->bpf_sock_ops_cb_flags); __barrier;
         break;
       case b:
-      f = g = c(i, i->bpf_sock_ops_cb_flags);
+      f = g = c(i, i->bpf_sock_ops_cb_flags); __barrier;
       }
       return 0;
     }

Basically add a compiler barrier at the end of case statements
to prevent common code sinking.
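
The barrier is the classic empty asm statement with a "memory" clobber;
it emits no instructions, but the compiler must assume it may read or
write any memory, so loads and stores cannot be moved across it:

    /* zero runtime cost; only constrains the optimizer: memory
     * accesses may not be reordered or sunk across this statement */
    #define __barrier asm volatile("" ::: "memory")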

In the above case, for the original code, the latest compiler did an
optimization like
       case a:
           tmp = reloc_offset(i->bpf_sock_ops_cb_flags);
       case b:
           tmp = reloc_offset(i->bpf_sock_ops_cb_flags);
     common:
       val = load r1, tmp
       ...

Note that reloc_offset is not sunk into the common code
due to its special internal representation.
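
For readers following along: as I understand it, under
preserve_access_index each field access is routed through clang's CO-RE
builtin, so at the source level the access above is roughly equivalent to:

    /* the builtin records a relocation for this exact use site
     * instead of emitting a fixed offset */
    int flags = *(int *)__builtin_preserve_access_index(&i->bpf_sock_ops_cb_flags);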

To avoid such code generation, add a compiler barrier at
the end of each case statement to prevent load sinking, in which case
we will have
      val = load r1, reloc_offset(...)
and the verifier will be happy with this.

The commit you listed below made a big change which may have enabled
the above compiler optimization; llvm11 may just not do
the code-sinking optimization in this particular instance.

I guess the compiler could still prevent this. But since it does
not know whether a memory access is for the context or not, doing
so might hurt performance. Anyway, this has come up multiple
times, internally and also on the mailing list. We will take a further
look.

I see, thanks for the explanations. What is interesting is that prior
to that commit, reloc_offset(i->bpf_sock_ops_cb_flags) was generated
only once. The disassembly matches that of
     case a:
     case b:
           tmp = reloc_offset(i->bpf_sock_ops_cb_flags);
           val = load r1, tmp

Whereas with the compiler barriers, both compilers generate (no common code):

   0000000000000000 <h>:
          0: 61 12 00 00 00 00 00 00 r2 = *(u32 *)(r1 + 0)
          1: 15 02 0a 00 01 00 00 00 if r2 == 1 goto +10 <LBB0_3>
          2: 55 02 11 00 00 00 00 00 if r2 != 0 goto +17 <LBB0_4>
          3: 61 12 04 00 00 00 00 00 r2 = *(u32 *)(r1 + 4)
          4: 85 00 00 00 09 00 00 00 call 9
          5: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
          7: 63 01 00 00 00 00 00 00 *(u32 *)(r1 + 0) = r0
          8: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
         10: 63 01 00 00 00 00 00 00 *(u32 *)(r1 + 0) = r0
         11: 05 00 08 00 00 00 00 00 goto +8 <LBB0_4>

   0000000000000060 <LBB0_3>:
         12: 61 12 04 00 00 00 00 00 r2 = *(u32 *)(r1 + 4)
         13: 85 00 00 00 09 00 00 00 call 9
         14: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
         16: 63 01 00 00 00 00 00 00 *(u32 *)(r1 + 0) = r0
         17: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
         19: 63 01 00 00 00 00 00 00 *(u32 *)(r1 + 0) = r0

   00000000000000a0 <LBB0_4>:
         20: b7 00 00 00 00 00 00 00 r0 = 0
         21: 95 00 00 00 00 00 00 00 exit

Did the linked commit create the special internal representation so
that they cannot be sunk into common code, or is there some other issue
going on, or am I misunderstanding something?

Yes, the linked commit added a special builtin with an additional
ever-increasing argument to prevent reloc_offset from sinking.
This is to ensure the relocation-related code won't be separated into
different basic blocks. But it is not able to prevent the issue
you described above.


Thanks
YiFei Zhu
The bad code generated by the latest LLVM:

    0000000000000000 <h>:
           0: 61 12 00 00 00 00 00 00 r2 = *(u32 *)(r1 + 0)
           1: 15 02 01 00 01 00 00 00 if r2 == 1 goto +1 <LBB0_2>
           2: 55 02 0b 00 00 00 00 00 if r2 != 0 goto +11 <LBB0_3>

    0000000000000018 <LBB0_2>:
           3: b7 03 00 00 04 00 00 00 r3 = 4
           4: bf 12 00 00 00 00 00 00 r2 = r1
           5: 0f 32 00 00 00 00 00 00 r2 += r3
           6: 61 22 00 00 00 00 00 00 r2 = *(u32 *)(r2 + 0)
           7: 85 00 00 00 09 00 00 00 call 9
           8: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
          10: 63 01 00 00 00 00 00 00 *(u32 *)(r1 + 0) = r0
          11: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
          13: 63 01 00 00 00 00 00 00 *(u32 *)(r1 + 0) = r0

    0000000000000070 <LBB0_3>:
          14: b7 00 00 00 00 00 00 00 r0 = 0
          15: 95 00 00 00 00 00 00 00 exit

The good code generated by LLVM 11.1.0:

    0000000000000000 <h>:
           0: 61 12 00 00 00 00 00 00 r2 = *(u32 *)(r1 + 0)
           1: 25 02 08 00 01 00 00 00 if r2 > 1 goto +8 <LBB0_2>
           2: 61 12 04 00 00 00 00 00 r2 = *(u32 *)(r1 + 4)
           3: 85 00 00 00 09 00 00 00 call 9
           4: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
           6: 63 01 00 00 00 00 00 00 *(u32 *)(r1 + 0) = r0
           7: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
           9: 63 01 00 00 00 00 00 00 *(u32 *)(r1 + 0) = r0

    0000000000000050 <LBB0_2>:
          10: b7 00 00 00 00 00 00 00 r0 = 0
          11: 95 00 00 00 00 00 00 00 exit

A bisect points me to this commit in LLVM as the first bad commit:

    commit 54d9f743c8b0f501288119123cf1828bf7ade69c
    Author: Yonghong Song <yhs@xxxxxx>
    Date:   Wed Sep 2 22:56:41 2020 -0700

        BPF: move AbstractMemberAccess and PreserveDIType passes to
EP_EarlyAsPossible

        Move abstractMemberAccess and PreserveDIType passes as early as
        possible, right after clang code generation.

    [...]

        Differential Revision: https://reviews.llvm.org/D87153

YiFei Zhu



