Re: Question: CO-RE-enabled PT_REGS macros give strange results

Eduard Zingerman <eddyz87@xxxxxxxxx> · Wed, 26 Jul 2023 23:03:35 +0300

On Wed, 2023-07-26 at 14:46 +0100, Alan Maguire wrote:
> On 26/07/2023 01:03, Eduard Zingerman wrote:
> > On Tue, 2023-07-25 at 15:04 +0100, Alan Maguire wrote:
> > > On 25/07/2023 00:00, Alan Maguire wrote:
> > > > On 24/07/2023 16:04, Timofei Pushkin wrote:
> > > > > On Mon, Jul 24, 2023 at 3:36 PM Alan Maguire <alan.maguire@xxxxxxxxxx> wrote:
> > > > > > 
> > > > > > On 24/07/2023 11:32, Timofei Pushkin wrote:
> > > > > > > Dear BPF community,
> > > > > > > 
> > > > > > > I'm developing a perf_event BPF program which reads some register
> > > > > > > values (frame and instruction pointers in particular) from the context
> > > > > > > provided to it. I found that CO-RE-enabled PT_REGS macros give results
> > > > > > > different from the results of the usual PT_REGS  macros. I run the
> > > > > > > program on the same system I compiled it on, and so I cannot
> > > > > > > understand why the results differ and which ones should I use?
> > > > > > > 
> > > > > > > From my tests, the results of the usual macros are the correct ones
> > > > > > > (e.g. I can symbolize the instruction pointers I get this way), but
> > > > > > > since I try to follow the CO-RE principle, it seems like I should be
> > > > > > > using the CO-RE-enabled variants instead.
> > > > > > > 
> > > > > > > I did some experiments and found out that it is the
> > > > > > > bpf_probe_read_kernel part of the CO-RE-enabled PT_REGS macros that
> > > > > > > change the results and not __builtin_preserve_access_index. But I
> > > > > > > still don't get why exactly it changes the results.
> > > > > > > 
> > > > > > 
> > > > > > Can you provide the exact usage of the BPF CO-RE macros that isn't
> > > > > > working, and the equivalent non-CO-RE version that is? Also if you
> > > > > 
> > > > > As a minimal example, I wrote the following little BPF program which
> > > > > prints instruction pointers obtained with non-CO-RE and CO-RE macros:
> > > > > 
> > > > > volatile const pid_t target_pid;
> > > > > 
> > > > > SEC("perf_event")
> > > > > int do_test(struct bpf_perf_event_data *ctx) {
> > > > >     pid_t pid = bpf_get_current_pid_tgid();
> > > > >     if (pid != target_pid) return 0;
> > > > > 
> > > > >     unsigned long p = PT_REGS_IP(&ctx->regs);
> > > > >     unsigned long p_core = PT_REGS_IP_CORE(&ctx->regs);
> > > > >     bpf_printk("non-CO-RE: %lx, CO-RE: %lx", p, p_core);
> > > > > 
> > > > >     return 0;
> > > > > }
> > > > > 
> > > > > From user space, I set the target PID and attach the program to CPU
> > > > > clock perf events (error checking and cleanup omitted for brevity):
> > > > > 
> > > > > int main(int argc, char *argv[]) {
> > > > >     // Load the program also setting the target PID
> > > > >     struct test_program_bpf *skel = test_program_bpf__open();
> > > > >     skel->rodata->target_pid = (pid_t) strtol(argv[1], NULL, 10);
> > > > >     test_program_bpf__load(skel);
> > > > > 
> > > > >     // Attach to perf events
> > > > >     struct perf_event_attr attr = {
> > > > >         .type = PERF_TYPE_SOFTWARE,
> > > > >         .size = sizeof(struct perf_event_attr),
> > > > >         .config = PERF_COUNT_SW_CPU_CLOCK,
> > > > >         .sample_freq = 1,
> > > > >         .freq = true
> > > > >     };
> > > > >     for (int cpu_i = 0; cpu_i < libbpf_num_possible_cpus(); cpu_i++) {
> > > > >         int perf_fd = syscall(SYS_perf_event_open, &attr, -1, cpu_i, -1, 0);
> > > > >         bpf_program__attach_perf_event(skel->progs.do_test, perf_fd);
> > > > >     }
> > > > > 
> > > > >     // Wait for Ctrl-C
> > > > >     pause();
> > > > >     return 0;
> > > > > }
> > > > > 
> > > > > As an experiment, I launched a simple C program with an endless loop
> > > > > in main and started the BPF program above with its target PID set to
> > > > > the PID of this simple C program. Then by checking the virtual memory
> > > > > mapped for the C program (with "cat /proc/<PID>/maps"), I found out
> > > > > that its .text section got mapped into 55ca2577b000-55ca2577c000
> > > > > address space. When I checked the output of the BPF program, I got
> > > > > "non-CO-RE: 55ca2577b131, CO-RE: ffffa58810527e48". As you can see,
> > > > > the non-CO-RE result maps into the .text section of the launched C
> > > > > program (as it should since this is the value of the instruction
> > > > > pointer), while the CO-RE result does not.
> > > > > 
> > > > > Alternatively, if I replace PT_REGS_IP and PT_REGS_IP_CORE with the
> > > > > equivalents for the stack pointer (PT_REGS_SP and PT_REGS_SP_CORE), I
> > > > > get results that correspond to the stack address space from the
> > > > > non-CO-RE macro, but I always get 0 from the CO-RE macro.
> > > > > 
> > > > > > can provide details on the platform you're running on that will
> > > > > > help narrow down the issue. Thanks!
> > > > > 
> > > > > Sure. I'm running Ubuntu 22.04.1, kernel version 5.19.0-46-generic,
> > > > > the architecture is x86_64, clang 14.0.0 is used to compile BPF
> > > > > programs with flags -g -O2 -D__TARGET_ARCH_x86.
> > > > > 
> > > > 
> > > > Thanks for the additional details! I've reproduced this on
> > > > bpf-next with LLVM 15; I'm seeing the same issues with the CO-RE
> > > > macros, and with BPF_CORE_READ(). However with extra libbpf debugging
> > > > I do see that we pick up the right type id/index for the ip field in
> > > > pt_regs:
> > > > 
> > > > libbpf: prog 'do_test': relo #4: matching candidate #0 <byte_off> [216]
> > > > struct pt_regs.ip (0:16 @ offset 128)
> > > > 
> > > > One thing I noticed - perhaps this will ring some bells for someone -
> > > > if I use __builtin_preserve_access_index() I get the same (correct)
> > > > value for ip as is retrieved with PT_REGS_IP():
> > > > 
> > > >     __builtin_preserve_access_index(({
> > > >         p_core = ctx->regs.ip;
> > > >     }));
> > > > 
> > > > I'll check with latest LLVM to see if the issue persists there.
> > > > 
> > > 
> > > 
> > > The problem occurs with latest bpf-next + latest LLVM too. Perf event
> > > programs fix up context accesses to the "struct bpf_perf_event_data *"
> > > context, so accessing ctx->regs in your program becomes accessing the
> > > "struct bpf_perf_event_data_kern *" regs, which is a pointer to
> > > struct pt_regs. So I _think_ that's why the
> > > 
> > >     __builtin_preserve_access_index(({
> > >         p_core = ctx->regs.ip;
> > >     }));
> > > 
> > > 
> > > ...works; ctx->regs is fixed up to point at the right place, then
> > > CO-RE does its thing with the results. Contrast this with
> > > 
> > > bpf_probe_read_kernel(&ip, sizeof(ip), &ctx->regs.ip);
> > > 
> > > In the latter case, the fixups don't seem to happen and we get a
> > > bogus address which appears to be consistently 218 bytes after the ctx
> > > pointer. I've confirmed that a basic bpf_probe_read_kernel()
> > > exposes the issue (and gives the same wrong address as a CO-RE-wrapped
> > > bpf_probe_read_kernel()).
> > > 
> > > I tried some permutations like defining
> > > 
> > >   struct pt_regs *regs = &ctx->regs;
> > > 
> > > ...to see if that helps, but I think in that case the accesses aren't
> > > caught by the verifier because we use the & operator on the ctx->regs.
> > > 
> > > Not sure how smart the verifier can be about context accesses like this;
> > > can someone who understands that code better than me take a look at this?
> > 
> > Hi Alan,
> > 
> > Your analysis is correct: verifier applies rewrites to instructions
> > that read/write from/to certain context fields, including
> > `struct bpf_perf_event_data`.
> > 
> > This is done by function verifier.c:convert_ctx_accesses().
> > This function handles BPF_LDX, BPF_STX and BPF_ST instructions, but it
> > does not handle calls to helpers like bpf_probe_read_kernel().
> > 
> > So, when code generated for PT_REGS_IP(&ctx->regs) is processed, the
> > correct access sequence is inserted by function
> > bpf_trace.c:pe_prog_convert_ctx_access() (see below).
> > 
> > But code generated for `PT_REGS_IP_CORE(&ctx->regs)` is not modified.
> > 
> 
> Ah, makes sense. Would you consider it a bug that helper parameters
> don't get context conversions applied, or are there additional
> complexities here that mean that's not doable? (I'm wondering if
> we should fix versus document this?). I would have thought the
> only difference is the destination register, but the verifier is
> a mysterious land to me..

Sorry for delay, had to read through implementation.

Verifier already reports an error when pointer to a context is passed
to helper function if context is subject to convert_ctx_accesses().
This is done in function verifier.c:check_helper_mem_access()
(see `case PTR_TO_CTX` at the end of the function).

For example, the following program is rejected:

  struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1);
    __type(key, void *);
    __type(value, int);
  } map SEC(".maps");

  SEC("perf_event")
  int do_test(struct bpf_perf_event_data *ctx) {
    void *p = bpf_map_lookup_elem(&map, &ctx->regs);
    return p ? 0 : 1;
  }

Here is the log:

  ...
  0: R1=ctx(off=0,imm=0) R10=fp0
  ; int do_test(struct bpf_perf_event_data *ctx) {
  0: (bf) r2 = r1                       ; R1=ctx(off=0,imm=0) R2_w=ctx(off=0,imm=0)
  1: (b7) r1 = 0                        ; R1_w=0
  2: (0f) r2 += r1
  mark_precise: frame0: last_idx 2 first_idx 0 subseq_idx -1 
  mark_precise: frame0: regs=r1 stack= before 1: (b7) r1 = 0
  3: R1_w=0 R2_w=ctx(off=0,imm=0)
  ; void *p = bpf_map_lookup_elem(&map, &ctx->regs);
  3: (18) r1 = 0xffff888102c22000       ; R1_w=map_ptr(off=0,ks=8,vs=4,imm=0)
  5: (85) call bpf_map_lookup_elem#1
  R2 type=ctx expected=fp, pkt, pkt_meta, map_key, map_value, mem, ringbuf_mem, buf, trusted_ptr_
  ^^^^^^
     error message (a bit cryptic)
  ...

However, this error is checked conditionally depending on the
proto defined for the helper function. For example, here is the
proto for bpf_map_lookup_elem():

  const struct bpf_func_proto bpf_map_lookup_elem_proto = {
    .func       = bpf_map_lookup_elem,
    .gpl_only   = false,
    .pkt_access = true,
    .ret_type   = RET_PTR_TO_MAP_VALUE_OR_NULL,
    .arg1_type  = ARG_CONST_MAP_PTR,    // <-- check_helper_mem_access() is done
    .arg2_type  = ARG_PTR_TO_MAP_KEY,   // <-- check_helper_mem_access() is done
  };

And here is prototype for bpf_probe_read_kernel():

  const struct bpf_func_proto bpf_probe_read_kernel_proto = {
    .func       = bpf_probe_read_kernel,
    .gpl_only   = true,
    .ret_type   = RET_INTEGER,
    .arg1_type  = ARG_PTR_TO_UNINIT_MEM,   // <-- check_helper_mem_access() is done
    .arg2_type  = ARG_CONST_SIZE_OR_ZERO,  // <-- check_helper_mem_access() is done for
                                           //     the previous argument (a bit complicated)
    .arg3_type  = ARG_ANYTHING,            // <-- check_helper_mem_access() is *NOT* done
  };

So, the way bpf_probe_read_kernel() is defined for verifier it does
not check that last argument might point to "fake" location.

I think this should be fixed, but would be interesting to hear what
people on the mailing list think.

> > It looks like `PT_REGS_IP_CORE` macro should not be defined through
> > bpf_probe_read_kernel(). I'll dig through commit history tomorrow to
> > understand why is it defined like that now.
> >  help
> 
> If I recall the rationale was to allow the macros to work for both
> BPF programs that can do direct dereference (fentry, fexit, tp_btf etc)
> and for kprobe-style that need to use bpf_probe_read_kernel().
> Not sure if it would be worth having variants that are purely
> dereference-based, since we can just use PT_REGS_IP() due to
> the __builtin_preserve_access_index attributes applied in vmlinux.h.

Sorry, need a bit more time, thanks for the context.

> 
> Thanks!
> 
> Alan
> 
> > Thanks,
> > Eduard
> > 
> > ---
> > Below is annotated example, inpatient reader might skip it
> > 
> > For the following test program:
> > 
> >     #include "vmlinux.h"
> >     ...
> >     SEC("perf_event")
> >     int do_test(struct bpf_perf_event_data *ctx) {
> >       unsigned long p = PT_REGS_IP(&ctx->regs);
> >       unsigned long p_core = PT_REGS_IP_CORE(&ctx->regs);
> >       bpf_printk("non-CO-RE: %lx, CO-RE: %lx", p, p_core);
> >       return 0;
> >     }
> > 
> > Generated BPF code looks as follows:
> > 
> >     $ llvm-objdump --no-show-raw-insn -rd bpf.linked.o 
> >     ...
> >     0000000000000000 <do_test>:
> >     # Third argument for bpf_probe_read_kernel: offset of bpf_perf_event_data::regs.ip
> >            0:   r2 = 0x80
> >             0000000000000000:  CO-RE <byte_off> [2] struct bpf_perf_event_data::regs.ip (0:0:16)
> >            1:   r3 = r1
> >            2:   r3 += r2
> >     # The "non CO-RE" version of PT_REGS_IP is, in fact, CO-RE
> >     # because `struct bpf_perf_event_data` has preserve_access_index
> >     # tag in the vmlinux.h.
> >     # Here the regs.ip is stored in r6 to be used after the call
> >     # to bpf_probe_read_kernel() (from PT_REGS_IP_CORE).
> >            3:   r6 = *(u64 *)(r1 + 0x80)
> >             0000000000000018:  CO-RE <byte_off> [2] struct bpf_perf_event_data::regs.ip (0:0:16)
> >     # First argument for bpf_probe_read_kernel: a place on stack to put read result to.
> >            4:   r1 = r10
> >            5:   r1 += -0x8
> >     # Second argument for bpf_probe_read_kernel: the size of the field to read.
> >            6:   w2 = 0x8
> >     # Call to bpf_probe_read_kernel()
> >            7:   call 0x71
> >     # Fourth parameter of bpf_printk: p_core read from stack
> >     # (was written by call to bpf_probe_read_kernel)
> >            8:   r4 = *(u64 *)(r10 - 0x8)
> >     # First parameter of bpf_printk: control string
> >            9:   r1 = 0x0 ll
> >             0000000000000048:  R_BPF_64_64  .rodata
> >     # Second parameter of bpf_printk: size of the control string
> >           11:   w2 = 0x1b
> >     # Third parameter of bpf_printk: p (see addr 3)
> >           12:   r3 = r6
> >     # Call to bpf_printk
> >           13:   call 0x6
> >     ;   return 0;
> >           14:   w0 = 0x0
> >           15:   exit
> >     
> > I get the following BPF after all verifier rewrites are applied
> > (including verifier.c:convert_ctx_accesses()):
> > 
> >     # ./tools/bpf/bpftool/bpftool prog dump xlated id 114
> >     int do_test(struct bpf_perf_event_data * ctx):
> >     ; int do_test(struct bpf_perf_event_data *ctx) {
> >        0: (b7) r2 = 128                  | CO-RE replacement, 128 is a valid offset for
> >                                          | bpf_perf_event_data::regs.ip in my kernel
> >        1: (bf) r3 = r1
> >        2: (0f) r3 += r2
> > 
> >        3: (79) r6 = *(u64 *)(r1 +0)      | This is an expantion of the 
> >        4: (79) r6 = *(u64 *)(r6 +128)    |   r6 = *(u64 *)(r1 + 0x80)
> >        5: (bf) r1 = r10                  | Created by bpf_trace.c:pe_prog_convert_ctx_access()
> > 
> >        6: (07) r1 += -8
> >        7: (b4) w2 = 8
> >        8: (85) call bpf_probe_read_kernel#-91984
> >        9: (79) r4 = *(u64 *)(r10 -8)
> >       10: (18) r1 = map[id:59][0]+0
> >       12: (b4) w2 = 27
> >       13: (bf) r3 = r6
> >       14: (85) call bpf_trace_printk#-85520
> >       15: (b4) w0 = 0
> >       16: (95) exit
> >     
> >