On 26/07/2023 01:03, Eduard Zingerman wrote:
> On Tue, 2023-07-25 at 15:04 +0100, Alan Maguire wrote:
>> On 25/07/2023 00:00, Alan Maguire wrote:
>>> On 24/07/2023 16:04, Timofei Pushkin wrote:
>>>> On Mon, Jul 24, 2023 at 3:36 PM Alan Maguire <alan.maguire@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 24/07/2023 11:32, Timofei Pushkin wrote:
>>>>>> Dear BPF community,
>>>>>>
>>>>>> I'm developing a perf_event BPF program which reads some register
>>>>>> values (frame and instruction pointers in particular) from the context
>>>>>> provided to it. I found that the CO-RE-enabled PT_REGS macros give
>>>>>> results different from those of the usual PT_REGS macros. I run the
>>>>>> program on the same system I compiled it on, so I cannot
>>>>>> understand why the results differ and which ones I should use.
>>>>>>
>>>>>> From my tests, the results of the usual macros are the correct ones
>>>>>> (e.g. I can symbolize the instruction pointers I get this way), but
>>>>>> since I try to follow the CO-RE principle, it seems like I should be
>>>>>> using the CO-RE-enabled variants instead.
>>>>>>
>>>>>> I did some experiments and found that it is the
>>>>>> bpf_probe_read_kernel part of the CO-RE-enabled PT_REGS macros that
>>>>>> changes the results, not __builtin_preserve_access_index. But I
>>>>>> still don't understand why exactly it changes the results.
>>>>>>
>>>>>
>>>>> Can you provide the exact usage of the BPF CO-RE macros that isn't
>>>>> working, and the equivalent non-CO-RE version that is? Also if you
>>>>
>>>> As a minimal example, I wrote the following little BPF program which
>>>> prints instruction pointers obtained with the non-CO-RE and CO-RE macros:
>>>>
>>>> volatile const pid_t target_pid;
>>>>
>>>> SEC("perf_event")
>>>> int do_test(struct bpf_perf_event_data *ctx)
>>>> {
>>>>     pid_t pid = bpf_get_current_pid_tgid();
>>>>     if (pid != target_pid)
>>>>         return 0;
>>>>
>>>>     unsigned long p = PT_REGS_IP(&ctx->regs);
>>>>     unsigned long p_core = PT_REGS_IP_CORE(&ctx->regs);
>>>>     bpf_printk("non-CO-RE: %lx, CO-RE: %lx", p, p_core);
>>>>
>>>>     return 0;
>>>> }
>>>>
>>>> From user space, I set the target PID and attach the program to CPU
>>>> clock perf events (error checking and cleanup omitted for brevity):
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>     // Load the program, also setting the target PID
>>>>     struct test_program_bpf *skel = test_program_bpf__open();
>>>>     skel->rodata->target_pid = (pid_t) strtol(argv[1], NULL, 10);
>>>>     test_program_bpf__load(skel);
>>>>
>>>>     // Attach to perf events
>>>>     struct perf_event_attr attr = {
>>>>         .type = PERF_TYPE_SOFTWARE,
>>>>         .size = sizeof(struct perf_event_attr),
>>>>         .config = PERF_COUNT_SW_CPU_CLOCK,
>>>>         .sample_freq = 1,
>>>>         .freq = true
>>>>     };
>>>>     for (int cpu_i = 0; cpu_i < libbpf_num_possible_cpus(); cpu_i++) {
>>>>         int perf_fd = syscall(SYS_perf_event_open, &attr, -1, cpu_i, -1, 0);
>>>>         bpf_program__attach_perf_event(skel->progs.do_test, perf_fd);
>>>>     }
>>>>
>>>>     // Wait for Ctrl-C
>>>>     pause();
>>>>     return 0;
>>>> }
>>>>
>>>> As an experiment, I launched a simple C program with an endless loop
>>>> in main and started the BPF program above with its target PID set to
>>>> the PID of this simple C program. Then by checking the virtual memory
>>>> mapped for the C program (with "cat /proc/<PID>/maps"), I found
>>>> that its .text section got mapped into the 55ca2577b000-55ca2577c000
>>>> address range.
>>>> When I checked the output of the BPF program, I got
>>>> "non-CO-RE: 55ca2577b131, CO-RE: ffffa58810527e48". As you can see,
>>>> the non-CO-RE result falls within the .text section of the launched C
>>>> program (as it should, since this is the value of the instruction
>>>> pointer), while the CO-RE result does not.
>>>>
>>>> Alternatively, if I replace PT_REGS_IP and PT_REGS_IP_CORE with the
>>>> equivalents for the stack pointer (PT_REGS_SP and PT_REGS_SP_CORE), I
>>>> get results that correspond to the stack address space from the
>>>> non-CO-RE macro, but I always get 0 from the CO-RE macro.
>>>>
>>>>> can provide details on the platform you're running on, that will
>>>>> help narrow down the issue. Thanks!
>>>>
>>>> Sure. I'm running Ubuntu 22.04.1, kernel version 5.19.0-46-generic,
>>>> the architecture is x86_64, and clang 14.0.0 is used to compile BPF
>>>> programs with flags -g -O2 -D__TARGET_ARCH_x86.
>>>>
>>>
>>> Thanks for the additional details! I've reproduced this on
>>> bpf-next with LLVM 15; I'm seeing the same issues with the CO-RE
>>> macros, and with BPF_CORE_READ(). However, with extra libbpf debugging
>>> I do see that we pick up the right type id/index for the ip field in
>>> pt_regs:
>>>
>>> libbpf: prog 'do_test': relo #4: matching candidate #0 <byte_off> [216]
>>> struct pt_regs.ip (0:16 @ offset 128)
>>>
>>> One thing I noticed - perhaps this will ring some bells for someone -
>>> if I use __builtin_preserve_access_index() I get the same (correct)
>>> value for ip as is retrieved with PT_REGS_IP():
>>>
>>> __builtin_preserve_access_index(({
>>>     p_core = ctx->regs.ip;
>>> }));
>>>
>>> I'll check with the latest LLVM to see if the issue persists there.
>>>
>>
>>
>> The problem occurs with latest bpf-next + latest LLVM too.
>> Perf event
>> programs fix up context accesses to the "struct bpf_perf_event_data *"
>> context, so accessing ctx->regs in your program becomes an access via the
>> "struct bpf_perf_event_data_kern *" regs field, which is a pointer to
>> struct pt_regs. So I _think_ that's why
>>
>> __builtin_preserve_access_index(({
>>     p_core = ctx->regs.ip;
>> }));
>>
>> ...works; ctx->regs is fixed up to point at the right place, then
>> CO-RE does its thing with the result. Contrast this with
>>
>> bpf_probe_read_kernel(&ip, sizeof(ip), &ctx->regs.ip);
>>
>> In the latter case, the fixups don't seem to happen, and we get a
>> bogus address which appears to be consistently 218 bytes after the ctx
>> pointer. I've confirmed that a basic bpf_probe_read_kernel()
>> exposes the issue (and gives the same wrong address as a CO-RE-wrapped
>> bpf_probe_read_kernel()).
>>
>> I tried some permutations like defining
>>
>> struct pt_regs *regs = &ctx->regs;
>>
>> ...to see if that helps, but I think in that case the accesses aren't
>> caught by the verifier because we use the & operator on ctx->regs.
>>
>> I'm not sure how smart the verifier can be about context accesses like
>> this; can someone who understands that code better than me take a look?
>
> Hi Alan,
>
> Your analysis is correct: the verifier applies rewrites to instructions
> that read from or write to certain context fields, including those of
> `struct bpf_perf_event_data`.
>
> This is done by the function verifier.c:convert_ctx_accesses().
> This function handles BPF_LDX, BPF_STX and BPF_ST instructions, but it
> does not handle calls to helpers like bpf_probe_read_kernel().
>
> So, when the code generated for PT_REGS_IP(&ctx->regs) is processed, the
> correct access sequence is inserted by the function
> bpf_trace.c:pe_prog_convert_ctx_access() (see below).
>
> But the code generated for `PT_REGS_IP_CORE(&ctx->regs)` is not modified.
>
Ah, makes sense.
Would you consider it a bug that helper parameters don't get context
conversions applied, or are there additional complexities here that mean
that's not doable? (I'm wondering if we should fix this versus document
it.) I would have thought the only difference is the destination
register, but the verifier is a mysterious land to me...

> It looks like the `PT_REGS_IP_CORE` macro should not be defined through
> bpf_probe_read_kernel(). I'll dig through the commit history tomorrow to
> understand why it is defined like that now.
>
If I recall correctly, the rationale was to allow the macros to work both
for BPF programs that can do direct dereference (fentry, fexit, tp_btf,
etc.) and for kprobe-style programs that need to use
bpf_probe_read_kernel(). I'm not sure it would be worth having variants
that are purely dereference-based, since we can just use PT_REGS_IP()
thanks to the __builtin_preserve_access_index attributes applied in
vmlinux.h. Thanks!

Alan

> Thanks,
> Eduard
>
> ---
> Below is an annotated example; the impatient reader might skip it.
>
> For the following test program:
>
> #include "vmlinux.h"
> ...
> SEC("perf_event")
> int do_test(struct bpf_perf_event_data *ctx)
> {
>     unsigned long p = PT_REGS_IP(&ctx->regs);
>     unsigned long p_core = PT_REGS_IP_CORE(&ctx->regs);
>     bpf_printk("non-CO-RE: %lx, CO-RE: %lx", p, p_core);
>     return 0;
> }
>
> The generated BPF code looks as follows:
>
> $ llvm-objdump --no-show-raw-insn -rd bpf.linked.o
> ...
> 0000000000000000 <do_test>:
> # Third argument for bpf_probe_read_kernel: offset of bpf_perf_event_data::regs.ip
>        0: r2 = 0x80
>           0000000000000000:  CO-RE <byte_off> [2] struct bpf_perf_event_data::regs.ip (0:0:16)
>        1: r3 = r1
>        2: r3 += r2
> # The "non-CO-RE" version of PT_REGS_IP is, in fact, CO-RE,
> # because `struct bpf_perf_event_data` has the preserve_access_index
> # tag in vmlinux.h.
> # Here regs.ip is stored in r6 to be used after the call
> # to bpf_probe_read_kernel() (from PT_REGS_IP_CORE).
>        3: r6 = *(u64 *)(r1 + 0x80)
>           0000000000000018:  CO-RE <byte_off> [2] struct bpf_perf_event_data::regs.ip (0:0:16)
> # First argument for bpf_probe_read_kernel: a place on the stack to put the read result.
>        4: r1 = r10
>        5: r1 += -0x8
> # Second argument for bpf_probe_read_kernel: the size of the field to read.
>        6: w2 = 0x8
> # Call to bpf_probe_read_kernel()
>        7: call 0x71
> # Fourth parameter of bpf_printk: p_core read from the stack
> # (written there by the call to bpf_probe_read_kernel)
>        8: r4 = *(u64 *)(r10 - 0x8)
> # First parameter of bpf_printk: control string
>        9: r1 = 0x0 ll
>           0000000000000048:  R_BPF_64_64 .rodata
> # Second parameter of bpf_printk: size of the control string
>       11: w2 = 0x1b
> # Third parameter of bpf_printk: p (see addr 3)
>       12: r3 = r6
> # Call to bpf_printk
>       13: call 0x6
> ;     return 0;
>       14: w0 = 0x0
>       15: exit
>
> I get the following BPF after all verifier rewrites are applied
> (including verifier.c:convert_ctx_accesses()):
>
> # ./tools/bpf/bpftool/bpftool prog dump xlated id 114
> int do_test(struct bpf_perf_event_data * ctx):
> ; int do_test(struct bpf_perf_event_data *ctx) {
>    0: (b7) r2 = 128                 | CO-RE replacement, 128 is a valid offset for
>                                     | bpf_perf_event_data::regs.ip in my kernel
>    1: (bf) r3 = r1
>    2: (0f) r3 += r2
>
>    3: (79) r6 = *(u64 *)(r1 +0)     | This is an expansion of
>    4: (79) r6 = *(u64 *)(r6 +128)   | r6 = *(u64 *)(r1 + 0x80)
>    5: (bf) r1 = r10                 | created by bpf_trace.c:pe_prog_convert_ctx_access()
>
>    6: (07) r1 += -8
>    7: (b4) w2 = 8
>    8: (85) call bpf_probe_read_kernel#-91984
>    9: (79) r4 = *(u64 *)(r10 -8)
>   10: (18) r1 = map[id:59][0]+0
>   12: (b4) w2 = 27
>   13: (bf) r3 = r6
>   14: (85) call bpf_trace_printk#-85520
>   15: (b4) w0 = 0
>   16: (95) exit
>