On 12/11/2024 17:07, Yonghong Song wrote: > > > > On 11/12/24 8:56 AM, Alan Maguire wrote: >> On 12/11/2024 01:51, Yonghong Song wrote: >>> >>> >>> On 11/11/24 7:39 AM, Alan Maguire wrote: >>>> On 08/11/2024 18:05, Yonghong Song wrote: >>>>> Song Liu reported that a kernel func (perf_event_read()) cannot be >>>>> traced >>>>> in certain situations since the func is not in vmlinux bTF. This >>>>> happens >>>>> in kernels 6.4, 6.9 and 6.11 and the kernel is built with pahole 1.27. >>>>> >>>>> The perf_event_read() signature in kernel (kernel/events/core.c): >>>>> static int perf_event_read(struct perf_event *event, bool group) >>>>> >>>>> Adding '-V' to pahole command line, and the following error msg can >>>>> be found: >>>>> skipping addition of 'perf_event_read'(perf_event_read) due to >>>>> unexpected register used for parameter >>>>> >>>>> Eventually the error message is attributed to the setting >>>>> (parm->unexpected_reg = 1) in parameter__new() function. >>>>> >>>>> The following is the dwarf representation for perf_event_read(): >>>>> 0x0334c034: DW_TAG_subprogram >>>>> DW_AT_low_pc (0xffffffff812c6110) >>>>> DW_AT_high_pc (0xffffffff812c640a) >>>>> DW_AT_frame_base (DW_OP_reg7 RSP) >>>>> DW_AT_GNU_all_call_sites (true) >>>>> DW_AT_name ("perf_event_read") >>>>> DW_AT_decl_file ("/rw/compile/kernel/events/core.c") >>>>> DW_AT_decl_line (4641) >>>>> DW_AT_prototyped (true) >>>>> DW_AT_type (0x03324f6a "int") >>>>> 0x0334c04e: DW_TAG_formal_parameter >>>>> DW_AT_location (0x007de9fd: >>>>> [0xffffffff812c6115, 0xffffffff812c6141): >>>>> DW_OP_reg5 RDI >>>>> [0xffffffff812c6141, 0xffffffff812c6323): >>>>> DW_OP_reg14 R14 >>>>> [0xffffffff812c6323, 0xffffffff812c63fe): >>>>> DW_OP_GNU_entry_value(DW_OP_reg5 RDI), DW_OP_stack_value >>>>> [0xffffffff812c63fe, 0xffffffff812c6405): >>>>> DW_OP_reg14 R14 >>>>> [0xffffffff812c6405, 0xffffffff812c640a): >>>>> DW_OP_GNU_entry_value(DW_OP_reg5 RDI), DW_OP_stack_value) >>>>> DW_AT_name ("event") >>>>> DW_AT_decl_file ("/rw/compile/kernel/events/ >>>>> core.c") >>>>> DW_AT_decl_line (4641) >>>>> DW_AT_type (0x0333aac2 "perf_event *") >>>>> 0x0334c05e: DW_TAG_formal_parameter >>>>> DW_AT_location (0x007dea82: >>>>> [0xffffffff812c6137, 0xffffffff812c63f2): >>>>> DW_OP_reg12 R12 >>>>> [0xffffffff812c63f2, 0xffffffff812c63fe): >>>>> DW_OP_GNU_entry_value(DW_OP_reg4 RSI), DW_OP_stack_value >>>>> [0xffffffff812c63fe, 0xffffffff812c640a): >>>>> DW_OP_reg12 R12) >>>>> DW_AT_name ("group") >>>>> DW_AT_decl_file ("/rw/compile/kernel/events/ >>>>> core.c") >>>>> DW_AT_decl_line (4641) >>>>> DW_AT_type (0x03327059 "bool") >>>>> >>>>> By inspecting the binary, the second argument ("bool group") is used >>>>> in the function. The following are the disasm code: >>>>> ffffffff812c6110 <perf_event_read>: >>>>> ffffffff812c6110: 0f 1f 44 00 00 nopl (%rax,%rax) >>>>> ffffffff812c6115: 55 pushq %rbp >>>>> ffffffff812c6116: 41 57 pushq %r15 >>>>> ffffffff812c6118: 41 56 pushq %r14 >>>>> ffffffff812c611a: 41 55 pushq %r13 >>>>> ffffffff812c611c: 41 54 pushq %r12 >>>>> ffffffff812c611e: 53 pushq %rbx >>>>> ffffffff812c611f: 48 83 ec 18 subq $24, %rsp >>>>> ffffffff812c6123: 41 89 f4 movl %esi, %r12d >>>>> <=========== NOTE that here '%esi' is used and moved to '%r12d'. >>>>> ffffffff812c6126: 49 89 fe movq %rdi, %r14 >>>>> ffffffff812c6129: 65 48 8b 04 25 28 00 00 00 movq %gs:40, >>>>> %rax >>>>> ffffffff812c6132: 48 89 44 24 10 movq %rax, 16(%rsp) >>>>> ffffffff812c6137: 8b af a8 00 00 00 movl 168(%rdi), %ebp >>>>> ffffffff812c613d: 85 ed testl %ebp, %ebp >>>>> ffffffff812c613f: 75 3f jne >>>>> 0xffffffff812c6180 <perf_event_read+0x70> >>>>> ffffffff812c6141: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs: >>>>> (%rax,%rax) >>>>> ffffffff812c614b: 0f 1f 44 00 00 nopl (%rax,%rax) >>>>> ffffffff812c6150: 49 8b 9e 28 02 00 00 movq 552(%r14), %rbx >>>>> ffffffff812c6157: 48 89 df movq %rbx, %rdi >>>>> ffffffff812c615a: e8 c1 a0 d7 00 callq >>>>> 0xffffffff82040220 <_raw_spin_lock_irqsave> >>>>> ffffffff812c615f: 49 89 c7 movq %rax, %r15 >>>>> ffffffff812c6162: 41 8b ae a8 00 00 00 movl 168(%r14), %ebp >>>>> ffffffff812c6169: 85 ed testl %ebp, %ebp >>>>> ffffffff812c616b: 0f 84 9a 00 00 00 je >>>>> 0xffffffff812c620b <perf_event_read+0xfb> >>>>> ffffffff812c6171: 48 89 df movq %rbx, %rdi >>>>> ffffffff812c6174: 4c 89 fe movq %r15, %rsi >>>>> <=========== NOTE: %rsi is overwritten >>>>> ...... >>>>> ffffffff812c63f0: 41 5c popq %r12 >>>>> <============ POP r12 >>>>> ffffffff812c63f2: 41 5d popq %r13 >>>>> ffffffff812c63f4: 41 5e popq %r14 >>>>> ffffffff812c63f6: 41 5f popq %r15 >>>>> ffffffff812c63f8: 5d popq %rbp >>>>> ffffffff812c63f9: e9 e2 a8 d7 00 jmp >>>>> 0xffffffff82040ce0 <__x86_return_thunk> >>>>> ffffffff812c63fe: 31 c0 xorl %eax, %eax >>>>> ffffffff812c6400: e9 be fe ff ff jmp >>>>> 0xffffffff812c62c3 <perf_event_read+0x1b3> >>>>> >>>>> It is not clear why dwarf didn't encode %rsi in locations. But >>>>> DW_OP_GNU_entry_value(DW_OP_reg4 RSI) tells us that RSI is live at >>>>> the entry of perf_event_read(). So this patch tries to search >>>>> DW_OP_GNU_entry_value/DW_OP_entry_value location/expression so if >>>>> the expected parameter register matchs the register in >>>>> DW_OP_GNU_entry_value/DW_OP_entry_value, then the original parameter >>>>> is not optimized. >>>>> >>>>> For one of internal 6.11 kernel, there are 62498 functions in BTF and >>>>> perf_event_read() is not there. With this patch, there are 61552 >>>>> functions >>>>> in BTF and perf_event_read() is included. >>>>> >>>> hi Yonghong, >>>> >>>> I'm confused by these numbers. I would have thought your changes would >>>> have led to a net increase of functions encoded in vmlinux BTF since we >>>> are now likely catching more cases where registers are expected. >>>> When I >>>> ran your patches against an LLVM-built kernel, that's what I saw; 70 >>>> additional functions were recognized as having expected parameters, and >>>> thus were encoded in BTF. In your case it looks like we lost nearly >>>> 1000 >>>> functions. Any idea what's going on there? If you can share your >>>> config, >>>> LLVM version I can dig into this from my side too. Thanks! >>> Attached is my config (based on one of meta internal configs). I tried >>> with master branch with head: >>> >>> 7b6e5bfa2541380b478ea1532880210ea3e39e11 (HEAD -> master, origin/master, >>> origin/HEAD) Merge branch 'refactor-lock-management' >>> ae6e3a273f590a2b64f14a9fab3546c3a8f44ed4 bpf: Drop special callback >>> reference handling >>> f6b9a69a9e56b2083aca8a925fc1a28eb698e3ed bpf: Refactor active lock >>> management >>> >>> I am using pahole v1.27. >>> >>> I am using an llvm built from upstream. The following is llvm-project >>> head: >>> beb12f92c71981670e07e47275efc6b5647011c1 (HEAD -> main) [RISCV] Add >>> +optimized-nfN-segment-load-store (#114414) >>> 6bad4514c938b3b48c0c719b8dd98b3906f2c290 [AArch64] Extend vector mull >>> test coverage. NFC >>> 915b910d800d7fab6a692294ff1d7075d8cba824 [libc] Fix typos in proxy type >>> headers (#114717) >>> 98ea1a81a28a6dd36941456c8ab4ce46f665f57a [IPO] Remove unused includes >>> (NFC) (#114716) >>> >>> With the above setup, when to do >>> >>> pahole -JV -- >>> btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs vmlinux >& log.pahole >>> >>> You will find the below info in the log: >>> skipping addition of 'perf_event_read'(perf_event_read) due to >>> unexpected register used for paramet >>> >>> In the dwarf: >>> >>> 0x02122746: DW_TAG_subprogram >>> DW_AT_low_pc (0xffffffff81299740) >>> DW_AT_high_pc (0xffffffff812999f7) >>> DW_AT_frame_base (DW_OP_reg7 RSP) >>> DW_AT_GNU_all_call_sites (true) >>> DW_AT_name ("perf_event_read") >>> DW_AT_decl_file ("/home/yhs/work/bpf-next/kernel/ >>> events/ >>> core.c") >>> DW_AT_decl_line (4746) >>> DW_AT_prototyped (true) >>> DW_AT_type (0x020f95f5 "int") >>> >>> 0x02122760: DW_TAG_formal_parameter >>> DW_AT_location (0x00769b72: >>> [0xffffffff81299745, 0xffffffff81299764): >>> DW_OP_reg5 RDI >>> [0xffffffff81299764, 0xffffffff81299937): >>> DW_OP_reg3 RBX >>> [0xffffffff81299937, 0xffffffff812999f0): >>> DW_OP_GNU_entry_value(DW_OP_reg5 RDI), DW_OP_stack_value >>> [0xffffffff812999f0, 0xffffffff812999f7): >>> DW_OP_reg3 RBX) >>> DW_AT_name ("event") >>> DW_AT_decl_file ("/home/yhs/work/bpf-next/ >>> kernel/events/core.c") >>> DW_AT_decl_line (4746) >>> DW_AT_type (0x0210f654 "perf_event *") >>> 0x02122770: DW_TAG_formal_parameter >>> DW_AT_location (0x00769c61: >>> [0xffffffff81299758, 0xffffffff81299926): >>> DW_OP_reg6 RBP >>> [0xffffffff81299926, 0xffffffff812999f0): >>> DW_OP_GNU_entry_value(DW_OP_reg4 RSI), DW_OP_stack_value >>> [0xffffffff812999f0, 0xffffffff812999f7): >>> DW_OP_reg6 RBP) >>> DW_AT_name ("group") >>> DW_AT_decl_file ("/home/yhs/work/bpf-next/ >>> kernel/events/core.c") >>> >>> The above is slightly different from our production kernel where Song >>> reported. But essence is the same. >>> The second parameter needs to check DW_OP_GNU_entry_value(DW_OP_reg4 >>> RSI) to ensure the second >>> argument is available. >>> >>> My patch is supposed to only make improvement. I am curiously why you >>> get less functions encoded in BTF. >>> >> Thanks for the config etc! When I build bpf-next using master branch >> llvm and this config, I see >> >> with baseline (master branch pahole): 62371 functions, no perf_event_read >> your series on top of master branch pahole: 62433 functions, >> perf_event_read present >> >> So that's consistent with what I've seen with other configs; more >> functions are present in vmlinux BTF since we are now seeing more cases >> where parameters are in fact consistent. The part that confuses me >> though is the numbers you initially reported above >> >> "for one of internal 6.11 kernel, there are 62498 functions in BTF and >> perf_event_read() is not there. With this patch, there are 61552 >> functions in BTF and perf_event_read() is included." >> >> These numbers suggest you lost nearly 1000 functions when building >> vmlinux BTF with pahole using this series. That's the part I don't >> understand - we should just see a gain in numbers of functions in >> vmlinux BTF, right? Did you mean 62552 functions rather than 61552 >> perhaps? > > Sorry, really embarrassing. it is typo. Indeed it should be 62552 functions > in BTF instead. > No problem, makes perfect sense now, thanks! I'm trying to reproduce the core dumps Eduard saw now with this setup; I'll report back if I manage to do so and see if locks as Jiri and Arnaldo suggested help. If so a v2 along the lines of Eduard's suggested change plus locking might be the best approach, what do you think? Thanks! Alan