On 2024/8/13 1:49, Andrii Nakryiko wrote:
> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>
>>
>>
>> On 2024/8/9 2:26, Andrii Nakryiko wrote:
>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>>>
>>>> Hi Andrii and Oleg.
>>>>
>>>> This patch, which I sent two weeks ago, also aims to optimize the
>>>> performance of uprobes on arm64. I have noticed recent discussions on the
>>>> performance and scalability of uprobes on the mailing list. Given this
>>>> interest, I've added you and other relevant maintainers to the CC list
>>>> for broader visibility and potential collaboration.
>>>>
>>>
>>> Hi Liao,
>>>
>>> As you can see there is active work to improve uprobes that changes
>>> lifetime management of uprobes, removes a bunch of locks taken in the
>>> uprobe/uretprobe hot path, etc. It would be nice if you could hold off
>>> a bit with your changes until all that lands. And then re-benchmark,
>>> as costs might shift.
>>>
>>> But also see some remarks below.
>>>
>>>> Thanks.
>>>>
>>>> On 2024/7/27 17:44, Liao Chang wrote:
>>>>> Profiling the single-thread mode of the selftests bench reveals
>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou()
>>>>> on ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() takes
>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>
>>>>> This patch introduces struct uprobe_breakpoint to track the previously
>>>>> allocated insn_slot for frequently hit uprobes. It effectively reduces
>>>>> the need for redundant insn_slot writes and the subsequent expensive
>>>>> cache flushes, especially on architectures like ARM64. This patch has
>>>>> been tested on a Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores @ 2.4GHz.
>>>>> The selftests bench and Redis GET/SET benchmark results below reveal an
>>>>> obvious performance gain.
>>>>>
>>>>> before-opt
>>>>> ----------
>>>>> trig-uprobe-nop: 0.371 ± 0.001M/s (0.371M/prod)
>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>> trig-uprobe-ret: 1.637 ± 0.001M/s (1.647M/prod)
>>>
>>> I'm surprised that the nop and push variants are much slower than the
>>> ret variant. This is exactly the opposite on x86-64. Do you have an
>>> explanation why this might be happening? I see you are trying to
>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>> variant of uprobe that normally shouldn't be used. Typically a uprobe
>>> is installed on a nop (for USDT) or on function entry (which would be
>>> the push variant, the `push %rbp` instruction).
>>>
>>> The ret variant, for x86-64, causes one extra step to go back to user
>>> space to execute the original instruction out-of-line, and then trap
>>> back into the kernel to run the uprobe. Which is what you normally
>>> want to avoid.
>>>
>>> What I'm getting at here: it seems like maybe the arm64 arch is missing
>>> fast emulated implementations for nop/push, or whatever the ARM64
>>> equivalents are. Please take a look at that and see why those are slow
>>> and whether you can make them into fast uprobe cases.
>>
>> Hi Andrii,
>>
>> As you correctly pointed out, the benchmark results on arm64 are
>> counterintuitive compared to the x86 behavior. My investigation revealed
>> that the root cause lies in arch_uprobe_analyze_insn(), which excludes
>> the arm64 equivalents of the 'nop' and 'push' instructions from the
>> emulatable instruction list. This forces the kernel to handle these
>> instructions out-of-line in userspace when a breakpoint exception is
>> handled, leading to a significant performance overhead compared to the
>> 'ret' variant, which is already emulated.
>>
>> To address this issue, I've developed a patch that supports emulating
>> the 'nop' and 'push' variants.
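To make this concrete for readers following the thread: below is a minimal,
illustrative sketch of what such emulation handlers could look like, not the
actual patch. The handler shape follows the style of
arch/arm64/kernel/probes/simulate-insn.c, but simulate_nop()/simulate_push()
are simplified placeholders and the fault handling is elided.

#include <linux/uaccess.h>	/* copy_to_user() */
#include <asm/ptrace.h>		/* pt_regs, instruction_pointer_set() */

/* 'nop' has no architectural effect; just step over the instruction. */
static void simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
{
	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
}

/* Function-entry 'push' on arm64: stp x29, x30, [sp, #-16]! */
static void simulate_push(u32 opcode, long addr, struct pt_regs *regs)
{
	u64 frame[2] = { regs->regs[29], regs->regs[30] };	/* fp, lr */
	unsigned long new_sp = regs->sp - sizeof(frame);

	/* The store to the user stack is what makes emulated push
	 * costlier than emulated nop. */
	if (copy_to_user((void __user *)new_sp, frame, sizeof(frame)))
		return;		/* fault handling elided in this sketch */

	regs->sp = new_sp;
	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
}

The decoder (arm_probe_decode_insn()) would then register such handlers for
the matching encodings instead of falling back to an out-of-line slot.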
>>
>> The benchmark results below indicate an obvious performance gain from
>> emulation:
>>
>> xol (1 cpus)
>> ------------
>> uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod)
>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>> uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod)
>> uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod)
>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>> uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod)
>>
>> emulation (1 cpus)
>> -------------------
>> uprobe-nop: 1.862 ± 0.002M/s (1.862M/s/cpu)
>> uprobe-push: 1.743 ± 0.006M/s (1.743M/s/cpu)
>> uprobe-ret: 1.840 ± 0.001M/s (1.840M/s/cpu)
>> uretprobe-nop: 0.964 ± 0.004M/s (0.964M/s/cpu)
>> uretprobe-push: 0.936 ± 0.004M/s (0.936M/s/cpu)
>> uretprobe-ret: 0.940 ± 0.001M/s (0.940M/s/cpu)
>>
>> As you can see, the performance gap between the nop/push and ret variants
>> has been significantly reduced. Because emulating the 'push' instruction
>> needs to access userspace memory, it spends more cycles than the others.
>
> Great, it's an obvious improvement. Are you going to send patches
> upstream? Please cc bpf@xxxxxxxxxxxxxxx as well.

I'll need more time to thoroughly test this patch. The emulation of the
push/nop instructions also impacts the kprobe/kretprobe paths on arm64.
As a result, I'm working on enhancements to trig-kprobe/kretprobe to
prevent a performance regression.

>
>
> I'm also thinking we should update uprobe/uretprobe benchmarks to be
> less x86-specific. Right now "-nop" is the happy fastest case, "-push"
> is still a happy, slightly slower case (due to the need to emulate the
> stack operation), and "-ret" is meant to be the slow single-step case.
> We should adjust the naming and make sure that on ARM64 we hit similar
> code paths. Given you seem to know arm64 pretty well, can you please
> take a look at updating the bench tool for ARM64 (we can also rename
> benchmarks to something a bit more generic, rather than using
> instruction names)?

Let me use the matrix below for a structured comparison of the
uprobe/uretprobe benchmarks on x86 and arm64:

Architecture  Instruction  Handling method  Performance
------------  -----------  ---------------  -----------
X86           nop          Emulated         Fastest
X86           push         Emulated         Fast
X86           ret          Single-step      Slow
Arm64         nop          Emulated         Fastest
Arm64         push         Emulated         Fast
Arm64         ret          Emulated         Faster

I suggest categorizing the benchmarks into 'emu' for emulated instructions
and 'ss' for single-steppable instructions. Generally, emulated instructions
should outperform single-stepped ones across different architectures.
Regarding the generic naming, I propose a self-explanatory style, such as
s/nop/empty-insn/g, s/push/push-stack/g and s/ret/func-return/g. Putting it
all together, an example "bench --list" output would be:

X86:
...
trig-uprobe-emu-empty-insn
trig-uprobe-ss-func-return
trig-uprobe-emu-push-stack
trig-uretprobe-emu-empty-insn
trig-uretprobe-ss-func-return
trig-uretprobe-emu-push-stack
...

Arm64:
...
trig-uprobe-emu-empty-insn
trig-uprobe-emu-func-return
trig-uprobe-emu-push-stack
trig-uretprobe-emu-empty-insn
trig-uretprobe-emu-func-return
trig-uretprobe-emu-push-stack
...

This structure will allow a direct comparison of uprobe/uretprobe
performance across different architectures and instruction types.
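As a concrete illustration, one of the renamed benches might be declared
along the following lines, mirroring the shape of the existing trig-uprobe-*
entries in tools/testing/selftests/bpf/benchs/bench_trigger.c. Note that
uprobe_emu_empty_insn_setup() and uprobe_producer() are hypothetical
placeholders, and the exact field set should be checked against bench.h:

/* Illustrative sketch only: a renamed trigger bench following the
 * existing pattern; the setup and producer helpers do not exist yet. */
const struct bench bench_trig_uprobe_emu_empty_insn = {
	.name = "trig-uprobe-emu-empty-insn",
	.setup = uprobe_emu_empty_insn_setup,	/* attach to a nop-equivalent insn */
	.producer_thread = uprobe_producer,
	.measure = trigger_measure,
	.report_progress = hits_drops_report_progress,
	.report_final = hits_drops_report_final,
};

The same emu/ss naming scheme would then apply uniformly to the uretprobe
entries.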
Please let me know your thoughts, Andrii. Thanks.

>
>>
>>>
>>>>> trig-uretprobe-nop: 0.331 ± 0.004M/s (0.331M/prod)
>>>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>>>> trig-uretprobe-ret: 0.854 ± 0.002M/s (0.854M/prod)
>>>>> Redis SET (RPS) uprobe: 42728.52
>>>>> Redis GET (RPS) uprobe: 43640.18
>>>>> Redis SET (RPS) uretprobe: 40624.54
>>>>> Redis GET (RPS) uretprobe: 41180.56
>>>>>
>>>>> after-opt
>>>>> ---------
>>>>> trig-uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod)
>>>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>> trig-uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod)
>>>>> trig-uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod)
>>>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>> trig-uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod)
>>>>> Redis SET (RPS) uprobe: 43939.69
>>>>> Redis GET (RPS) uprobe: 45200.80
>>>>> Redis SET (RPS) uretprobe: 41658.58
>>>>> Redis GET (RPS) uretprobe: 42805.80
>>>>>
>>>>> While some uprobes might still need to share the same insn_slot, this
>>>>> patch compares the instructions in the reused insn_slot with the
>>>>> instructions to be executed out-of-line first, to decide whether to
>>>>> allocate a new one.
>>>>>
>>>>> Additionally, this patch uses an rbtree, associated with each thread
>>>>> that hits uprobes, to manage the allocated uprobe_breakpoint data.
>>>>> Because the rbtree of uprobe_breakpoints has smaller nodes, better
>>>>> locality and less contention, it results in faster lookups compared
>>>>> to find_uprobe().
>>>>>
>>>>> The other part of this patch is the necessary memory management for
>>>>> the uprobe_breakpoint data. A uprobe_breakpoint is allocated for each
>>>>> newly hit uprobe that doesn't already have a corresponding node in the
>>>>> rbtree. All uprobe_breakpoints are freed when the thread exits.
>>>>>
>>>>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
>>>>> ---
>>>>>  include/linux/uprobes.h |   3 +
>>>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>>>
>>>
>>> [...]
>>
>> --
>> BR
>> Liao, Chang

--
BR
Liao, Chang