On Thu, Aug 15, 2024 at 12:59 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote: > > > > 在 2024/8/15 0:57, Andrii Nakryiko 写道: > > On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@xxxxxxxxxx> wrote: > >> > >> > >> > >> 在 2024/8/13 1:49, Andrii Nakryiko 写道: > >>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote: > >>>> > >>>> > >>>> > >>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道: > >>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote: > >>>>>> > >>>>>> Hi Andrii and Oleg. > >>>>>> > >>>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe > >>>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes > >>>>>> within the mailing list. Considering this interest, I've added you and other relevant > >>>>>> maintainers to the CC list for broader visibility and potential collaboration. > >>>>>> > >>>>> > >>>>> Hi Liao, > >>>>> > >>>>> As you can see there is an active work to improve uprobes, that > >>>>> changes lifetime management of uprobes, removes a bunch of locks taken > >>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can > >>>>> hold off a bit with your changes until all that lands. And then > >>>>> re-benchmark, as costs might shift. > >>>>> > >>>>> But also see some remarks below. > >>>>> > >>>>>> Thanks. > >>>>>> > >>>>>> 在 2024/7/27 17:44, Liao Chang 写道: > >>>>>>> The profiling result of single-thread model of selftests bench reveals > >>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on > >>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by > >>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take > >>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push. > >>>>>>> > >>>>>>> This patch introduce struct uprobe_breakpoint to track previously > >>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the > >>>>>>> need for redundant insn_slot writes and subsequent expensive cache > >>>>>>> flush, especially on architecture like ARM64. This patch has been tested > >>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest > >>>>>>> bench and Redis GET/SET benchmark result below reveal obivious > >>>>>>> performance gain. > >>>>>>> > >>>>>>> before-opt > >>>>>>> ---------- > >>>>>>> trig-uprobe-nop: 0.371 ± 0.001M/s (0.371M/prod) > >>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod) > >>>>>>> trig-uprobe-ret: 1.637 ± 0.001M/s (1.647M/prod) > >>>>> > >>>>> I'm surprised that nop and push variants are much slower than ret > >>>>> variant. This is exactly opposite on x86-64. Do you have an > >>>>> explanation why this might be happening? I see you are trying to > >>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow > >>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is > >>>>> installed on nop (for USDT) and on function entry (which would be push > >>>>> variant, `push %rbp` instruction). > >>>>> > >>>>> ret variant, for x86-64, causes one extra step to go back to user > >>>>> space to execute original instruction out-of-line, and then trapping > >>>>> back to kernel for running uprobe. Which is what you normally want to > >>>>> avoid. > >>>>> > >>>>> What I'm getting at here. It seems like maybe arm arch is missing fast > >>>>> emulated implementations for nops/push or whatever equivalents for > >>>>> ARM64 that is. Please take a look at that and see why those are slow > >>>>> and whether you can make those into fast uprobe cases? > >>>> > >>>> Hi Andrii, > >>>> > >>>> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive > >>>> compared to X86 behavior. My investigation revealed that the root cause lies in > >>>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions > >>>> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel > >>>> to handle these instructions out-of-line in userspace upon breakpoint exception > >>>> is handled, leading to a significant performance overhead compared to 'ret' variant, > >>>> which is already emulated. > >>>> > >>>> To address this issue, I've developed a patch supports the emulation of 'nop' and > >>>> 'push' variants. The benchmark results below indicates the performance gain of > >>>> emulation is obivious. > >>>> > >>>> xol (1 cpus) > >>>> ------------ > >>>> uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod) > >>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod) > >>>> uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod) > >>>> uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod) > >>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod) > >>>> uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod) > >>>> > >>>> emulation (1 cpus) > >>>> ------------------- > >>>> uprobe-nop: 1.862 ± 0.002M/s (1.862M/s/cpu) > >>>> uprobe-push: 1.743 ± 0.006M/s (1.743M/s/cpu) > >>>> uprobe-ret: 1.840 ± 0.001M/s (1.840M/s/cpu) > >>>> uretprobe-nop: 0.964 ± 0.004M/s (0.964M/s/cpu) > >>>> uretprobe-push: 0.936 ± 0.004M/s (0.936M/s/cpu) > >>>> uretprobe-ret: 0.940 ± 0.001M/s (0.940M/s/cpu) > >>>> > >>>> As you can see, the performance gap between nop/push and ret variants has been significantly > >>>> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent > >>>> more cycles than the other. > >>> > >>> Great, it's an obvious improvement. Are you going to send patches > >>> upstream? Please cc bpf@xxxxxxxxxxxxxxx as well. > >> > >> I'll need more time to thoroughly test this patch. The emulation o push/nop > >> instructions also impacts the kprobe/kretprobe paths on Arm64, As as result, > >> I'm working on enhancements to trig-kprobe/kretprobe to prevent performance > >> regression. > >> > >>> > >>> > >>> I'm also thinking we should update uprobe/uretprobe benchmarks to be > >>> less x86-specific. Right now "-nop" is the happy fastest case, "-push" > >>> is still happy, slightly slower case (due to the need to emulate stack > >>> operation) and "-ret" is meant to be the slow single-step case. We > >>> should adjust the naming and make sure that on ARM64 we hit similar > >>> code paths. Given you seem to know arm64 pretty well, can you please > >>> take a look at updating bench tool for ARM64 (we can also rename > >>> benchmarks to something a bit more generic, rather than using > >>> instruction names)? > >> > >> Let me use a matrix below for the structured comparsion of uprobe/uretprobe > >> benchmarks on X86 and Arm64: > >> > >> Architecture Instrution Type Handling method Performance > >> X86 nop Emulated Fastest > >> X86 push Emulated Fast > >> X86 ret Single-step Slow > >> Arm64 nop Emulated Fastest > >> Arm64 push Emulated Fast > >> Arm64 ret Emulated Faster > >> > >> I suggest categorize benchmarks into 'emu' for emulated instructions and 'ss' > >> for 'single-steppable' instructions. Generally, emulated instructions should > >> outperform single-step ones across different architectures. Regarding the > >> generic naming, I propose using a self-explanatory style, such as > >> s/nop/empty-insn/g, s/push/push-stack/g, s/ret/func-return/g. > >> > >> Above all, example "bench --list" output: > >> > >> X86: > >> ... > >> trig-uprobe-emu-empty-insn > >> trig-uprobe-ss-func-return > >> trig-uprobe-emu-push-stack > >> trig-uretprobe-emu-empyt-insn > >> trig-uretprobe-ss-func-return > >> trig-uretprobe-emu-push-stack > >> ... > >> > >> Arm64: > >> ... > >> trig-uprobe-emu-empty-insn > >> trig-uprobe-emu-func-return > >> trig-uprobe-emu-push-stack > >> trig-uretprobe-emu-empyt-insn > >> trig-uretprobe-emu-func-return > >> trig-uretprobe-emu-push-stack > >> ... > >> > >> This structure will allow for direct comparison of uprobe/uretprobe > >> performance across different architectures and instruction types. > >> Please let me know your thought, Andrii. > > > > Tbh, sounds a bit like an overkill. But before we decide on naming, > > what kind of situation is single-stepped on arm64? > > On Arm64, the following instruction types are generally not single-steppable. > > - Modifying and reading PC, including 'ret' and various branch instructions. > > - Forming a PC-relative address using the PC and an immediate value. > > - Generating exception, includes BRK, HLT, HVC, SMC, SVC. > > - Loading memory at address calculated based on the PC and an immediate offset. > > - Moving general-purpose register to system registers of PE (similar to logical cores on x86). > > - Hint instruction cause exception or unintend behavior for single-stepping. > However, 'nop' is steppable hint. > > Most parts of instructions that doesn't fall into any of these types are single-stepped, > including the Arm64 equvialents of 'push'. Ok, so you special-cased the "push %rbp" equivalent, by any other push-like instruction will be single-stepped, right? So how about we make sure that we have three classes of uprobe/uretprobe benchmarks: - fastest nop case, and we call it uprobe/uretprobe-usdt or keep it as uprobe/uretprobe-nop. USDT is sort of a target case for this, so I'm fine changing the name; - slightly less fast but common "push %rbp"-like case, which we can call uprobe-entry (as in function entry case); - and slowest single-stepped, here the naming is a bit less clear, so uprobe-slow or uprobe-ss (single-step, but if someone wants to read "super slow" I'm fine with it as well ;). Or uprobe-sstep, I don't know. WDYT? > > Thanks. > > > > >> > >> Thanks. > >> > >>> > >>>> > >>>>> > >>>>>>> trig-uretprobe-nop: 0.331 ± 0.004M/s (0.331M/prod) > >>>>>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod) > >>>>>>> trig-uretprobe-ret: 0.854 ± 0.002M/s (0.854M/prod) > >>>>>>> Redis SET (RPS) uprobe: 42728.52 > >>>>>>> Redis GET (RPS) uprobe: 43640.18 > >>>>>>> Redis SET (RPS) uretprobe: 40624.54 > >>>>>>> Redis GET (RPS) uretprobe: 41180.56 > >>>>>>> > >>>>>>> after-opt > >>>>>>> --------- > >>>>>>> trig-uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod) > >>>>>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod) > >>>>>>> trig-uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod) > >>>>>>> trig-uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod) > >>>>>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod) > >>>>>>> trig-uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod) > >>>>>>> Redis SET (RPS) uprobe: 43939.69 > >>>>>>> Redis GET (RPS) uprobe: 45200.80 > >>>>>>> Redis SET (RPS) uretprobe: 41658.58 > >>>>>>> Redis GET (RPS) uretprobe: 42805.80 > >>>>>>> > >>>>>>> While some uprobes might still need to share the same insn_slot, this > >>>>>>> patch compare the instructions in the resued insn_slot with the > >>>>>>> instructions execute out-of-line firstly to decides allocate a new one > >>>>>>> or not. > >>>>>>> > >>>>>>> Additionally, this patch use a rbtree associated with each thread that > >>>>>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the > >>>>>>> rbtree of uprobe_breakpoints has smaller node, better locality and less > >>>>>>> contention, it result in faster lookup times compared to find_uprobe(). > >>>>>>> > >>>>>>> The other part of this patch are some necessary memory management for > >>>>>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly > >>>>>>> hit uprobe that doesn't already have a corresponding node in rbtree. All > >>>>>>> uprobe_breakpoints will be freed when thread exit. > >>>>>>> > >>>>>>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx> > >>>>>>> --- > >>>>>>> include/linux/uprobes.h | 3 + > >>>>>>> kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++------- > >>>>>>> 2 files changed, 211 insertions(+), 38 deletions(-) > >>>>>>> > >>>>> > >>>>> [...] > >>>> > >>>> -- > >>>> BR > >>>> Liao, Chang > >> > >> -- > >> BR > >> Liao, Chang > > -- > BR > Liao, Chang