On 2024/8/9 2:26, Andrii Nakryiko wrote:
> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>
>> Hi Andrii and Oleg.
>>
>> This patch, which I sent two weeks ago, also aims to optimize the
>> performance of uprobes on arm64. I noticed recent discussions on the
>> performance and scalability of uprobes within the mailing list.
>> Considering this interest, I've added you and other relevant
>> maintainers to the CC list for broader visibility and potential
>> collaboration.
>>
>
> Hi Liao,
>
> As you can see there is active work to improve uprobes, which changes
> the lifetime management of uprobes, removes a bunch of locks taken in
> the uprobe/uretprobe hot path, etc. It would be nice if you could hold
> off a bit with your changes until all that lands, and then
> re-benchmark, as costs might shift.

Andrii, I'm trying to integrate your lockless changes into the upstream
next-20240806 kernel tree, and I ran into some conflicts. Please let me
know which kernel tree you're currently working on. Thanks.

>
> But also see some remarks below.
>
>> Thanks.
>>
>> On 2024/7/27 17:44, Liao Chang wrote:
>>> The profiling result of the single-thread model of the selftests
>>> bench reveals performance bottlenecks in find_uprobe() and
>>> caches_clean_inval_pou() on ARM64. On my local testing machine, 5%
>>> of CPU time is consumed by find_uprobe() for trig-uprobe-ret, while
>>> caches_clean_inval_pou() takes about 34% of CPU time for
>>> trig-uprobe-nop and trig-uprobe-push.
>>>
>>> This patch introduces struct uprobe_breakpoint to track the
>>> previously allocated insn_slot for frequently hit uprobes. It
>>> effectively reduces redundant insn_slot writes and the subsequent
>>> expensive cache flushes, especially on architectures like ARM64.
>>> This patch has been tested on Kunpeng916 (Hi1616), 4 NUMA nodes,
>>> 64 cores @ 2.4GHz. The selftests bench and Redis GET/SET benchmark
>>> results below show an obvious performance gain.
>>>
>>> before-opt
>>> ----------
>>> trig-uprobe-nop: 0.371 ± 0.001M/s (0.371M/prod)
>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>> trig-uprobe-ret: 1.637 ± 0.001M/s (1.647M/prod)
>
> I'm surprised that the nop and push variants are much slower than the
> ret variant. This is exactly the opposite on x86-64. Do you have an
> explanation for why this might be happening? I see you are trying to
> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> variant of uprobe that normally shouldn't be used. Typically a uprobe
> is installed on a nop (for USDT) or on a function entry (which would
> be the push variant, the `push %rbp` instruction).
>
> The ret variant, for x86-64, causes one extra step of going back to
> user space to execute the original instruction out-of-line, and then
> trapping back into the kernel to run the uprobe. Which is what you
> normally want to avoid.
>
> What I'm getting at here: it seems like maybe the arm arch is missing
> fast emulated implementations for nop/push, or whatever the ARM64
> equivalents are. Please take a look at that and see why those are
> slow and whether you can make those into fast uprobe cases?

I will spend the weekend figuring out the questions you raised. Thanks
for pointing them out.

>
>>> trig-uretprobe-nop: 0.331 ± 0.004M/s (0.331M/prod)
>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>> trig-uretprobe-ret: 0.854 ± 0.002M/s (0.854M/prod)
>>> Redis SET (RPS) uprobe: 42728.52
>>> Redis GET (RPS) uprobe: 43640.18
>>> Redis SET (RPS) uretprobe: 40624.54
>>> Redis GET (RPS) uretprobe: 41180.56
>>>
>>> after-opt
>>> ---------
>>> trig-uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod)
>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>> trig-uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod)
>>> trig-uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod)
>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>> trig-uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod)
>>> Redis SET (RPS) uprobe: 43939.69
>>> Redis GET (RPS) uprobe: 45200.80
>>> Redis SET (RPS) uretprobe: 41658.58
>>> Redis GET (RPS) uretprobe: 42805.80
>>>
>>> While some uprobes might still need to share the same insn_slot,
>>> this patch first compares the instructions in the reused insn_slot
>>> with the instructions to be executed out-of-line, to decide whether
>>> a new one needs to be allocated or not.
>>>
>>> Additionally, this patch uses an rbtree associated with each thread
>>> that hits uprobes to manage the allocated uprobe_breakpoint data.
>>> Because the rbtree of uprobe_breakpoints has smaller nodes, better
>>> locality, and less contention, it results in faster lookup times
>>> compared to find_uprobe().
>>>
>>> The other part of this patch is the necessary memory management for
>>> the uprobe_breakpoint data. A uprobe_breakpoint is allocated for
>>> each newly hit uprobe that doesn't already have a corresponding node
>>> in the rbtree. All uprobe_breakpoints are freed when the thread
>>> exits.
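
To make the quoted changelog a bit more concrete, the per-thread cache
it describes boils down to roughly the sketch below. All names in it
are illustrative only, invented for this mail; the actual patch differs
in detail.

#include <linux/rbtree.h>

/*
 * Illustrative sketch only -- struct layout and function names are
 * made up here and are not the real patch code.
 */
struct uprobe_breakpoint {
	struct rb_node	node;		/* link in the per-thread rbtree      */
	unsigned long	vaddr;		/* probed virtual address (tree key)  */
	unsigned long	slot_vaddr;	/* insn_slot used by the previous hit */
};

/*
 * Per-thread lookup: each thread owns its own tree, so no lock is
 * taken, and the tree only contains uprobes this thread has actually
 * hit, keeping it much smaller than the global tree that
 * find_uprobe() walks.
 */
static struct uprobe_breakpoint *ubp_find(struct rb_root *root,
					  unsigned long vaddr)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct uprobe_breakpoint *ubp;

		ubp = rb_entry(n, struct uprobe_breakpoint, node);
		if (vaddr < ubp->vaddr)
			n = n->rb_left;
		else if (vaddr > ubp->vaddr)
			n = n->rb_right;
		else
			return ubp;	/* hit before: slot may be reusable */
	}
	return NULL;	/* first hit: allocate a node and an insn_slot */
}

On a repeated hit, if the cached slot still holds the same instructions
as those to be executed out-of-line (e.g. a memcmp() of the two copies),
both the write into the slot and the caches_clean_inval_pou() flush can
be skipped; otherwise the slot is written and flushed as before.
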
>>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
>>> ---
>>>  include/linux/uprobes.h |   3 +
>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>
>
> [...]

-- 
BR
Liao, Chang