Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




在 2024/8/9 2:26, Andrii Nakryiko 写道:
> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>
>> Hi Andrii and Oleg.
>>
>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>> within the mailing list. Considering this interest, I've added you and other relevant
>> maintainers to the CC list for broader visibility and potential collaboration.
>>
> 
> Hi Liao,
> 
> As you can see there is an active work to improve uprobes, that
> changes lifetime management of uprobes, removes a bunch of locks taken
> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> hold off a bit with your changes until all that lands. And then
> re-benchmark, as costs might shift.
> 
> But also see some remarks below.
> 
>> Thanks.
>>
>> 在 2024/7/27 17:44, Liao Chang 写道:
>>> The profiling result of single-thread model of selftests bench reveals
>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>
>>> This patch introduce struct uprobe_breakpoint to track previously
>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>> need for redundant insn_slot writes and subsequent expensive cache
>>> flush, especially on architecture like ARM64. This patch has been tested
>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>> bench and Redis GET/SET benchmark result below reveal obivious
>>> performance gain.
>>>
>>> before-opt
>>> ----------
>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> 
> I'm surprised that nop and push variants are much slower than ret
> variant. This is exactly opposite on x86-64. Do you have an
> explanation why this might be happening? I see you are trying to
> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> variant of uprobe that normally shouldn't be used. Typically uprobe is
> installed on nop (for USDT) and on function entry (which would be push
> variant, `push %rbp` instruction).
> 
> ret variant, for x86-64, causes one extra step to go back to user
> space to execute original instruction out-of-line, and then trapping
> back to kernel for running uprobe. Which is what you normally want to
> avoid.
> 
> What I'm getting at here. It seems like maybe arm arch is missing fast
> emulated implementations for nops/push or whatever equivalents for
> ARM64 that is. Please take a look at that and see why those are slow
> and whether you can make those into fast uprobe cases?

Hi Andrii,

As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
compared to X86 behavior. My investigation revealed that the root cause lies in
the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
to handle these instructions out-of-line in userspace upon breakpoint exception
is handled, leading to a significant performance overhead compared to 'ret' variant,
which is already emulated.

To address this issue, I've developed a patch supports  the emulation of 'nop' and
'push' variants. The benchmark results below indicates the performance gain of
emulation is obivious.

xol (1 cpus)
------------
uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
-------------------
uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)

As you can see, the performance gap between nop/push and ret variants has been significantly
reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
more cycles than the other.

> 
>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
>>> Redis SET (RPS) uprobe: 42728.52
>>> Redis GET (RPS) uprobe: 43640.18
>>> Redis SET (RPS) uretprobe: 40624.54
>>> Redis GET (RPS) uretprobe: 41180.56
>>>
>>> after-opt
>>> ---------
>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>> Redis SET (RPS) uprobe: 43939.69
>>> Redis GET (RPS) uprobe: 45200.80
>>> Redis SET (RPS) uretprobe: 41658.58
>>> Redis GET (RPS) uretprobe: 42805.80
>>>
>>> While some uprobes might still need to share the same insn_slot, this
>>> patch compare the instructions in the resued insn_slot with the
>>> instructions execute out-of-line firstly to decides allocate a new one
>>> or not.
>>>
>>> Additionally, this patch use a rbtree associated with each thread that
>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
>>> contention, it result in faster lookup times compared to find_uprobe().
>>>
>>> The other part of this patch are some necessary memory management for
>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
>>> uprobe_breakpoints will be freed when thread exit.
>>>
>>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
>>> ---
>>>  include/linux/uprobes.h |   3 +
>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>
> 
> [...]

-- 
BR
Liao, Chang




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux