On 2024/8/15 0:57, Andrii Nakryiko wrote:
> On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>
>> On 2024/8/13 1:49, Andrii Nakryiko wrote:
>>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>>>
>>>> On 2024/8/9 2:26, Andrii Nakryiko wrote:
>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> Hi Andrii and Oleg.
>>>>>>
>>>>>> This patch, which I sent two weeks ago, also aims to optimize the
>>>>>> performance of uprobes on arm64. I noticed recent discussions on the
>>>>>> performance and scalability of uprobes on the mailing list.
>>>>>> Considering this interest, I've added you and other relevant
>>>>>> maintainers to the CC list for broader visibility and potential
>>>>>> collaboration.
>>>>>>
>>>>>
>>>>> Hi Liao,
>>>>>
>>>>> As you can see, there is active work to improve uprobes that changes
>>>>> the lifetime management of uprobes, removes a bunch of locks taken in
>>>>> the uprobe/uretprobe hot path, etc. It would be nice if you could hold
>>>>> off a bit with your changes until all that lands, and then
>>>>> re-benchmark, as costs might shift.
>>>>>
>>>>> But also see some remarks below.
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On 2024/7/27 17:44, Liao Chang wrote:
>>>>>>> The profiling result of the single-threaded mode of the selftests
>>>>>>> bench reveals performance bottlenecks in find_uprobe() and
>>>>>>> caches_clean_inval_pou() on ARM64. On my local testing machine, 5%
>>>>>>> of CPU time is consumed by find_uprobe() for trig-uprobe-ret, while
>>>>>>> caches_clean_inval_pou() takes about 34% of CPU time for
>>>>>>> trig-uprobe-nop and trig-uprobe-push.
>>>>>>>
>>>>>>> This patch introduces struct uprobe_breakpoint to track the
>>>>>>> previously allocated insn_slot for frequently hit uprobes. It
>>>>>>> effectively reduces redundant insn_slot writes and the subsequent
>>>>>>> expensive cache flushes, especially on architectures like ARM64.
>>>>>>> This patch has been tested on Kunpeng916 (Hi1616), 4 NUMA nodes,
>>>>>>> 64 cores @ 2.4GHz. The selftests bench and Redis GET/SET benchmark
>>>>>>> results below reveal an obvious performance gain.
>>>>>>>
>>>>>>> before-opt
>>>>>>> ----------
>>>>>>> trig-uprobe-nop: 0.371 ± 0.001M/s (0.371M/prod)
>>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>>>> trig-uprobe-ret: 1.637 ± 0.001M/s (1.647M/prod)
>>>>>
>>>>> I'm surprised that the nop and push variants are much slower than the
>>>>> ret variant. This is exactly the opposite on x86-64. Do you have an
>>>>> explanation why this might be happening? I see you are trying to
>>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>>>> variant of uprobe that normally shouldn't be used. Typically a uprobe
>>>>> is installed on a nop (for USDT) or on function entry (which would be
>>>>> the push variant, a `push %rbp` instruction).
>>>>>
>>>>> The ret variant, for x86-64, causes one extra step: going back to user
>>>>> space to execute the original instruction out-of-line, and then
>>>>> trapping back into the kernel to run the uprobe. Which is what you
>>>>> normally want to avoid.
>>>>>
>>>>> What I'm getting at here: it seems like maybe the arm arch is missing
>>>>> fast emulated implementations for nop/push, or whatever the ARM64
>>>>> equivalents are. Please take a look at that and see why those are slow
>>>>> and whether you can make those into fast uprobe cases?
>>>>
>>>> Hi Andrii,
>>>>
>>>> As you correctly pointed out, the benchmark result on Arm64 is
>>>> counterintuitive compared to the X86 behavior. My investigation
>>>> revealed that the root cause lies in arch_uprobe_analyze_insn(), which
>>>> excludes the Arm64 equivalents of the 'nop' and 'push' instructions
>>>> from the emulatable instruction list. This forces the kernel to handle
>>>> these instructions out-of-line in userspace after the breakpoint
>>>> exception is handled, leading to significant performance overhead
>>>> compared to the 'ret' variant, which is already emulated.
>>>>
>>>> To address this issue, I've developed a patch that supports emulation
>>>> of the 'nop' and 'push' variants. The benchmark results below indicate
>>>> that the performance gain from emulation is obvious.
>>>>
>>>> xol (1 cpus)
>>>> ------------
>>>> uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod)
>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>> uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod)
>>>> uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod)
>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>> uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod)
>>>>
>>>> emulation (1 cpus)
>>>> -------------------
>>>> uprobe-nop: 1.862 ± 0.002M/s (1.862M/s/cpu)
>>>> uprobe-push: 1.743 ± 0.006M/s (1.743M/s/cpu)
>>>> uprobe-ret: 1.840 ± 0.001M/s (1.840M/s/cpu)
>>>> uretprobe-nop: 0.964 ± 0.004M/s (0.964M/s/cpu)
>>>> uretprobe-push: 0.936 ± 0.004M/s (0.936M/s/cpu)
>>>> uretprobe-ret: 0.940 ± 0.001M/s (0.940M/s/cpu)
>>>>
>>>> As you can see, the performance gap between the nop/push and ret
>>>> variants has been significantly reduced. Because emulating the 'push'
>>>> instruction needs to access userspace memory, it spends more cycles
>>>> than the others.
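For reference, the emulation support has roughly the following shape (an
illustrative sketch only, not the actual patch; the hook signature and the
names simulate_nop(), aarch64_insn_is_nop() and INSN_GOOD_NO_SLOT are
simplified from the arm64 probes decode path in
arch/arm64/kernel/probes/decode-insn.c and may not match it exactly):

static void simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
{
	/* NOP has no architectural side effect; just advance the PC. */
	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
}

enum probe_insn arm_probe_decode_insn(u32 insn, struct arch_probe_insn *api)
{
	/*
	 * Classifying NOP as emulatable means hitting the uprobe needs
	 * neither an insn_slot write nor the expensive cache flush.
	 */
	if (aarch64_insn_is_nop(insn)) {
		api->handler = simulate_nop;
		return INSN_GOOD_NO_SLOT;
	}

	/* ... existing classification; 'push' is handled similarly ... */
	return INSN_GOOD;	/* fall back to out-of-line single-step */
}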
>>>
>>> Great, it's an obvious improvement. Are you going to send patches
>>> upstream? Please cc bpf@xxxxxxxxxxxxxxx as well.
>>
>> I'll need more time to thoroughly test this patch. The emulation of
>> push/nop instructions also impacts the kprobe/kretprobe paths on Arm64.
>> As a result, I'm working on enhancements to trig-kprobe/kretprobe to
>> prevent a performance regression.
>>
>>> I'm also thinking we should update the uprobe/uretprobe benchmarks to
>>> be less x86-specific. Right now "-nop" is the happy fastest case,
>>> "-push" is the still happy, slightly slower case (due to the need to
>>> emulate the stack operation), and "-ret" is meant to be the slow
>>> single-step case. We should adjust the naming and make sure that on
>>> ARM64 we hit similar code paths. Given you seem to know arm64 pretty
>>> well, can you please take a look at updating the bench tool for ARM64
>>> (we can also rename benchmarks to something a bit more generic, rather
>>> than using instruction names)?
>>
>> Let me use the matrix below for a structured comparison of the
>> uprobe/uretprobe benchmarks on X86 and Arm64:
>>
>> Architecture  Instruction type  Handling method  Performance
>> X86           nop               Emulated         Fastest
>> X86           push              Emulated         Fast
>> X86           ret               Single-step      Slow
>> Arm64         nop               Emulated         Fastest
>> Arm64         push              Emulated         Fast
>> Arm64         ret               Emulated         Faster
>>
>> I suggest categorizing the benchmarks into 'emu' for emulated
>> instructions and 'ss' for single-steppable instructions. Generally,
>> emulated instructions should outperform single-stepped ones across
>> different architectures. Regarding the generic naming, I propose a
>> self-explanatory style, such as s/nop/empty-insn/g, s/push/push-stack/g,
>> s/ret/func-return/g.
>>
>> Putting it all together, example "bench --list" output:
>>
>> X86:
>> ...
>> trig-uprobe-emu-empty-insn
>> trig-uprobe-ss-func-return
>> trig-uprobe-emu-push-stack
>> trig-uretprobe-emu-empty-insn
>> trig-uretprobe-ss-func-return
>> trig-uretprobe-emu-push-stack
>> ...
>>
>> Arm64:
>> ...
>> trig-uprobe-emu-empty-insn
>> trig-uprobe-emu-func-return
>> trig-uprobe-emu-push-stack
>> trig-uretprobe-emu-empty-insn
>> trig-uretprobe-emu-func-return
>> trig-uretprobe-emu-push-stack
>> ...
>>
>> This structure will allow for a direct comparison of uprobe/uretprobe
>> performance across different architectures and instruction types.
>> Please let me know your thoughts, Andrii.
>
> Tbh, sounds a bit like overkill. But before we decide on naming, what
> kind of situations are single-stepped on arm64?

On Arm64, the following instruction types are generally not
single-steppable (see the classification sketch after this list):

- Instructions that modify or read the PC, including 'ret' and various
  branch instructions.
- Instructions that form a PC-relative address from the PC and an
  immediate value.
- Instructions that generate exceptions, including BRK, HLT, HVC, SMC,
  and SVC.
- Loads from an address calculated from the PC and an immediate offset.
- Instructions that move general-purpose registers to system registers of
  the PE (similar to a logical core on x86).
- Hint instructions that cause exceptions or unintended behavior when
  single-stepped; 'nop', however, is a steppable hint.

Most instructions that don't fall into any of these types are
single-stepped, including the Arm64 equivalents of 'push'.
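The classification boils down to something like the sketch below (a
simplified illustration with hypothetical helper names; the kernel's real
decode logic lives in arch/arm64/kernel/probes/decode-insn.c):

static bool insn_is_single_steppable(u32 insn)
{
	if (insn_reads_or_writes_pc(insn))	/* ret, b, bl, cbz, ... */
		return false;
	if (insn_is_pc_relative(insn))		/* adr/adrp, literal loads */
		return false;
	if (insn_generates_exception(insn))	/* brk, hlt, hvc, smc, svc */
		return false;
	if (insn_accesses_sysreg(insn))		/* mrs/msr on PE system registers */
		return false;
	if (insn_is_hint(insn))			/* hints are unsafe to step... */
		return insn_is_nop(insn);	/* ...except nop-like ones */
	return true;	/* everything else, including the 'push' equivalents */
}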
Thanks.

>
>>
>> Thanks.
>>
>>>
>>>>
>>>>>
>>>>>>> trig-uretprobe-nop: 0.331 ± 0.004M/s (0.331M/prod)
>>>>>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>>>>>> trig-uretprobe-ret: 0.854 ± 0.002M/s (0.854M/prod)
>>>>>>> Redis SET (RPS) uprobe: 42728.52
>>>>>>> Redis GET (RPS) uprobe: 43640.18
>>>>>>> Redis SET (RPS) uretprobe: 40624.54
>>>>>>> Redis GET (RPS) uretprobe: 41180.56
>>>>>>>
>>>>>>> after-opt
>>>>>>> ---------
>>>>>>> trig-uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod)
>>>>>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>>>> trig-uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod)
>>>>>>> trig-uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod)
>>>>>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>>>> trig-uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod)
>>>>>>> Redis SET (RPS) uprobe: 43939.69
>>>>>>> Redis GET (RPS) uprobe: 45200.80
>>>>>>> Redis SET (RPS) uretprobe: 41658.58
>>>>>>> Redis GET (RPS) uretprobe: 42805.80
>>>>>>>
>>>>>>> While some uprobes might still need to share the same insn_slot,
>>>>>>> this patch compares the instructions in the reused insn_slot with
>>>>>>> the instructions to execute out-of-line first, to decide whether to
>>>>>>> allocate a new one.
>>>>>>>
>>>>>>> Additionally, this patch uses an rbtree, associated with each thread
>>>>>>> that hits uprobes, to manage the allocated uprobe_breakpoint data.
>>>>>>> Because the rbtree of uprobe_breakpoints has smaller nodes, better
>>>>>>> locality, and less contention, it results in faster lookups compared
>>>>>>> to find_uprobe().
>>>>>>>
>>>>>>> The rest of this patch is the necessary memory management for the
>>>>>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each
>>>>>>> newly hit uprobe that doesn't already have a corresponding node in
>>>>>>> the rbtree. All uprobe_breakpoints are freed when the thread exits.
>>>>>>>
>>>>>>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
>>>>>>> ---
>>>>>>>  include/linux/uprobes.h |   3 +
>>>>>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>>>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>>>>>
>>>>>
>>>>> [...]
>>>>
>>>> --
>>>> BR
>>>> Liao, Chang
>>
>> --
>> BR
>> Liao, Chang

--
BR
Liao, Chang