Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

"Liao, Chang" <liaochang1@xxxxxxxxxx> · Wed, 21 Aug 2024 16:46:47 +0800

在 2024/8/16 0:53, Andrii Nakryiko 写道:
> On Wed, Aug 14, 2024 at 7:58 PM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>
>>
>>
>> 在 2024/8/15 2:42, Andrii Nakryiko 写道:
>>> On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
>>>>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Hi Andrii and Oleg.
>>>>>>>>
>>>>>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>>>>>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>>>>>>>> within the mailing list. Considering this interest, I've added you and other relevant
>>>>>>>> maintainers to the CC list for broader visibility and potential collaboration.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Liao,
>>>>>>>
>>>>>>> As you can see there is an active work to improve uprobes, that
>>>>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>>>>> hold off a bit with your changes until all that lands. And then
>>>>>>> re-benchmark, as costs might shift.
>>>>>>>
>>>>>>> But also see some remarks below.
>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>>>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>>>>>
>>>>>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>>>>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>>>>>> flush, especially on architecture like ARM64. This patch has been tested
>>>>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>>>>>> performance gain.
>>>>>>>>>
>>>>>>>>> before-opt
>>>>>>>>> ----------
>>>>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>>>>>
>>>>>>> I'm surprised that nop and push variants are much slower than ret
>>>>>>> variant. This is exactly opposite on x86-64. Do you have an
>>>>>>> explanation why this might be happening? I see you are trying to
>>>>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>>>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>>>>>> installed on nop (for USDT) and on function entry (which would be push
>>>>>>> variant, `push %rbp` instruction).
>>>>>>>
>>>>>>> ret variant, for x86-64, causes one extra step to go back to user
>>>>>>> space to execute original instruction out-of-line, and then trapping
>>>>>>> back to kernel for running uprobe. Which is what you normally want to
>>>>>>> avoid.
>>>>>>>
>>>>>>> What I'm getting at here. It seems like maybe arm arch is missing fast
>>>>>>> emulated implementations for nops/push or whatever equivalents for
>>>>>>> ARM64 that is. Please take a look at that and see why those are slow
>>>>>>> and whether you can make those into fast uprobe cases?
>>>>>>
>>>>>> Hi Andrii,
>>>>>>
>>>>>> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
>>>>>> compared to X86 behavior. My investigation revealed that the root cause lies in
>>>>>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
>>>>>> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
>>>>>> to handle these instructions out-of-line in userspace upon breakpoint exception
>>>>>> is handled, leading to a significant performance overhead compared to 'ret' variant,
>>>>>> which is already emulated.
>>>>>>
>>>>>> To address this issue, I've developed a patch supports  the emulation of 'nop' and
>>>>>> 'push' variants. The benchmark results below indicates the performance gain of
>>>>>> emulation is obivious.
>>>>>>
>>>>>> xol (1 cpus)
>>>>>> ------------
>>>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>>>
>>>>>> emulation (1 cpus)
>>>>>> -------------------
>>>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
>>>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
>>>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
>>>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
>>>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
>>>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
>>>>>>
>>>>>> As you can see, the performance gap between nop/push and ret variants has been significantly
>>>>>> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
>>>>>> more cycles than the other.
>>>>>
>>>>> Great, it's an obvious improvement. Are you going to send patches
>>>>> upstream? Please cc bpf@xxxxxxxxxxxxxxx as well.
>>>>
>>>> I'll need more time to thoroughly test this patch. The emulation o push/nop
>>>> instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
>>>> I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
>>>> regression.
>>>
>>> Why would the *benchmarks* have to be modified? The typical
>>> kprobe/kretprobe attachment should be fast, and those benchmarks
>>> simulate typical fast path kprobe/kretprobe. Is there some simulation
>>> logic that is shared between uprobes and kprobes or something?
>>
>> Yes, kprobe and uprobe share many things for Arm64, but there are curical
>> difference. Let me explain further. Simulating a 'push' instruction on
>> arm64 will modify the stack pointer at *probe breakpoint. However, kprobe
>> and uprobe use different way to restore the stack pointer upon returning
>> from the breakpoint exception. Consequently.sharing the same simulation
>> logic for both would result in kernel panic for kprobe.
>>
>> To avoid complicating the exception return logic, I've opted to simuate
>> 'push' only for uprobe and maintain the single-stepping for kprobe [0].
>> This trade-off avoid the impacts to kprobe/kretprobe, and no need to
>> change the kprobe/kretprobe related benchmark.
>>
> 
> I see, thanks for explaining. I noticed the "bool kernel" flag you
> added, it makes sense.
> 
> I still don't understand why you'd need to modify kprobe/kretprobe
> benchmarks, as they are testing attaching kprobe at the kernel
> function entry, which for kernels should be an optimized case not
> requiring any emulation.

Sorry about the confusion. I've revised the implementation of nop/push
emulation to avoid the impacts the kprobe/kretprobe on Arm64, see the
change to arm_probe_decode_insn() in the patch [0]. As a result, no
changes to the kprobe/kretprobe benchmarks.

[0] https://lore.kernel.org/all/20240814080356.2639544-1-liaochang1@xxxxxxxxxx/

Thanks.

> 
>> [0] https://lore.kernel.org/all/20240814080356.2639544-1-liaochang1@xxxxxxxxxx/
>>
>>>
>>>>
>>>>>
>>>>>
>>>>> I'm also thinking we should update uprobe/uretprobe benchmarks to be
>>>>> less x86-specific. Right now "-nop" is the happy fastest case, "-push"
>>>>> is still happy, slightly slower case (due to the need to emulate stack
>>>>> operation) and "-ret" is meant to be the slow single-step case. We
>>>>> should adjust the naming and make sure that on ARM64 we hit similar
>>>>> code paths. Given you seem to know arm64 pretty well, can you please
>>>>> take a look at updating bench tool for ARM64 (we can also rename
>>>>> benchmarks to something a bit more generic, rather than using
>>>>> instruction names)?
>>>>
>>>
>>> [...]
>>
>> --
>> BR
>> Liao, Chang
> 
> 

-- 
BR
Liao, Chang