Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack

"Liao, Chang" <liaochang1@xxxxxxxxxx> · Wed, 11 Sep 2024 11:06:57 +0800

On 2024/9/11 4:54, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 11:14 PM Liao Chang <liaochang1@xxxxxxxxxx> wrote:
>>
>> This patch is the second part of a series to improve the selftest bench
>> of uprobe/uretprobe [0]. The lack of simulation for 'stp fp, lr, [sp, #imm]'
>> significantly impacts uprobe/uretprobe performance at function entry in
>> most use cases. The profiling results below reveal that an STP that
>> executes in the xol slot and traps back to the kernel reduces Redis RPS
>> and noticeably increases the time of a string grep.
>>
>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>
>> Redis GET (higher is better)
>> ----------------------------
>> No uprobe: 49149.71 RPS
>> Single-stepped STP: 46750.82 RPS
>> Emulated STP: 48981.19 RPS
>>
>> Redis SET (larger is better)
>> ----------------------------
>> No uprobe: 49761.14 RPS
>> Single-stepped STP: 45255.01 RPS
>> Emulated STP: 48619.21 RPS
>>
>> Grep (lower is better)
>> ----------------------
>> No uprobe: 2.165s
>> Single-stepped STP: 15.314s
>> Emulated STP: 2.216s
>>
>> Additionally, profiling the entry instruction of all leaf and non-leaf
>> functions shows that 'stp fp, lr, [sp, #imm]' accounts for more than
>> 50% of them. So simulating the STP at function entry is a more viable
>> option for uprobe.
>>
>> The first version [1] used a uaccess routine to simulate the STP that
>> pushes fp/lr onto the stack, which uses two STTR instructions for the
>> memory store. But as Mark pointed out, this approach can't simulate the
>> correct single-copy atomicity and ordering properties of STP, especially
>> when it interacts with MTE, POE, etc. So this patch uses a more complex
> 
> Do all those effects matter if the thread is stopped after
> breakpoint? This is pushing to stack, right? Other threads are not
> supposed to access that memory anyways (not the well-defined ones, at
> least, I suppose). Do we really need all these complications for

I raised the same question in my reply to Mark. The STP simulation
targets uprobes/uretprobes at function entry, where two registers are
pushed onto the *stack*, so I believe it might not need to match the
exact properties of STP strictly. However, as you know, Mark stands by
his comments about STP simulation, which is why I sent this patch out.
Although the gain is not as good as with the uaccess version, it still
offers better results than the current XOL code.

> uprobes? We use a similar approach in x86-64, see emulate_push_stack()
> in arch/x86/kernel/uprobes.c and it works great in practice (and has

Yes, I've noticed the x86 routine. Actually, the CPU-specific difference
lies in Arm64 CPUs with PAN enabled. For security reasons, the kernel
cannot use a regular STP (store pair of registers) to access userspace
addresses there. As a result, the kernel has to use the STTR instruction
(store register, unprivileged) twice, which can't meet the atomicity and
ordering properties of the original STP in userspace. In the future, if
Arm64 adds an instruction for storing a pair of registers to unprivileged
memory, it ought to replace this inefficient approach.
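
To make the semantics concrete, here is a rough userspace sketch of what
simulating 'stp x29, x30, [sp, #imm]!' has to do: decode the scaled imm7,
store fp and lr, and write back sp. It is not the code from the patch.
struct fake_regs, fake_stack and both function names are made up for
illustration, and a flat array stands in for the user stack. The two
separate stores mirror the "two STTR" situation described above and do
not model the single-copy atomicity or ordering of a real STP.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified register file; NOT the kernel's struct pt_regs. */
struct fake_regs {
	uint64_t x[31];		/* x0..x30; x29 = fp, x30 = lr */
	uint64_t sp;		/* here: a byte offset into fake_stack */
};

/* Stand-in for user stack memory (assumption: a flat 256-byte array). */
#define FAKE_STACK_SIZE 256
static uint8_t fake_stack[FAKE_STACK_SIZE];

/* STP (64-bit): imm7 lives in bits [21:15], is signed, and is scaled by 8. */
static int64_t stp_imm7_to_offset(uint32_t insn)
{
	int32_t imm7 = (insn >> 15) & 0x7f;

	if (imm7 & 0x40)	/* sign-extend the 7-bit field */
		imm7 -= 0x80;
	return (int64_t)imm7 * 8;
}

/*
 * Simulate the pre-index form: compute sp + imm, store fp then lr,
 * then write the new address back to sp. Returns 0 on success and -1
 * if the target falls outside the fake stack (sp is left untouched).
 */
static int simulate_stp_fp_lr_pre(struct fake_regs *regs, uint32_t insn)
{
	uint64_t addr;

	assert(regs != NULL);
	addr = regs->sp + (uint64_t)stp_imm7_to_offset(insn);

	/* Unsigned wrap-around makes a "negative" address fail this check. */
	if (addr > FAKE_STACK_SIZE - 16)
		return -1;

	/* Two independent 8-byte stores, like two STTRs: not pair-atomic. */
	memcpy(&fake_stack[addr], &regs->x[29], 8);
	memcpy(&fake_stack[addr + 8], &regs->x[30], 8);

	regs->sp = addr;	/* pre-index write-back */
	return 0;
}
```

For example, 0xa9bf7bfd encodes 'stp x29, x30, [sp, #-16]!'. Starting from
sp = 64, the sketch stores fp at offset 48 and lr at offset 56, and leaves
sp at 48.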

> been for years by now). Would be nice to keep things simple knowing
> that this is specifically for this rather well-defined and restricted
> uprobe/uretprobe use case.
> 
> Sorry, I can't help reviewing this, but I have a hunch that we might
> be over-killing it with this approach, no?

Indeed, this approach fails to obtain the maximum benefit from simulation.

> 
> 
>> and inefficient approach that acquires the user stack pages, maps them
>> into the kernel address space, and allows the kernel to use STP directly
>> to push fp/lr into the stack pages.
>>
>> xol-stp
>> -------
>> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
>> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
>> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
>>
>> simulated-stp
>> -------------
>> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
>> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
>> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
>>
>> The profiling results, based on the upstream kernel with the spinlock
>> optimization patches [2], reveal that simulating STP increases the
>> uprobe-push throughput by about 30% (from 0.868M/s/cpu to 1.128M/s/cpu)
>> and uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
>>
>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@xxxxxxxxxxxxxx/
>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@xxxxxxxxxx/
>>
>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
>> ---
>>  arch/arm64/include/asm/insn.h            |  1 +
>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>>  arch/arm64/lib/insn.c                    |  5 ++
>>  7 files changed, 134 insertions(+)
>>
> 
> [...]
> 
> 

-- 
BR
Liao, Chang