Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 10 Sep 2024 13:54:00 -0700

On Mon, Sep 9, 2024 at 11:14 PM Liao Chang <liaochang1@xxxxxxxxxx> wrote:
>
> This patch is the second part of a series to improve the selftest bench
> of uprobe/uretprobe [0]. The lack of simulation for
> 'stp fp, lr, [sp, #imm]' significantly impacts uprobe/uretprobe
> performance at function entry in most use cases. The profiling results
> below reveal that executing the STP in the xol slot and trapping back to
> the kernel reduces Redis RPS and noticeably increases the time of a
> string grep.
>
> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>
> Redis GET (higher is better)
> ----------------------------
> No uprobe: 49149.71 RPS
> Single-stepped STP: 46750.82 RPS
> Emulated STP: 48981.19 RPS
>
> Redis SET (larger is better)
> ----------------------------
> No uprobe: 49761.14 RPS
> Single-stepped STP: 45255.01 RPS
> Emulated STP: 48619.21 RPS
>
> Grep (lower is better)
> ----------------------
> No uprobe: 2.165s
> Single-stepped STP: 15.314s
> Emulated STP: 2.216s
>
> Additionally, profiling the entry instruction of all leaf and non-leaf
> functions shows that 'stp fp, lr, [sp, #imm]' accounts for more than
> 50% of them. So simulating the STP at function entry is a more viable
> option for uprobe.
>
> In the first version [1], it used a uaccess routine to simulate the STP
> that pushes fp/lr onto the stack, which uses two STTR instructions for
> the memory store. But as Mark pointed out, this approach can't simulate
> the correct single-atomicity and ordering properties of STP, especially
> when it interacts with MTE, POE, etc. So this patch uses a more complex

Do all those effects matter if the thread is stopped after the
breakpoint? This is pushing to the stack, right? Other threads are not
supposed to access that memory anyway (not the well-defined ones, at
least, I suppose). Do we really need all these complications for
uprobes? We use a similar approach on x86-64, see emulate_push_stack()
in arch/x86/kernel/uprobes.c, and it works great in practice (and has
for years by now). It would be nice to keep things simple, knowing
that this is specifically for this rather well-defined and restricted
uprobe/uretprobe use case.

Sorry, I can't help with reviewing this, but I have a hunch that we
might be over-killing it with this approach, no?
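
For illustration only, the simpler style of emulation discussed above
could look roughly like the user-space C sketch below. It is not the
patch's code and not a kernel API: struct fake_regs, fake_stack,
STACK_BASE and all function names are hypothetical, and the sketch
assumes the pre-indexed form 'stp x29, x30, [sp, #imm]!'. It
deliberately ignores the single-copy atomicity, ordering, MTE/POE and
fault-handling concerns raised in the quoted text; it only models the
register and memory effects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the probed task's registers (not pt_regs). */
struct fake_regs {
	uint64_t x[31];		/* x0..x30; x29 = fp, x30 = lr */
	uint64_t sp;
	uint64_t pc;
};

/* Hypothetical stand-in for the user stack, "mapped" at STACK_BASE. */
#define STACK_BASE	0x1000u
#define STACK_SIZE	256u
static uint8_t fake_stack[STACK_SIZE];

/* Match 64-bit pre-indexed "stp x29, x30, [sp, #imm]!" for any imm7. */
static bool is_stp_fp_lr_pre(uint32_t insn)
{
	return (insn & 0xffc07fffu) == 0xa9807bfdu;
}

/* imm7 lives in bits [21:15]; sign-extend and scale by 8 bytes. */
static int64_t stp_offset(uint32_t insn)
{
	int64_t imm7 = (insn >> 15) & 0x7f;

	if (imm7 & 0x40)
		imm7 -= 0x80;
	return imm7 * 8;
}

/*
 * Emulate the push: store fp at the new sp and lr at new sp + 8, write
 * the new sp back, and step pc past the instruction. Returns false and
 * leaves state untouched if the instruction or address is not handled.
 */
static bool simulate_stp_fp_lr(uint32_t insn, struct fake_regs *regs)
{
	uint64_t addr;

	assert(regs != NULL);
	if (!is_stp_fp_lr_pre(insn))
		return false;

	addr = regs->sp + (uint64_t)stp_offset(insn);
	if (addr < STACK_BASE || addr > STACK_BASE + STACK_SIZE - 16)
		return false;	/* outside the modeled stack */

	/* Two plain 8-byte stores: NOT single-copy atomic as a pair. */
	memcpy(&fake_stack[addr - STACK_BASE], &regs->x[29], 8);
	memcpy(&fake_stack[addr - STACK_BASE + 8], &regs->x[30], 8);

	regs->sp = addr;	/* pre-index writeback */
	regs->pc += 4;		/* skip the emulated instruction */
	return true;
}
```

Calling simulate_stp_fp_lr(0xa9bf7bfd, &regs), i.e. the encoding of
'stp x29, x30, [sp, #-16]!', lowers regs.sp by 16 and leaves fp and lr
in the two 8-byte slots at the new stack pointer. In a real kernel the
two stores would go through uaccess helpers and could fault, which this
sketch does not model.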

> and inefficient approach that acquires user stack pages, maps them to
> kernel address space, and allows the kernel to use STP directly to push
> fp/lr into the stack pages.
>
> xol-stp
> -------
> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
>
> simulated-stp
> -------------
> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
>
> The profiling results, based on the upstream kernel with the spinlock
> optimization patches [2], reveal that simulating STP increases the
> uprobe-push throughput by 30.0% (from 0.868M/s/cpu to 1.128M/s/cpu) and
> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
>
> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@xxxxxxxxxxxxxx/
> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@xxxxxxxxxx/
>
> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
> ---
>  arch/arm64/include/asm/insn.h            |  1 +
>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>  arch/arm64/lib/insn.c                    |  5 ++
>  7 files changed, 134 insertions(+)
>

[...]