Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack

On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
> > This patch is the second part of a series to improve the selftest bench
> > of uprobe/uretprobe [0]. The lack of simulation for 'stp fp, lr, [sp, #imm]'
> > significantly impacts uprobe/uretprobe performance at function entry in
> > most use cases. The profiling results below show that executing the STP
> > out-of-line in the XOL slot and trapping back into the kernel reduces
> > Redis RPS and noticeably increases the run time of a string grep.
> >
> > On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >
> > Redis GET (higher is better)
> > ----------------------------
> > No uprobe: 49149.71 RPS
> > Single-stepped STP: 46750.82 RPS
> > Emulated STP: 48981.19 RPS
> >
> > Redis SET (higher is better)
> > ----------------------------
> > No uprobe: 49761.14 RPS
> > Single-stepped STP: 45255.01 RPS
> > Emulated STP: 48619.21 RPS
> >
> > Grep (lower is better)
> > ----------------------
> > No uprobe: 2.165s
> > Single-stepped STP: 15.314s
> > Emulated STP: 2.216s
>
> The results for grep are concerning.
>
> In theory, the overhead for stepping should be roughly double the
> overhead for emulating, assuming exception entry and exception return
> are the dominant cost. The cost of the step itself should be trivial.
>
> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
> more expensive.
>
> Was this tested bare-metal, or in a VM?

Hey Mark, I hope Liao will have a chance to reply; I don't know the
details of his benchmarking. But I can try to give you my numbers and
answer a few questions, which hopefully helps move the conversation
forward.

So, first of all, I did a quick benchmark on bare metal (without
Liao's optimization, though); here are my results:

uprobe-nop            ( 1 cpus):    2.334 ± 0.011M/s  (  2.334M/s/cpu)
uprobe-push           ( 1 cpus):    2.321 ± 0.010M/s  (  2.321M/s/cpu)
uprobe-ret            ( 1 cpus):    4.144 ± 0.041M/s  (  4.144M/s/cpu)

uretprobe-nop         ( 1 cpus):    1.684 ± 0.004M/s  (  1.684M/s/cpu)
uretprobe-push        ( 1 cpus):    1.736 ± 0.003M/s  (  1.736M/s/cpu)
uretprobe-ret         ( 1 cpus):    2.502 ± 0.006M/s  (  2.502M/s/cpu)

uretprobes are inherently slower, so I'll just compare uprobes, as the
differences are very clear either way.

-nop traces a literal nop instruction (Liao solved that case, I just
don't have his patch applied on my test machine). -push traces a `stp
x29, x30, [sp, #-0x10]!` instruction. -ret traces a function that is
literally just a `ret` instruction.
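To make that concrete, here are hand-written arm64 stand-ins for what
those three probed entry points look like at the instruction level
(names are made up; the real triggers live in the BPF selftests bench):

asm(
"	.global trig_nop\n"
"trig_nop:\n"
"	nop\n"				/* uprobe-nop traps here */
"	ret\n"
"	.global trig_push\n"
"trig_push:\n"
"	stp	x29, x30, [sp, #-16]!\n"	/* uprobe-push targets this */
"	ldp	x29, x30, [sp], #16\n"
"	ret\n"
"	.global trig_ret\n"
"trig_ret:\n"
"	ret\n"				/* uprobe-ret: the function is just this */
);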

So you can see that -ret is almost twice as fast as the -push variant
(it's a microbenchmark, yes, but still).

>
> AFAICT either:
>
> * Single-stepping is unexpectedly expensive.
>
>   Historically we had performance issues with hypervisor trapping of
>   debug features, and there are things we might be able to improve in
>   the hypervisor and kernel, which would improve stepping *all*
>   instructions.
>

Single-stepping will always be more expensive, as it necessitates an
extra kernel->user space->kernel hop, so no matter how much
single-stepping is optimized, if we can avoid it, we should. The
difference will be noticeable.

>   If stepping is the big problem, we could move uprobes over to a BRK
> rather than a single-step. That would require updating and
>   fixing the logic to decide which instructions are steppable, but
>   that's necessary anyway given it has extant soundness issues.

I'm afraid I don't understand what BRK means and what the consequences
are in terms of overhead. I'm not an ARM person either, so sorry if
that's a stupid question, and either way I can't address this. But see
above: emulating an instruction feels like a much better approach, if
possible.

>
> * XOL management is absurdly expensive.
>
>   Does uprobes keep the XOL slot around (like kprobes does), or does it
>   create the slot afresh for each trap?

The XOL *page* is created once per process, lazily, and then we just
juggle a bunch of fixed slots there, one per instance of a
single-stepped uprobe. And yes, there are some bottlenecks in XOL
management, though they're mostly due to lock contention (as it is
implemented right now). Liao and Oleg have been improving XOL
management, but still, avoiding XOL in the first place is much
preferred.

>
>   If that's trying to create a slot afresh for each trap, there are
>   several opportunities for improvement, e.g. keep the slot around for
>   as long as the uprobe exists, or pre-allocate shared slots for common
>   instructions and use those.

As I mentioned, an XOL page is allocated and mapped once, but yes, it
seems like we dynamically grab a slot in it for each single-stepped
execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's
probably not a bad idea to just cache and hold an XOL slot for each
specific uprobe (sketched below); I don't see why we should limit
ourselves to just one XOL page. We also don't need to pre-size each
slot; we can probably allocate just the right amount of space for a
given uprobe.
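Just to illustrate the shape of that caching idea (purely
hypothetical, all names below are made up, and real code would need
proper synchronization):

struct cached_uprobe {
	unsigned long xol_vaddr;	/* 0 until the first single-step */
};

/* Take a slot once, then keep it for the lifetime of the uprobe, so
 * only the first hit pays for the slot search. */
static unsigned long uprobe_cached_slot(struct cached_uprobe *u,
					unsigned long (*take_slot)(void))
{
	if (!u->xol_vaddr)
		u->xol_vaddr = take_slot();
	return u->xol_vaddr;
}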

All good ideas for sure, and we should do them, IMO. But we'd still be
paying an extra kernel->user->kernel switch, which is almost certainly
slower than doing a simple stack push emulation just like we do in the
x86-64 case, no?
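For reference, here's roughly what that style of push emulation boils
down to (a simplified sketch in the spirit of
arch/x86/kernel/uprobes.c, not the actual code; the function name is
made up):

#include <linux/uaccess.h>
#include <asm/ptrace.h>

/* Simplified sketch: copy the value just below the current stack
 * pointer with copy_to_user() and adjust regs->sp. Note this is plain
 * uaccess, so on arm64 it would not preserve STP's atomicity/ordering
 * properties (Mark's earlier objection). */
static int sketch_emulate_push(struct pt_regs *regs, unsigned long val)
{
	unsigned long new_sp = regs->sp - sizeof(val);

	if (copy_to_user((void __user *)new_sp, &val, sizeof(val)))
		return -EFAULT;

	regs->sp = new_sp;
	return 0;
}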


BTW, I did a quick local profiling run. I don't think XOL management
is the main source of overhead. I see 5% of CPU cycles spent in
arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack
traces. There are also at least 22% of CPU cycles spent in some
local_daif_restore function; I'm not sure what that is, but it might
be related to interrupt handling, right?


The takeaway I'd like to communicate here is that avoiding the need
for single-stepping is *the best way* to go, IMO. So if we can emulate
those STP instructions for uprobes *cheaply*, that would be awesome.
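For reference, a minimal sketch of the pin-and-map approach the patch
below describes, as I understand it (my rough reconstruction, not the
actual patch; it assumes SP is 16-byte aligned, so the 16-byte store
can't cross a page boundary):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <asm/ptrace.h>

/* Rough reconstruction of simulating 'stp x29, x30, [sp, #-16]!': pin
 * the user stack page, map it into the kernel, and do the store with a
 * real STP so single-copy atomicity and ordering match the original
 * instruction (unlike an STTR-based uaccess copy). */
static int sketch_simulate_stp_push(struct pt_regs *regs)
{
	unsigned long new_sp = regs->sp - 16;
	struct page *page;
	void *dst;

	/* Pin the user stack page backing [new_sp, new_sp + 16). */
	if (get_user_pages_fast(new_sp & PAGE_MASK, 1, FOLL_WRITE, &page) != 1)
		return -EFAULT;

	dst = kmap_local_page(page) + (new_sp & ~PAGE_MASK);
	asm volatile("stp %1, %2, [%0]"
		     : : "r"(dst), "r"(regs->regs[29]), "r"(regs->regs[30])
		     : "memory");
	kunmap_local(dst);
	put_page(page);

	regs->sp = new_sp;
	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
	return 0;
}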

>
> Mark.
>
> >
> > Additionally, profiling the entry instruction of all leaf and
> > non-leaf functions shows that the ratio of 'stp fp, lr, [sp, #imm]' is
> > larger than 50%. So simulating the STP at function entry is a more
> > viable option for uprobes.
> >
> > In the first version [1], a uaccess routine was used to simulate the
> > STP that pushes fp/lr onto the stack, using a pair of STTR instructions
> > for the memory store. But as Mark pointed out, this approach can't
> > reproduce the correct single-copy atomicity and ordering properties of
> > STP, especially when it interacts with MTE, POE, etc. So this patch
> > uses a more complex and less efficient approach that acquires the user
> > stack pages, maps them into the kernel address space, and lets the
> > kernel use STP directly to push fp/lr into the stack pages.
> >
> > xol-stp
> > -------
> > uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
> > uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
> > uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
> > uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
> > uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
> > uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
> >
> > simulated-stp
> > -------------
> > uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
> > uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
> > uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
> > uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
> > uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
> > uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
> >
> > The profiling results, based on the upstream kernel with the spinlock
> > optimization patches [2], reveal that simulating STP increases the
> > uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.128M/s/cpu) and
> > uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
> >
> > [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@xxxxxxxxxxxxxx/
> > [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> > [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@xxxxxxxxxx/
> >
> > Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
> > ---
> >  arch/arm64/include/asm/insn.h            |  1 +
> >  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
> >  arch/arm64/kernel/probes/decode-insn.h   |  1 +
> >  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
> >  arch/arm64/kernel/probes/simulate-insn.h |  1 +
> >  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
> >  arch/arm64/lib/insn.c                    |  5 ++
> >  7 files changed, 134 insertions(+)
> >

[...]




