On Tue, Nov 5, 2024 at 4:22 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>
> Andrii and Mark.
>
> On 2024/10/26 4:51, Andrii Nakryiko wrote:
> > On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> >>
> >> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
> >>> This patch is the second part of a series to improve the selftest bench
> >>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
> >>> significantly impacts uprobe/uretprobe performance at function entry in
> >>> most use cases. The profiling results below reveal that executing the STP
> >>> in the xol slot and trapping back into the kernel noticeably reduces
> >>> Redis RPS and increases the run time of string grep.
> >>>
> >>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >>>
> >>> Redis GET (higher is better)
> >>> ----------------------------
> >>> No uprobe: 49149.71 RPS
> >>> Single-stepped STP: 46750.82 RPS
> >>> Emulated STP: 48981.19 RPS
> >>>
> >>> Redis SET (higher is better)
> >>> ----------------------------
> >>> No uprobe: 49761.14 RPS
> >>> Single-stepped STP: 45255.01 RPS
> >>> Emulated STP: 48619.21 RPS
> >>>
> >>> Grep (lower is better)
> >>> ----------------------
> >>> No uprobe: 2.165s
> >>> Single-stepped STP: 15.314s
> >>> Emulated STP: 2.216s
> >>
> >> The results for grep are concerning.
> >>
> >> In theory, the overhead for stepping should be roughly double the
> >> overhead for emulating, assuming the exception-entry and
> >> exception-return are the dominant cost. The cost of stepping should be
> >> trivial.
> >>
> >> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
> >> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
> >> more expensive.
> >>
> >> Was this tested bare-metal, or in a VM?
> >
> > Hey Mark, I hope Liao will have a chance to reply, I don't know the
> > details of his benchmarking. But I can try to give you my numbers and
> > maybe answer a few questions, hopefully that helps move the
> > conversation forward.
> >
> > So, first of all, I did a quick benchmark on bare metal (without
> > Liao's optimization, though), here are my results:
> >
> > uprobe-nop ( 1 cpus): 2.334 ± 0.011M/s ( 2.334M/s/cpu)
> > uprobe-push ( 1 cpus): 2.321 ± 0.010M/s ( 2.321M/s/cpu)
> > uprobe-ret ( 1 cpus): 4.144 ± 0.041M/s ( 4.144M/s/cpu)
> >
> > uretprobe-nop ( 1 cpus): 1.684 ± 0.004M/s ( 1.684M/s/cpu)
> > uretprobe-push ( 1 cpus): 1.736 ± 0.003M/s ( 1.736M/s/cpu)
> > uretprobe-ret ( 1 cpus): 2.502 ± 0.006M/s ( 2.502M/s/cpu)
> >
> > uretprobes are inherently slower, so I'll just compare uprobes, as the
> > differences are very clear either way.
> >
> > -nop is literally a nop (Liao solved that issue, I just don't have his
> > patch applied on my test machine). -push has the `stp x29, x30, [sp,
> > #-0x10]!` instruction traced. -ret is literally just a `ret`
> > instruction.
> >
> > So you can see that -ret is almost twice as fast as the -push variant
> > (it's a microbenchmark, yes, but still).
> >
> >>
> >> AFAICT either:
> >>
> >> * Single-stepping is unexpectedly expensive.
> >>
> >>   Historically we had performance issues with hypervisor trapping of
> >>   debug features, and there are things we might be able to improve in
> >>   the hypervisor and kernel, which would improve stepping *all*
> >>   instructions.
> >>
> >
> > Single-stepping will always be more expensive, as it necessitates an
> > extra kernel->user space->kernel hop, so no matter how well
> > single-stepping is optimized, if we can avoid it, we should. It will
> > be noticeable.
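
To make the trade-off concrete, the difference between the two paths boils
down to roughly the sketch below. This is a simplified outline with made-up
helper and field names (handle_uprobe_hit, auprobe->simulate,
setup_xol_and_single_step), not the actual upstream handler:

/*
 * Simplified sketch of a uprobe hit (hypothetical names, not upstream
 * code), only to show where the extra user<->kernel round trip comes from.
 */
static void handle_uprobe_hit(struct arch_uprobe *auprobe, struct pt_regs *regs)
{
	if (auprobe->simulate) {
		/* Emulate the probed insn directly on pt_regs: no second trap. */
		auprobe->simulate(auprobe, regs);
		return;
	}

	/*
	 * Otherwise: copy the insn into an xol slot, arm hardware
	 * single-step, return to userspace to execute the slot, then take
	 * one more exception to fix up the PC.
	 */
	setup_xol_and_single_step(auprobe, regs);
}
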
> >
> >> If stepping is the big problem, we could move uprobes over to a BRK
> >> rather than a single-step. That would require updating and fixing the
> >> logic to decide which instructions are steppable, but that's necessary
> >> anyway given it has extant soundness issues.
> >
> > I'm afraid I don't understand what BRK means and what the consequences
> > are in terms of overhead. I'm not an ARM person either, so sorry if
> > that's a stupid question. But either way, I can't address this. But see
> > above, emulating an instruction feels like a much better approach, if
> > possible.
>
> As I understand it, Mark's suggestion is to place a BRK instruction next
> to the instruction in the xol slot. Once the instruction in the xol slot
> has executed, the BRK instruction triggers a trap back into the kernel.
> This is a common technique used on platforms that don't support hardware
> single-stepping. However, since Arm64 does support hardware
> single-stepping, the kernel enables it in pre_ssout(), allowing the CPU
> to trap back into the kernel automatically after the instruction in the
> xol slot has executed. But even if we move uprobes over to a BRK rather
> than a single-step, it can't reduce the overhead of the
> user->kernel->user context switch on bare metal. Maybe I am wrong;
> Mark, could you give more details about the BRK approach?
>

I see, thanks for elaborating. So the suggestion was to go from the very
expensive single-stepping mode to a still expensive breakpoint-based
kernel->user->kernel workflow. I think either way it's going to be much
slower than avoiding the kernel->user->kernel hop, so we should emulate
STP instead, yep.

> >
> >>
> >> * XOL management is absurdly expensive.
> >>
> >>   Does uprobes keep the XOL slot around (like kprobes does), or does it
> >>   create the slot afresh for each trap?
> >
> > XOL *page* is created once per process, lazily, and then we just
> > juggle a bunch of fixed slots there for each instance of a
> > single-stepped uprobe. And yes, there are some bottlenecks in XOL
> > management, though it's mostly due to lock contention (as it is
> > implemented right now). Liao and Oleg have been improving XOL
> > management, but still, avoiding XOL in the first place is the much
> > preferred way.
> >
> >>
> >>   If that's trying to create a slot afresh for each trap, there are
> >>   several opportunities for improvement, e.g. keep the slot around for
> >>   as long as the uprobe exists, or pre-allocate shared slots for common
> >>   instructions and use those.
> >
> > As I mentioned, a XOL page is allocated and mapped once, but yes, it
> > seems like we dynamically get a slot in it for each single-stepped
> > execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's
> > probably not a bad idea to just cache and hold a XOL slot for each
> > specific uprobe, I don't see why we should limit ourselves to just one
> > XOL page. We also don't need to pre-size each slot, we can probably
> > allocate just the right amount of space for a given uprobe.
> >
> > All good ideas for sure, we should do them, IMO. But we'll still be
> > paying an extra kernel->user->kernel switch, which almost certainly is
> > slower than doing a simple stack push emulation just like we do in the
> > x86-64 case, no?
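
Just to make "simple stack push emulation" concrete, for the common
`stp x29, x30, [sp, #-0x10]!` form it could be as small as the sketch
below. This is a hypothetical helper (emulate_stp_fp_lr_push), not Liao's
actual patch; error handling and the other [sp, #imm] encodings are
omitted:

/*
 * Rough sketch only (not the actual patch): emulate
 * 'stp x29, x30, [sp, #-0x10]!' in the kernel instead of single-stepping
 * it in the xol slot. We run in process context, so writing the user
 * stack with copy_to_user() is fine.
 */
static bool emulate_stp_fp_lr_push(struct pt_regs *regs)
{
	u64 frame[2] = { regs->regs[29], regs->regs[30] };	/* fp, lr */
	unsigned long new_sp = regs->sp - 16;			/* pre-index writeback */

	if (copy_to_user((void __user *)new_sp, frame, sizeof(frame)))
		return false;	/* caller falls back to xol single-stepping */

	regs->sp = new_sp;
	regs->pc += 4;		/* step over the probed instruction */
	return true;
}
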
There are at least 22% CPU cycles spent in some > > local_daif_restore function, though, not sure what that is, but might > > be related to interrupt handling, right? > > The local_daif_restore() is part of the path for all user->kernel->user > context switch, including interrupt handling, breakpoints, and single-stepping > etc. I am surprised to see it consuming 22% of CPU cycles as well. I haven't > been enable to reproduce this on my local machine. > > Andrii, could you use the patch below to see if it can reduce the 5% of > CPU cycles spent in arch_uprobe_copy_ixol, I doubt that D/I cache > synchronization is the cause of this part of overhead. > > https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@xxxxxxxxxx/ tbh, I think pre-allocating and setting up fixed XOL slots instead of dynamically allocating them is the way to go. We can allocate as many special "[uprobes]" pages as necessary to accommodate all the single-stepped uprobes in XOL, remember their index, etc. I think that's much more performant (and simpler, IMO) approach overall. All this preparation, however expensive it might be, will be done once per each attached/detached uprobe/uretprobe, which is the place where we want to do expensive stuff. Not when uprobe/uretprobe is actually triggered. > > > > > > > The take away I'd like to communicate here is avoiding the > > single-stepping need is *the best way* to go, IMO. So if we can > > emulate those STP instructions for uprobe *cheaply*, that would be > > awesome. > > Given some significant uprobe optimizations from Oleg and Andrii > merged, I am curious to see how these changes impact the profiling > result on Arm64. So I re-ran the selftest bench on the latest kernel > (based on tag next-20241104) and the kernel (based on tag next-20240909) > that I used when I submitted this patch. The results re-ran are shown > below. 
> >
> > The take away I'd like to communicate here is that avoiding the
> > single-stepping need is *the best way* to go, IMO. So if we can
> > emulate those STP instructions for uprobes *cheaply*, that would be
> > awesome.
>
> Given that some significant uprobe optimizations from Oleg and Andrii
> have been merged, I am curious to see how these changes impact the
> profiling results on Arm64. So I re-ran the selftest bench on the latest
> kernel (based on tag next-20241104) and on the kernel (based on tag
> next-20240909) that I used when I submitted this patch. The re-run
> results are shown below.
>
> next-20240909 (xol stp + xol nop)
> ---------------------------------
> uprobe-nop ( 1 cpus): 0.424 ± 0.000M/s ( 0.424M/s/cpu)
> uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu)
> uprobe-ret ( 1 cpus): 2.101 ± 0.002M/s ( 2.101M/s/cpu)
> uretprobe-nop ( 1 cpus): 0.347 ± 0.000M/s ( 0.347M/s/cpu)
> uretprobe-push ( 1 cpus): 0.349 ± 0.000M/s ( 0.349M/s/cpu)
> uretprobe-ret ( 1 cpus): 1.051 ± 0.001M/s ( 1.051M/s/cpu)
>
> next-20240909 (sim stp + sim nop)
> ---------------------------------
> uprobe-nop ( 1 cpus): 2.042 ± 0.002M/s ( 2.042M/s/cpu)
> uprobe-push ( 1 cpus): 1.363 ± 0.002M/s ( 1.363M/s/cpu)
> uprobe-ret ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu)
> uretprobe-nop ( 1 cpus): 1.049 ± 0.001M/s ( 1.049M/s/cpu)
> uretprobe-push ( 1 cpus): 0.780 ± 0.000M/s ( 0.780M/s/cpu)
> uretprobe-ret ( 1 cpus): 1.065 ± 0.001M/s ( 1.065M/s/cpu)
>
> next-20241104 (xol stp + sim nop)
> ---------------------------------
> uprobe-nop ( 1 cpus): 2.044 ± 0.003M/s ( 2.044M/s/cpu)
> uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu)
> uprobe-ret ( 1 cpus): 2.047 ± 0.001M/s ( 2.047M/s/cpu)
> uretprobe-nop ( 1 cpus): 0.832 ± 0.003M/s ( 0.832M/s/cpu)
> uretprobe-push ( 1 cpus): 0.328 ± 0.000M/s ( 0.328M/s/cpu)
> uretprobe-ret ( 1 cpus): 0.833 ± 0.003M/s ( 0.833M/s/cpu)
>
> next-20241104 (sim stp + sim nop)
> ---------------------------------
> uprobe-nop ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu)
> uprobe-push ( 1 cpus): 1.411 ± 0.002M/s ( 1.411M/s/cpu)
> uprobe-ret ( 1 cpus): 2.052 ± 0.005M/s ( 2.052M/s/cpu)
> uretprobe-nop ( 1 cpus): 0.839 ± 0.005M/s ( 0.839M/s/cpu)
> uretprobe-push ( 1 cpus): 0.702 ± 0.002M/s ( 0.702M/s/cpu)
> uretprobe-ret ( 1 cpus): 0.837 ± 0.001M/s ( 0.837M/s/cpu)
>
> It seems that the STP simulation approach in this patch significantly
> improves uprobe-push throughput by 240% (from 0.415M/s to 1.411M/s) and
> uretprobe-push throughput by 114% (from 0.328M/s to 0.702M/s) on kernels
> based on next-20240909 and next-20241104. While there is still room for
> improvement to reach the throughput of -nop and -ret, the gains are very
> substantial.
>
> But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using
> single-stepped STP, which is far lower than the results from when I
> submitted this patch (look closely at the uprobe-push and uretprobe-push
> numbers in the commit log). I'm certain that the tests were run on the
> same bare-metal machine with background tasks minimized. I suspect some
> uncommitted uprobe optimization in my local repo skewed the -push
> results that use single-stepping.

You can always profile and compare before/after, right? See where the
added costs are coming from?

>
> In addition to the micro benchmark, I also re-ran the Redis benchmark to
> compare the impact of single-stepped STP and simulated STP on the
> throughput of redis-server. I believe the impact of uprobes on a real
> application depends on how frequently the uprobes are triggered and on
> the application's hot paths. Therefore, I wouldn't say the simulated STP
> will benefit all real-world applications.

It will benefit *a lot* of real-world applications, though, so I think
it's very important to improve.
>
> $ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET
> $ redis-server --port 7778 --protected-mode no --save "" --appendonly no &
> $ bpftrace -e 'uprobe:redis-server:readQueryFromClient{}
>               uprobe:redis-server:processCommand{}
>               uprobe:redis-server:aeApiPoll {}'
>
> next-20241104
> -------------
> RPS: 55602.1
>
> next-20241104 + ss stp
> ----------------------
> RPS: 47220.9
> uprobe@aeApiPoll: 554565
> uprobe@processCommand: 1275160
> uprobe@readQueryFromClient: 1277710
>
> next-20241104 + sim stp
> -----------------------
> RPS: 54290.09
> uprobe@aeApiPoll: 496007
> uprobe@processCommand: 1275160
> uprobe@readQueryFromClient: 1277710
>
> Andrii expressed concern that the STP simulation in this patch is too
> expensive. If we trust the results I re-ran, perhaps it is not a bad way
> to simulate STP. Looking forward to your feedback; if someone can
> propose a cheaper way to simulate STP, I'm very happy to test it on my
> machine, thanks.

I'm no ARM64 expert, but seeing that we simulate stack pushes with just
memory reads/writes for x86-64, it feels like that should be satisfactory
for ARM64 as well. So I'd suggest you go back to the initial
implementation, clean it up, rebase, re-benchmark, and send a new
revision. Let's continue the discussion there?

>
> [...]
>
> >>>
> >>> xol-stp
> >>> -------
> >>> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu)
> >>> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu)
> >>> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu)
> >>> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu)
> >>> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu)
> >>> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu)
> >>>
> >>> simulated-stp
> >>> -------------
> >>> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu)
> >>> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu)
> >>> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu)
> >>> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu)
> >>> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu)
> >>> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu)
> >>>
> >>> The profiling results based on the upstream kernel with the spinlock
> >>> optimization patches [2] reveal that the simulation of STP increases
> >>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu)
> >>> and uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
> >>>
> >>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@xxxxxxxxxxxxxx/
> >>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> >>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@xxxxxxxxxx/
> >>>
> >>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
> >>> ---
> >>>  arch/arm64/include/asm/insn.h            |  1 +
> >>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
> >>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
> >>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
> >>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
> >>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
> >>>  arch/arm64/lib/insn.c                    |  5 ++
> >>>  7 files changed, 134 insertions(+)
> >>>
> >
> > [...]
>
> --
> BR
> Liao, Chang
>