Andrii and Mark.

On 2024/10/26 4:51, Andrii Nakryiko wrote:
> On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>>
>> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
>>> This patch is the second part of a series to improve the selftest bench
>>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
>>> significantly impact uprobe/uretprobe performance at function entry in
>>> most user cases. Profiling results below reveals the STP that executes
>>> in the xol slot and trap back to kernel, reduce redis RPS and increase
>>> the time of string grep obviously.
>>>
>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>
>>> Redis GET (higher is better)
>>> ----------------------------
>>> No uprobe: 49149.71 RPS
>>> Single-stepped STP: 46750.82 RPS
>>> Emulated STP: 48981.19 RPS
>>>
>>> Redis SET (larger is better)
>>> ----------------------------
>>> No uprobe: 49761.14 RPS
>>> Single-stepped STP: 45255.01 RPS
>>> Emulated stp: 48619.21 RPS
>>>
>>> Grep (lower is better)
>>> ----------------------
>>> No uprobe: 2.165s
>>> Single-stepped STP: 15.314s
>>> Emualted STP: 2.216s
>>
>> The results for grep are concerning.
>>
>> In theory, the overhead for stepping should be roughly double the
>> overhead for emulating, assuming the exception-entry and
>> exception-return are the dominant cost. The cost of stepping should be
>> trivial.
>>
>> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
>> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
>> more expensive.
>>
>> Was this tested bare-metal, or in a VM?
>
> Hey Mark, I hope Liao will have a chance to reply, I don't know the
> details of his benchmarking. But I can try to give you my numbers and
> maybe answer a few questions, hopefully that helps move the
> conversation forward.
>
> So, first of all, I did a quick benchmark on bare metal (without
> Liao's optimization, though), here are my results:
>
> uprobe-nop      ( 1 cpus):    2.334 ± 0.011M/s  (  2.334M/s/cpu)
> uprobe-push     ( 1 cpus):    2.321 ± 0.010M/s  (  2.321M/s/cpu)
> uprobe-ret      ( 1 cpus):    4.144 ± 0.041M/s  (  4.144M/s/cpu)
>
> uretprobe-nop   ( 1 cpus):    1.684 ± 0.004M/s  (  1.684M/s/cpu)
> uretprobe-push  ( 1 cpus):    1.736 ± 0.003M/s  (  1.736M/s/cpu)
> uretprobe-ret   ( 1 cpus):    2.502 ± 0.006M/s  (  2.502M/s/cpu)
>
> uretprobes are inherently slower, so I'll just compare uprobe, as the
> differences are very clear either way.
>
> -nop is literally nop (Liao solved that issue, I just don't have his
> patch applied on my test machine). -push has `stp x29, x30, [sp,
> #-0x10]!` instruction traced. -ret is literally just `ret`
> instruction.
>
> So you can see that -ret is almost twice as fast as the -push variant
> (it's a microbenchmark, yes, but still).
>
>>
>> AFAICT either:
>>
>> * Single-stepping is unexpectedly expensive.
>>
>>   Historically we had performance issues with hypervisor trapping of
>>   debug features, and there are things we might be able to improve in
>>   the hypervisor and kernel, which would improve stepping *all*
>>   instructions.
>>
>
> Single-stepping will always be more expensive, as it necessitates
> extra hop kernel->user space->kernel, so no matter the optimization
> for single-stepping, if we can avoid it, we should. It will be
> noticeable.
>
>>   If stepping is the big problem, we could move uprobes over to a BRK
>>   rather than a single-step. That would require require updating and
>>   fixing the logic to decide which instructions are steppable, but
>>   that's necessary anyway given it has extant soundness issues.
>
> I'm afraid I don't understand what BRK means and what are the
> consequences in terms of overheads. I'm not an ARM person either, so
> sorry if that's a stupid question. But either way, I can't address
> this.
> But see above, emulating an instruction feels like a much better
> approach, if possible.

As I understand it, Mark's suggestion is to place a BRK instruction
immediately after the instruction in the xol slot. Once the instruction
in the xol slot has executed, the BRK would trap into the kernel. This
is a common technique on platforms that don't support hardware
single-step. However, since Arm64 does support hardware single-stepping,
the kernel enables it in pre_ssout(), so the CPU traps into the kernel
automatically after the instruction in the xol slot has executed. But
even if we move uprobes over to a BRK rather than a single-step, that
can't reduce the overhead of the user->kernel->user context switch on
bare metal. Maybe I am wrong. Mark, could you give more details about
the BRK approach?

>
>>
>> * XOL management is absurdly expensive.
>>
>>   Does uprobes keep the XOL slot around (like krpobes does), or does it
>>   create the slot afresh for each trap?
>
> XOL *page* is created once per process, lazily, and then we just
> juggle a bunch of fixed slots there for each instance of
> single-stepped uprobe. And yes, there are some bottlenecks in XOL
> management, though it's mostly due to lock contention (as it is
> implemented right now). Liao and Oleg have been improving XOL
> management, but still, avoiding XOL in the first place is the much
> preferred way.
>
>>
>>   If that's trying to create a slot afresh for each trap, there are
>>   several opportunities for improvement, e.g. keep the slot around for
>>   as long as the uprobe exists, or pre-allocate shared slots for common
>>   instructions and use those.
>
> As I mentioned, a XOL page is allocated and mapped once, but yes, it
> seems like we dynamically get a slot in it for each single-stepped
> execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's
> probably not a bad idea to just cache and hold a XOL slot for each
> specific uprobe, I don't see why we should limit ourselves to just one
> XOL page.
> We also don't need to pre-size each slot, we can probably
> allocate just the right amount of space for a given uprobe.
>
> All good ideas for sure, we should do them, IMO. But we'll still be
> paying an extra kernel->user->kernel switch, which almost certainly is
> slower than doing a simple stack push emulation just like we do in
> x86-64 case, no?
>
>
> BTW, I did a quick local profiling run. I don't think XOL management
> is the main source of overhead. I see 5% of CPU cycles spent in
> arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack
> traces. There are at least 22% CPU cycles spent in some
> local_daif_restore function, though, not sure what that is, but might
> be related to interrupt handling, right?

local_daif_restore() is on the path of every user->kernel->user context
switch, including interrupt handling, breakpoints, single-stepping and
so on. I am also surprised to see it consuming 22% of CPU cycles, and I
haven't been able to reproduce this on my local machine. Andrii, could
you try the patch below and see whether it reduces the 5% of CPU cycles
spent in arch_uprobe_copy_ixol? I suspect that D/I cache synchronization
is the cause of that part of the overhead.

https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@xxxxxxxxxx/

>
>
> The take away I'd like to communicate here is avoiding the
> single-stepping need is *the best way* to go, IMO. So if we can
> emulate those STP instructions for uprobe *cheaply*, that would be
> awesome.

Given that some significant uprobe optimizations from Oleg and Andrii
have been merged, I am curious how these changes affect the profiling
results on Arm64. So I re-ran the selftest bench on the latest kernel
(based on tag next-20241104) and on the kernel (based on tag
next-20240909) that I used when I submitted this patch. The results of
the re-run are shown below.

next-20240909 (xol stp + xol nop)
---------------------------------
uprobe-nop      ( 1 cpus):    0.424 ± 0.000M/s  (  0.424M/s/cpu)
uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
uprobe-ret      ( 1 cpus):    2.101 ± 0.002M/s  (  2.101M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.347 ± 0.000M/s  (  0.347M/s/cpu)
uretprobe-push  ( 1 cpus):    0.349 ± 0.000M/s  (  0.349M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.051 ± 0.001M/s  (  1.051M/s/cpu)

next-20240909 (sim stp + sim nop)
---------------------------------
uprobe-nop      ( 1 cpus):    2.042 ± 0.002M/s  (  2.042M/s/cpu)
uprobe-push     ( 1 cpus):    1.363 ± 0.002M/s  (  1.363M/s/cpu)
uprobe-ret      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
uretprobe-nop   ( 1 cpus):    1.049 ± 0.001M/s  (  1.049M/s/cpu)
uretprobe-push  ( 1 cpus):    0.780 ± 0.000M/s  (  0.780M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.065 ± 0.001M/s  (  1.065M/s/cpu)

next-20241104 (xol stp + sim nop)
---------------------------------
uprobe-nop      ( 1 cpus):    2.044 ± 0.003M/s  (  2.044M/s/cpu)
uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
uprobe-ret      ( 1 cpus):    2.047 ± 0.001M/s  (  2.047M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.832 ± 0.003M/s  (  0.832M/s/cpu)
uretprobe-push  ( 1 cpus):    0.328 ± 0.000M/s  (  0.328M/s/cpu)
uretprobe-ret   ( 1 cpus):    0.833 ± 0.003M/s  (  0.833M/s/cpu)

next-20241104 (sim stp + sim nop)
---------------------------------
uprobe-nop      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
uprobe-push     ( 1 cpus):    1.411 ± 0.002M/s  (  1.411M/s/cpu)
uprobe-ret      ( 1 cpus):    2.052 ± 0.005M/s  (  2.052M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.839 ± 0.005M/s  (  0.839M/s/cpu)
uretprobe-push  ( 1 cpus):    0.702 ± 0.002M/s  (  0.702M/s/cpu)
uretprobe-ret   ( 1 cpus):    0.837 ± 0.001M/s  (  0.837M/s/cpu)

It seems that the STP simulation approach in this patch significantly
improves uprobe-push throughput by 240% (from 0.415M/s to 1.411M/s) and
uretprobe-push by 114% (from 0.328M/s to 0.702M/s) on next-20241104,
with similar gains on next-20240909. While there is still room for
improvement to reach the throughput of -nop and -ret, the gains are
very substantial.
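
As a quick sanity check, the percentages quoted above can be re-derived from the -push rows of the tables. The snippet below is only a sketch for checking the arithmetic: the helper name gain_percent and the RESULTS layout are made up for illustration, and the numbers are copied from the bench output above.

```python
# Throughput in M/s, copied from the -push rows of the bench results:
# (single-stepped STP in the xol slot, simulated STP) per kernel tag.
RESULTS = {
    "next-20240909": {
        "uprobe-push": (0.415, 1.363),
        "uretprobe-push": (0.349, 0.780),
    },
    "next-20241104": {
        "uprobe-push": (0.415, 1.411),
        "uretprobe-push": (0.328, 0.702),
    },
}


def gain_percent(before, after):
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for kernel, benches in RESULTS.items():
        for bench, (xol, sim) in benches.items():
            print(f"{kernel} {bench}: {xol} -> {sim} M/s "
                  f"(+{gain_percent(xol, sim):.0f}%)")
```

For next-20241104 this reproduces the 240% and 114% figures; for next-20240909 the same calculation gives roughly 228% and 123%.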

But I'm a bit puzzled by the uprobe-push and uretprobe-push throughput
with single-stepped STP, which is far lower than the results from when
I submitted the patch (compare the uprobe-push and uretprobe-push
numbers in the commit log). I'm certain that the tests were run on the
same bare-metal machine with background tasks minimized. I suspect some
uncommitted uprobe optimization in my local repo skewed the -push
results with single-step.

In addition to the micro benchmark, I also re-ran the Redis benchmark
to compare the impact of single-stepped STP and simulated STP on the
throughput of redis-server. I believe the impact of uprobes on a real
application depends on how often the uprobes fire and on the
application's hot paths. Therefore, I wouldn't say that simulated STP
will benefit all real-world applications.

$ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET
$ redis-server --port 7778 --protected-mode no --save "" --appendonly no &
$ bpftrace -e 'uprobe:redis-server:readQueryFromClient{} uprobe:redis-server:processCommand{} uprobe:redis-server:aeApiPoll {}'

next-20241104
-------------
RPS: 55602.1

next-20241104 + ss stp
----------------------
RPS: 47220.9
uprobe@aeApiPoll: 554565
uprobe@processCommand: 1275160
uprobe@readQueryFromClient: 1277710

next-20241104 + sim stp
-----------------------
RPS: 54290.09
uprobe@aeApiPoll: 496007
uprobe@processCommand: 1275160
uprobe@readQueryFromClient: 1277710

Andrii expressed concern that the STP simulation in this patch is too
expensive. If we trust the re-run results, perhaps it is not a bad way
to simulate STP. I look forward to your feedback. If someone can
propose a cheaper way to simulate STP, I'd be very happy to test it on
my machine. Thanks.

[...]
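
To make concrete what simulating the STP at function entry involves, here is a minimal user-space model of `stp x29, x30, [sp, #-16]!` in Python. It is a sketch under stated assumptions, not the code in this patch: the register dictionary, the bytearray standing in for the user stack, and the names decode_stp_pre64 and emulate_stp are all hypothetical. A kernel implementation would work on pt_regs and store to user memory instead.

```python
import struct

# Encoding of `stp x29, x30, [sp, #-16]!` (64-bit STP, pre-indexed).
STP_FP_LR_PRE = 0xA9BF7BFD


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as two's complement."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_stp_pre64(insn):
    """Return (rt, rt2, rn, offset) for a 64-bit pre-indexed STP, else None."""
    # Bits [31:22] must be 0b1010100110: opc=10, 101, V=0, pre-index, L=0.
    if (insn & 0xFFC00000) != 0xA9800000:
        return None
    imm7 = (insn >> 15) & 0x7F
    rt2 = (insn >> 10) & 0x1F
    rn = (insn >> 5) & 0x1F
    rt = insn & 0x1F
    return rt, rt2, rn, sign_extend(imm7, 7) * 8  # imm7 is scaled by 8


def emulate_stp(insn, regs, stack, stack_base):
    """Emulate the store on a toy register file; return True if handled.

    regs:  {"x": [31 ints], "sp": int, "pc": int}  (hypothetical layout)
    stack: bytearray covering [stack_base, stack_base + len(stack)).
    """
    decoded = decode_stp_pre64(insn)
    if decoded is None:
        return False
    rt, rt2, rn, offset = decoded

    # Register 31 means SP as a base, but XZR (zero) as a stored value.
    base = regs["sp"] if rn == 31 else regs["x"][rn]
    val1 = 0 if rt == 31 else regs["x"][rt]
    val2 = 0 if rt2 == 31 else regs["x"][rt2]

    addr = base + offset  # pre-index: offset is applied before the store
    off = addr - stack_base
    if off < 0 or off + 16 > len(stack):
        return False  # outside the modelled stack; real code must handle faults
    struct.pack_into("<QQ", stack, off, val1, val2)  # little-endian pair

    # Write back the updated base, then step past the 4-byte instruction.
    if rn == 31:
        regs["sp"] = addr
    else:
        regs["x"][rn] = addr
    regs["pc"] += 4
    return True


if __name__ == "__main__":
    regs = {"x": [0] * 31, "sp": 0x1040, "pc": 0x400000}
    regs["x"][29], regs["x"][30] = 0x1111, 0x2222
    stack = bytearray(0x40)  # models addresses 0x1000..0x103f
    print(emulate_stp(STP_FP_LR_PRE, regs, stack, 0x1000), hex(regs["sp"]))
```

In this model the whole operation is one decode, two 8-byte stores and a base-register update, with no round trip through user space.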

>>>
>>> xol-stp
>>> -------
>>> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
>>> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
>>> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
>>> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
>>> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
>>> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
>>>
>>> simulated-stp
>>> -------------
>>> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
>>> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
>>> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
>>> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
>>> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
>>> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
>>>
>>> The profiling results based on the upstream kernel with spinlock
>>> optimization patches [2] reveals the simulation of STP increase the
>>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
>>> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
>>>
>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@xxxxxxxxxxxxxx/
>>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
>>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@xxxxxxxxxx/
>>>
>>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
>>> ---
>>>  arch/arm64/include/asm/insn.h            |  1 +
>>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>>>  arch/arm64/lib/insn.c                    |  5 ++
>>>  7 files changed, 134 insertions(+)
>>>
>
> [...]

--
BR
Liao, Chang