On Thu, Nov 7, 2024 at 7:13 PM Liao, Chang <liaochang1@xxxxxxxxxx> wrote: > > > > 在 2024/11/7 3:45, Andrii Nakryiko 写道: > > On Tue, Nov 5, 2024 at 4:22 AM Liao, Chang <liaochang1@xxxxxxxxxx> wrote: > >> > >> Andrii and Mark. > >> > >> 在 2024/10/26 4:51, Andrii Nakryiko 写道: > >>> On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@xxxxxxx> wrote: > >>>> > >>>> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote: > >>>>> This patch is the second part of a series to improve the selftest bench > >>>>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]' > >>>>> significantly impact uprobe/uretprobe performance at function entry in > >>>>> most user cases. Profiling results below reveals the STP that executes > >>>>> in the xol slot and trap back to kernel, reduce redis RPS and increase > >>>>> the time of string grep obviously. > >>>>> > >>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz. > >>>>> > >>>>> Redis GET (higher is better) > >>>>> ---------------------------- > >>>>> No uprobe: 49149.71 RPS > >>>>> Single-stepped STP: 46750.82 RPS > >>>>> Emulated STP: 48981.19 RPS > >>>>> > >>>>> Redis SET (larger is better) > >>>>> ---------------------------- > >>>>> No uprobe: 49761.14 RPS > >>>>> Single-stepped STP: 45255.01 RPS > >>>>> Emulated stp: 48619.21 RPS > >>>>> > >>>>> Grep (lower is better) > >>>>> ---------------------- > >>>>> No uprobe: 2.165s > >>>>> Single-stepped STP: 15.314s > >>>>> Emualted STP: 2.216s > >>>> > >>>> The results for grep are concerning. > >>>> > >>>> In theory, the overhead for stepping should be roughly double the > >>>> overhead for emulating, assuming the exception-entry and > >>>> exception-return are the dominant cost. The cost of stepping should be > >>>> trivial. > >>>> > >>>> Those results show emulating adds 0.051s (for a ~2.4% overhead), while > >>>> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x > >>>> more expensive. > >>>> > >>>> Was this tested bare-metal, or in a VM? > >>> > >>> Hey Mark, I hope Liao will have a chance to reply, I don't know the > >>> details of his benchmarking. But I can try to give you my numbers and > >>> maybe answer a few questions, hopefully that helps move the > >>> conversation forward. > >>> > >>> So, first of all, I did a quick benchmark on bare metal (without > >>> Liao's optimization, though), here are my results: > >>> > >>> uprobe-nop ( 1 cpus): 2.334 ± 0.011M/s ( 2.334M/s/cpu) > >>> uprobe-push ( 1 cpus): 2.321 ± 0.010M/s ( 2.321M/s/cpu) > >>> uprobe-ret ( 1 cpus): 4.144 ± 0.041M/s ( 4.144M/s/cpu) > >>> > >>> uretprobe-nop ( 1 cpus): 1.684 ± 0.004M/s ( 1.684M/s/cpu) > >>> uretprobe-push ( 1 cpus): 1.736 ± 0.003M/s ( 1.736M/s/cpu) > >>> uretprobe-ret ( 1 cpus): 2.502 ± 0.006M/s ( 2.502M/s/cpu) > >>> > >>> uretprobes are inherently slower, so I'll just compare uprobe, as the > >>> differences are very clear either way. > >>> > >>> -nop is literally nop (Liao solved that issue, I just don't have his > >>> patch applied on my test machine). -push has `stp x29, x30, [sp, > >>> #-0x10]!` instruction traced. -ret is literally just `ret` > >>> instruction. > >>> > >>> So you can see that -ret is almost twice as fast as the -push variant > >>> (it's a microbenchmark, yes, but still). > >>> > >>>> > >>>> AFAICT either: > >>>> > >>>> * Single-stepping is unexpectedly expensive. > >>>> > >>>> Historically we had performance issues with hypervisor trapping of > >>>> debug features, and there are things we might be able to improve in > >>>> the hypervisor and kernel, which would improve stepping *all* > >>>> instructions. > >>>> > >>> > >>> Single-stepping will always be more expensive, as it necessitates > >>> extra hop kernel->user space->kernel, so no matter the optimization > >>> for single-stepping, if we can avoid it, we should. It will be > >>> noticeable. > >>> > >>>> If stepping is the big problem, we could move uprobes over to a BRK > >>>> rather than a single-step. That would require require updating and > >>>> fixing the logic to decide which instructions are steppable, but > >>>> that's necessary anyway given it has extant soundness issues. > >>> > >>> I'm afraid I don't understand what BRK means and what are the > >>> consequences in terms of overheads. I'm not an ARM person either, so > >>> sorry if that's a stupid question. But either way, I can't address > >>> this. But see above, emulating an instruction feels like a much better > >>> approach, if possible. > >> > >> As I understand, Mark's suggestion is to place a BRK instruction next to > >> the instruction in the xol slot. Once the instruction in the xol slot > >> executed, the BRK instruction would trigger a trap into kernel. This is > >> a common technique used on platforms that don't support hardware single- > >> step. However, since Arm64 does support hardware single-stepping, kernel > >> enables it in pre_ssout(), allowing the CPU to automatically trap into kernel > >> after instruction in xol slot executed. But even we move uprobes over > >> to a BRK rather than a single-step. It can't reduce the overhead of user-> > >> kernel->user context switch on the bare-metal. Maybe I am wrong, Mark, > >> could you give more details about the BRK. > >> > > > > I see, thanks for elaborating. So the suggestion was to go from very > > expensive single-stepping mode to still expensive breakpoint-based > > kernel->user->kernel workflow. > > > > I think either way it's going to be much slower than avoiding > > kernel->user->kernel hop, so we should emulate STP instead, yep. > > Exactly, in most cases, simluation is the better option. > > > > >>> > >>>> > >>>> * XOL management is absurdly expensive. > >>>> > >>>> Does uprobes keep the XOL slot around (like krpobes does), or does it > >>>> create the slot afresh for each trap? > >>> > >>> XOL *page* is created once per process, lazily, and then we just > >>> juggle a bunch of fixed slots there for each instance of > >>> single-stepped uprobe. And yes, there are some bottlenecks in XOL > >>> management, though it's mostly due to lock contention (as it is > >>> implemented right now). Liao and Oleg have been improving XOL > >>> management, but still, avoiding XOL in the first place is the much > >>> preferred way. > >>> > >>>> > >>>> If that's trying to create a slot afresh for each trap, there are > >>>> several opportunities for improvement, e.g. keep the slot around for > >>>> as long as the uprobe exists, or pre-allocate shared slots for common > >>>> instructions and use those. > >>> > >>> As I mentioned, a XOL page is allocated and mapped once, but yes, it > >>> seems like we dynamically get a slot in it for each single-stepped > >>> execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's > >>> probably not a bad idea to just cache and hold a XOL slot for each > >>> specific uprobe, I don't see why we should limit ourselves to just one > >>> XOL page. We also don't need to pre-size each slot, we can probably > >>> allocate just the right amount of space for a given uprobe. > >>> > >>> All good ideas for sure, we should do them, IMO. But we'll still be > >>> paying an extra kernel->user->kernel switch, which almost certainly is > >>> slower than doing a simple stack push emulation just like we do in > >>> x86-64 case, no? > >>> > >>> > >>> BTW, I did a quick local profiling run. I don't think XOL management > >>> is the main source of overhead. I see 5% of CPU cycles spent in > >>> arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack > >>> traces. There are at least 22% CPU cycles spent in some > >>> local_daif_restore function, though, not sure what that is, but might > >>> be related to interrupt handling, right? > >> > >> The local_daif_restore() is part of the path for all user->kernel->user > >> context switch, including interrupt handling, breakpoints, and single-stepping > >> etc. I am surprised to see it consuming 22% of CPU cycles as well. I haven't > >> been enable to reproduce this on my local machine. > >> > >> Andrii, could you use the patch below to see if it can reduce the 5% of > >> CPU cycles spent in arch_uprobe_copy_ixol, I doubt that D/I cache > >> synchronization is the cause of this part of overhead. > >> > >> https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@xxxxxxxxxx/ > > > > tbh, I think pre-allocating and setting up fixed XOL slots instead of > > dynamically allocating them is the way to go. We can allocate as many > > special "[uprobes]" pages as necessary to accommodate all the > > single-stepped uprobes in XOL, remember their index, etc. I think > > that's much more performant (and simpler, IMO) approach overall. All > > this preparation, however expensive it might be, will be done once per > > each attached/detached uprobe/uretprobe, which is the place where we > > want to do expensive stuff. Not when uprobe/uretprobe is actually > > triggered. > > Generally agreed. But I have two concerns about pre-allocating of XOL slots: > > 1. If some uprobes/uretprobes are rarely or never triggered, pre-allocating > slots for them seem wastful. However, it isn't a issue on machines with > ample memory(e.g., hundreds of GB or a couple of TB). I'm not too worried here because a) hopefully most of uprobes won't need XOL and b) if user attaches to lots of uprobes and needs lots of XOL slots, then machine is assumed to be powerful enough to handle that. If we run out of memory, then that machine shouldn't be uprobed that extensively. Generally, I think this is not a big concern. But keep in mind, if we go this way, I think we need to make XOL slots as small and exactly sized as necessary to fit just enough code to handle single-stepping for that particular uprobe. Currently we size them to 128 bytes to make it easy to dynamically allocate them in the hot path. With pre-allocating XOL we have different tradeoffs and I'd allocate them as tightly as possible, because attach/detach can be significantly slower than uprobe/uretprobe triggering. > > 2. Currently, threads that trigger the same uprobe/uretprobe are dynamically > allocated different slots. Since we can't predict how many threads will > trigger the same uprobe/uretprobe. it can't pre-allocate enough slots for > per each uprobe/uretprobe. If you allow all threads to share the slot > associated with each uprobe/uretprobe. this would result in XOL pages being > shared across procecces, I think it would introduce significant changes to > the page management of xol_area and fault handling. Yes, I was thinking that all threads will share the same XOL slot, why would you want that per-thread? And yes, it's a pretty significant change, but I think it's a good one. > > > > >> > >>> > >>> > >>> The take away I'd like to communicate here is avoiding the > >>> single-stepping need is *the best way* to go, IMO. So if we can > >>> emulate those STP instructions for uprobe *cheaply*, that would be > >>> awesome. > >> > >> Given some significant uprobe optimizations from Oleg and Andrii > >> merged, I am curious to see how these changes impact the profiling > >> result on Arm64. So I re-ran the selftest bench on the latest kernel > >> (based on tag next-20241104) and the kernel (based on tag next-20240909) > >> that I used when I submitted this patch. The results re-ran are shown > >> below. > >> > >> next-20240909(xol stp + xol nop) > >> -------------------------------- > >> uprobe-nop ( 1 cpus): 0.424 ± 0.000M/s ( 0.424M/s/cpu) > >> uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu) > >> uprobe-ret ( 1 cpus): 2.101 ± 0.002M/s ( 2.101M/s/cpu) > >> uretprobe-nop ( 1 cpus): 0.347 ± 0.000M/s ( 0.347M/s/cpu) > >> uretprobe-push ( 1 cpus): 0.349 ± 0.000M/s ( 0.349M/s/cpu) > >> uretprobe-ret ( 1 cpus): 1.051 ± 0.001M/s ( 1.051M/s/cpu) > >> > >> next-20240909(sim stp + sim nop) > >> -------------------------------- > >> uprobe-nop ( 1 cpus): 2.042 ± 0.002M/s ( 2.042M/s/cpu) > >> uprobe-push ( 1 cpus): 1.363 ± 0.002M/s ( 1.363M/s/cpu) > >> uprobe-ret ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu) > >> uretprobe-nop ( 1 cpus): 1.049 ± 0.001M/s ( 1.049M/s/cpu) > >> uretprobe-push ( 1 cpus): 0.780 ± 0.000M/s ( 0.780M/s/cpu) > >> uretprobe-ret ( 1 cpus): 1.065 ± 0.001M/s ( 1.065M/s/cpu) > >> > >> next-20241104 (xol stp + sim nop) > >> --------------------------------- > >> uprobe-nop ( 1 cpus): 2.044 ± 0.003M/s ( 2.044M/s/cpu) > >> uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu) > >> uprobe-ret ( 1 cpus): 2.047 ± 0.001M/s ( 2.047M/s/cpu) > >> uretprobe-nop ( 1 cpus): 0.832 ± 0.003M/s ( 0.832M/s/cpu) > >> uretprobe-push ( 1 cpus): 0.328 ± 0.000M/s ( 0.328M/s/cpu) > >> uretprobe-ret ( 1 cpus): 0.833 ± 0.003M/s ( 0.833M/s/cpu) > >> > >> next-20241104 (sim stp + sim nop) > >> --------------------------------- > >> uprobe-nop ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu) > >> uprobe-push ( 1 cpus): 1.411 ± 0.002M/s ( 1.411M/s/cpu) > >> uprobe-ret ( 1 cpus): 2.052 ± 0.005M/s ( 2.052M/s/cpu) > >> uretprobe-nop ( 1 cpus): 0.839 ± 0.005M/s ( 0.839M/s/cpu) > >> uretprobe-push ( 1 cpus): 0.702 ± 0.002M/s ( 0.702M/s/cpu) > >> uretprobe-ret ( 1 cpus): 0.837 ± 0.001M/s ( 0.837M/s/cpu) > >> > >> It seems that the STP simluation approach in this patch significantly > >> improves uprobe-push throughtput by 240% (from 0.415Ms/ to 1.411M/s) > >> and uretprobe-push by 114% (from 0.328M/s to 0.702M/s) on kernels > >> bases on next-20240909 and next-20241104. While there is still room > >> for improvement to reach the throughput of -nop and -ret, the gains > >> are very substantail. > >> > >> But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using > >> single-stepping stp, which are far lower compared to the result when > >> when I submitted patch(look closely to the uprobe-push and uretprobe-push > >> results in commit log). I'm certain that the tests were run on the > >> same bare-metal machine with background tasked minimized. I doubt some > >> uncommitted uprobe optimization on my local repo twist the result of > >> -push using single-step. > > > > You can always profiler and compare before/after, right? See where > > added costs are coming from? > > Yes, I've been looking through the git reflog to restore the > kernel tree to the state when I submitted this patch. > > Regarding the throughtput data of uprobe/uretprobe-push using > single-stepping, I believe *the re-ran result provide a more > accurate picture*. If we carefully compare the throughput of > -nop and -push, we can see that they are very close (uprobe-nop > is 0.424M/s/cpu and uprobe-push is 0.415M/s/cpu, uretprobe-nop > is 0.347M/s/cpu and upretprobe-push is 0.349M/s/cpu). This is > expected, as both use single-stepping to execute NOP and STP. > There's no reason -push using single-step should outperform the > one of -nop(uprobe-push is 0.868M/s/cpu a month ago). > > In summary, understanding these performance is crucial for selecting > an approach that balances accuracy and efficiency. Although this > patch offers lower gain compared to my initial implementation. > But, considering the complexity of 'STP', the cost seems acceptable, > right? But why do the suboptimal thing if we can do better? > > > > >> > >> In addition to the micro benchmark, I also re-ran Redis benchmark to > >> compare the impact of single-stepping STP and simluated STP to the > >> throughput of redis-server. I believe the impact of uprobe on the real > >> application depends on the frequency of uprobe triggered and the application's > >> hot paths. Therefore, I wouldn't say the simluated STP will benefit all > >> real world applications. > > > > It will benefit *a lot* of real world applications, though, so I think > > it's very important to improve. > > Good to see that. > > > > >> > >> $ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET > >> $ redis-server --port 7778 --protected-mode no --save "" --appendonly no & && > >> bpftrace -e 'uprobe:redis-server:readQueryFromClient{} > >> uprobe:redis-server:processCommand{} > >> uprobe:redis-server:aeApiPoll {}' > >> > >> next-20241104 > >> ------------- > >> RPS: 55602.1 > >> > >> next-20241104 + ss stp > >> ---------------------- > >> RPS: 47220.9 > >> uprobe@@aeApiPoll: 554565 > >> uprobe@processCommand: 1275160 > >> uprobe@readQueryFromClient: 1277710 > >>>> next-20241104 + sim stp > >> ----------------------- > >> RPS 54290.09 > >> uprobe@aeApiPoll: 496007 > >> uprobe@processCommand: 1275160 > >> uprobe@readQueryFromClient: 1277710 > >> > >> Andrii expressed concern that the STP simulation in this patch is too > >> expensive. If we believe the result I re-ran, perhaps it is not a > >> bad way to simluate STP. Looking forward to your feedbacks, or someone > >> could propose a cheaper way to simluate STP, I'm very happy to test it > >> on my machine, thanks. > > > > I'm no ARM64 expert, but seeing that we simulate stack pushes with > > just memory reads/write for x86-64, it feels like that should be > > satisfactory for ARM64. So I'd suggest you to go back to the initial > > implementation, clean it up, rebase, re-benchmark, and send a new > > revision. Let's continue the discussion there? > > I've re-benchmarked the initial implementation on the latest kernel > tree (tag next-20241104). I paste the results for single-stepping, > the initial patch and this patch here for comparision. Please see > the results below: > > next-20241104 + xol stp > ----------------------- > uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu) > uretprobe-push ( 1 cpus): 0.328 ± 0.000M/s ( 0.328M/s/cpu) > > next-20221104 + initial patch > ----------------------------- > uprobe-push ( 1 cpus): 1.798 ± 0.001M/s ( 1.798M/s/cpu) > uretprobe-push ( 1 cpus): 0.806 ± 0.001M/s ( 0.806M/s/cpu) > > next-20241104 + this patch > -------------------------- > uprobe-push ( 1 cpus): 1.411 ± 0.002M/s ( 1.411M/s/cpu) > uretprobe-push ( 1 cpus): 0.702 ± 0.002M/s ( 0.702M/s/cpu) > > As shown in the benchmark results, the initial implementation offers > a limited performance advantage, especailly, it comes at the cost of > accuracy. > What limited accuracy? And since when almost 30% (1.4M/s -> 1.8M/s) is a "limited performance advantage"? > > > >> > >> [...] > >> > >>>>> > >>>>> xol-stp > >>>>> ------- > >>>>> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu) > >>>>> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu) > >>>>> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu) > >>>>> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu) > >>>>> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu) > >>>>> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu) > >>>>> > >>>>> simulated-stp > >>>>> ------------- > >>>>> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu) > >>>>> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu) > >>>>> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu) > >>>>> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu) > >>>>> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu) > >>>>> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu) > >>>>> > >>>>> The profiling results based on the upstream kernel with spinlock > >>>>> optimization patches [2] reveals the simulation of STP increase the > >>>>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and > >>>>> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu). > >>>>> > >>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@xxxxxxxxxxxxxx/ > >>>>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ > >>>>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@xxxxxxxxxx/ > >>>>> > >>>>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx> > >>>>> --- > >>>>> arch/arm64/include/asm/insn.h | 1 + > >>>>> arch/arm64/kernel/probes/decode-insn.c | 16 +++++ > >>>>> arch/arm64/kernel/probes/decode-insn.h | 1 + > >>>>> arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++ > >>>>> arch/arm64/kernel/probes/simulate-insn.h | 1 + > >>>>> arch/arm64/kernel/probes/uprobes.c | 21 ++++++ > >>>>> arch/arm64/lib/insn.c | 5 ++ > >>>>> 7 files changed, 134 insertions(+) > >>>>> > >>> > >>> [...] > >> > >> -- > >> BR > >> Liao, Chang > >> > > -- > BR > Liao, Chang >