On Tue, Oct 15, 2024 at 5:53 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote: > > Hi Andrii, > > Sorry I excavated this from patchwork. > > On Mon, 29 Apr 2024 15:38:08 -0700 > Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > > On Wed, Apr 24, 2024 at 5:02 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote: > > > > > > At the lowest level, rethook-based kretprobes on x86-64 architecture go > > > through arch_rethoook_trampoline() function, manually written in > > > assembly, which calls into a simple arch_rethook_trampoline_callback() > > > function, written in C, and only doing a few straightforward field > > > assignments, before calling further into rethook_trampoline_handler(), > > > which handles kretprobe callbacks generically. > > > > > > Looking at simplicity of arch_rethook_trampoline_callback(), it seems > > > not really worthwhile to spend an extra function call just to do 4 or > > > 5 assignments. As such, this patch proposes to "inline" > > > arch_rethook_trampoline_callback() into arch_rethook_trampoline() by > > > manually implementing it in an assembly code. > > Yeah, I think it is possible, but this makes code ugly, that is > trade-off. As you say, we should move this with other ugly inline > assembly code into kprobe.S or something like it. With my current > fprobe-on-fgraph, rethook is only required for kretprobes, so it > is natual and simple to have kprobe.S. > Alright, I'll wait for fprobe-on-fgraph work to land, and will look at the LBR "wastage" and see what can be done to minimize it. If at that point something like this is still needed, I'll follow up separately. Thanks. > Thank you, > > > > > > > This has two motivations. First, we do get a bit of runtime speed up by > > > avoiding function calls. Using BPF selftests's bench tool, we see > > > 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe > > > triggering code path: > > > > > > BEFORE (latest probes/for-next) > > > =============================== > > > kretprobe : 10.455 ± 0.024M/s > > > kretprobe-multi: 11.150 ± 0.012M/s > > > > > > AFTER (probes/for-next + this patch) > > > ==================================== > > > kretprobe : 10.540 ± 0.009M/s (+0.8%) > > > kretprobe-multi: 11.219 ± 0.042M/s (+0.6%) > > > > > > Second, and no less importantly for some specialized use cases, this > > > avoids unnecessarily "polluting" LBR records with an extra function call > > > (recorded as a jump by CPU). This is the case for the retsnoop ([0]) > > > tool, which relies havily on capturing LBR records to provide users with > > > lots of insight into kernel internals. > > > > > > This RFC patch is only inlining this function for x86-64, but it's > > > possible to do that for 32-bit x86 arch as well and then remove > > > arch_rethook_trampoline_callback() implementation altogether. Please let > > > me know if this change is acceptable and whether I should complete it > > > with 32-bit "inlining" as well. Thanks! > > > > > > [0] https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr > > > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > > --- > > > arch/x86/kernel/asm-offsets_64.c | 4 ++++ > > > arch/x86/kernel/rethook.c | 37 +++++++++++++++++++++++++++----- > > > 2 files changed, 36 insertions(+), 5 deletions(-) > > > [...]