On Mon, Nov 18, 2024 at 05:18:08PM +0900, Masami Hiramatsu wrote: > On Tue, 5 Nov 2024 14:34:01 +0100 > Jiri Olsa <jolsa@xxxxxxxxxx> wrote: > > > Putting together all the previously added pieces to support optimized > > uprobes on top of 5-byte nop instruction. > > > > The current uprobe execution goes through following: > > - installs breakpoint instruction over original instruction > > - exception handler hit and calls related uprobe consumers > > - and either simulates original instruction or does out of line single step > > execution of it > > - returns to user space > > > > The optimized uprobe path > > > > - checks the original instruction is 5-byte nop (plus other checks) > > - adds (or uses existing) user space trampoline and overwrites original > > instruction (5-byte nop) with call to user space trampoline > > - the user space trampoline executes uprobe syscall that calls related uprobe > > consumers > > - trampoline returns back to next instruction > > > > This approach won't speed up all uprobes as it's limited to using nop5 as > > original instruction, but we could use nop5 as USDT probe instruction (which > > uses single byte nop ATM) and speed up the USDT probes. > > > > This patch overloads related arch functions in uprobe_write_opcode and > > set_orig_insn so they can install call instruction if needed. > > > > The arch_uprobe_optimize triggers the uprobe optimization and is called after > > first uprobe hit. I originally had it called on uprobe installation but then > > it clashed with elf loader, because the user space trampoline was added in a > > place where loader might need to put elf segments, so I decided to do it after > > first uprobe hit when loading is done. > > > > TODO release uprobe trampoline when it's no longer needed.. we might need to > > stop all cpus to make sure no user space thread is in the trampoline.. or we > > might just keep it, because there's just one 4GB memory region? > > > > Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx> > > --- > > arch/x86/include/asm/uprobes.h | 7 ++ > > arch/x86/kernel/uprobes.c | 130 +++++++++++++++++++++++++++++++++ > > include/linux/uprobes.h | 1 + > > kernel/events/uprobes.c | 3 + > > 4 files changed, 141 insertions(+) > > > > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h > > index 678fb546f0a7..84a75ed748f0 100644 > > --- a/arch/x86/include/asm/uprobes.h > > +++ b/arch/x86/include/asm/uprobes.h > > @@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t; > > #define UPROBE_SWBP_INSN 0xcc > > #define UPROBE_SWBP_INSN_SIZE 1 > > > > +enum { > > + ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0, > > + ARCH_UPROBE_FLAG_OPTIMIZED = 1, > > +}; > > + > > struct uprobe_xol_ops; > > > > struct arch_uprobe { > > @@ -45,6 +50,8 @@ struct arch_uprobe { > > u8 ilen; > > } push; > > }; > > + > > + unsigned long flags; > > }; > > > > struct arch_uprobe_task { > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c > > index 02aa4519b677..50ccf24ff42c 100644 > > --- a/arch/x86/kernel/uprobes.c > > +++ b/arch/x86/kernel/uprobes.c > > @@ -18,6 +18,7 @@ > > #include <asm/processor.h> > > #include <asm/insn.h> > > #include <asm/mmu_context.h> > > +#include <asm/nops.h> > > > > /* Post-execution fixups. */ > > > > @@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = { > > .emulate = push_emulate_op, > > }; > > > > +static int is_nop5_insns(uprobe_opcode_t *insn) > > +{ > > + return !memcmp(insn, x86_nops[5], 5); > > Maybe better to use BYTES_NOP5 directly? ok > > > +} > > + > > +static int is_call_insns(uprobe_opcode_t *insn) > > +{ > > + return *insn == 0xe8; > > 0xe8 -> CALL_INSN_OPCODE right, thanks jirka