On Thu, Nov 14, 2024 at 03:44:20PM -0800, Andrii Nakryiko wrote: > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote: > > > > Putting together all the previously added pieces to support optimized > > uprobes on top of 5-byte nop instruction. > > > > The current uprobe execution goes through following: > > - installs breakpoint instruction over original instruction > > - exception handler hit and calls related uprobe consumers > > - and either simulates original instruction or does out of line single step > > execution of it > > - returns to user space > > > > The optimized uprobe path > > > > - checks the original instruction is 5-byte nop (plus other checks) > > - adds (or uses existing) user space trampoline and overwrites original > > instruction (5-byte nop) with call to user space trampoline > > - the user space trampoline executes uprobe syscall that calls related uprobe > > consumers > > - trampoline returns back to next instruction > > > > This approach won't speed up all uprobes as it's limited to using nop5 as > > original instruction, but we could use nop5 as USDT probe instruction (which > > uses single byte nop ATM) and speed up the USDT probes. > > As discussed offline, it's not as simple as just replacing nop1 with > nop5 in USDT definition due to performance considerations on old > kernels (nop5 isn't fast as far as uprobe is concerned), but I think > we'll be able to accommodate transparent "nop1 or nop5" behavior in > user space, we'll just need a careful backwards compatible extension > to USDT definition. > > BTW, do you plan to send an optimization for nop5 to avoid > single-stepping them? Ideally we'd just handle any-sized nop, so we > don't have to do this "nop1 or nop5" acrobatics in the future. I'll prepare that, but I'd like to check on the alternative calls you suggested first > > > > > This patch overloads related arch functions in uprobe_write_opcode and > > set_orig_insn so they can install call instruction if needed. > > > > The arch_uprobe_optimize triggers the uprobe optimization and is called after > > first uprobe hit. I originally had it called on uprobe installation but then > > it clashed with elf loader, because the user space trampoline was added in a > > place where loader might need to put elf segments, so I decided to do it after > > first uprobe hit when loading is done. > > fun... ideally we wouldn't do this lazily, I just came up with another > possible idea, but let's keep all this discussion to another thread > with Peter > > > > > TODO release uprobe trampoline when it's no longer needed.. we might need to > > stop all cpus to make sure no user space thread is in the trampoline.. or we > > might just keep it, because there's just one 4GB memory region? > > 4KB not 4GB, right? Yeah, probably leaving it until process exists is > totally fine. yep, ok SNIP > > +#include <asm/nops.h> > > > > /* Post-execution fixups. */ > > > > @@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = { > > .emulate = push_emulate_op, > > }; > > > > +static int is_nop5_insns(uprobe_opcode_t *insn) > > insns -> insn? > > > +{ > > + return !memcmp(insn, x86_nops[5], 5); > > +} > > + > > +static int is_call_insns(uprobe_opcode_t *insn) > > ditto, insn, singular? ok SNIP > > +int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr, > > + uprobe_opcode_t *new_opcode, void *opt) > > +{ > > + if (opt) { > > + uprobe_opcode_t old_opcode[5]; > > + bool is_call; > > + > > + uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5); > > + is_call = is_call_insns((uprobe_opcode_t *) &old_opcode); > > + > > + if (is_call_insns(new_opcode)) { > > + if (is_call) /* register: already installed? */ > > probably should check that the destination of the call instruction is > what we expect? ok SNIP > > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr) > > +{ > > + unsigned long delta; > > + > > + /* call instructions size */ > > + vaddr += 5; > > + delta = vaddr < vtramp ? vtramp - vaddr : vaddr - vtramp; > > + return delta < 0xffffffff; > > isn't immediate a sign extended 32-bit value (that is, int)? wouldn't > this work and be correct: > > long delta = (long)(vaddr + 5 - vtramp); > return delta >= INT_MIN && delta <= INT_MAX; > > ? ah, right.. should be sign value :-\ thanks jirka