Re: [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Nov 14, 2024 at 03:44:20PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> >
> > Putting together all the previously added pieces to support optimized
> > uprobes on top of 5-byte nop instruction.
> >
> > The current uprobe execution goes through following:
> >   - installs breakpoint instruction over original instruction
> >   - exception handler hit and calls related uprobe consumers
> >   - and either simulates original instruction or does out of line single step
> >     execution of it
> >   - returns to user space
> >
> > The optimized uprobe path
> >
> >   - checks the original instruction is 5-byte nop (plus other checks)
> >   - adds (or uses existing) user space trampoline and overwrites original
> >     instruction (5-byte nop) with call to user space trampoline
> >   - the user space trampoline executes uprobe syscall that calls related uprobe
> >     consumers
> >   - trampoline returns back to next instruction
> >
> > This approach won't speed up all uprobes as it's limited to using nop5 as
> > original instruction, but we could use nop5 as USDT probe instruction (which
> > uses single byte nop ATM) and speed up the USDT probes.
> 
> As discussed offline, it's not as simple as just replacing nop1 with
> nop5 in USDT definition due to performance considerations on old
> kernels (nop5 isn't fast as far as uprobe is concerned), but I think
> we'll be able to accommodate transparent "nop1 or nop5" behavior in
> user space, we'll just need a careful backwards compatible extension
> to USDT definition.
> 
> BTW, do you plan to send an optimization for nop5 to avoid
> single-stepping them? Ideally we'd just handle any-sized nop, so we
> don't have to do this "nop1 or nop5" acrobatics in the future.

I'll prepare that, but I'd like to check on the alternative calls
you suggested first

> 
> >
> > This patch overloads related arch functions in uprobe_write_opcode and
> > set_orig_insn so they can install call instruction if needed.
> >
> > The arch_uprobe_optimize triggers the uprobe optimization and is called after
> > first uprobe hit. I originally had it called on uprobe installation but then
> > it clashed with elf loader, because the user space trampoline was added in a
> > place where loader might need to put elf segments, so I decided to do it after
> > first uprobe hit when loading is done.
> 
> fun... ideally we wouldn't do this lazily, I just came up with another
> possible idea, but let's keep all this discussion to another thread
> with Peter
> 
> >
> > TODO release uprobe trampoline when it's no longer needed.. we might need to
> > stop all cpus to make sure no user space thread is in the trampoline.. or we
> > might just keep it, because there's just one 4GB memory region?
> 
> 4KB not 4GB, right? Yeah, probably leaving it until process exists is
> totally fine.

yep, ok

SNIP

> > +#include <asm/nops.h>
> >
> >  /* Post-execution fixups. */
> >
> > @@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = {
> >         .emulate  = push_emulate_op,
> >  };
> >
> > +static int is_nop5_insns(uprobe_opcode_t *insn)
> 
> insns -> insn?
> 
> > +{
> > +       return !memcmp(insn, x86_nops[5], 5);
> > +}
> > +
> > +static int is_call_insns(uprobe_opcode_t *insn)
> 
> ditto, insn, singular?

ok

SNIP

> > +int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > +                             uprobe_opcode_t *new_opcode, void *opt)
> > +{
> > +       if (opt) {
> > +               uprobe_opcode_t old_opcode[5];
> > +               bool is_call;
> > +
> > +               uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5);
> > +               is_call = is_call_insns((uprobe_opcode_t *) &old_opcode);
> > +
> > +               if (is_call_insns(new_opcode)) {
> > +                       if (is_call)            /* register: already installed? */
> 
> probably should check that the destination of the call instruction is
> what we expect?

ok

SNIP

> > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> > +{
> > +       unsigned long delta;
> > +
> > +       /* call instructions size */
> > +       vaddr += 5;
> > +       delta = vaddr < vtramp ? vtramp - vaddr : vaddr - vtramp;
> > +       return delta < 0xffffffff;
> 
> isn't immediate a sign extended 32-bit value (that is, int)? wouldn't
> this work and be correct:
> 
> long delta = (long)(vaddr + 5 - vtramp);
> return delta >= INT_MIN && delta <= INT_MAX;
> 
> ?

ah, right.. should be sign value :-\ thanks

jirka




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux