On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> >
> > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > >
> > > > One of uprobe pain points is having slow execution that involves
> > > > two traps in worst case scenario or single trap if the original
> > > > instruction can be emulated. For return uprobes there's one extra
> > > > trap on top of that.
> > > >
> > > > My current idea on how to make this faster is to follow the optimized
> > > > kprobes and replace the normal uprobe trap instruction with jump to
> > > > user space trampoline that:
> > > >
> > > > - executes syscall to call uprobe consumers callbacks
> > >
> > > Did you get a chance to measure relative performance of syscall vs
> > > int3 interrupt handling? If not, do you think you'll be able to get
> > > some numbers by the time the conference starts? This should inform the
> > > decision whether it even makes sense to go through all the trouble.
> >
> > right, will do that
>
> I believe Yusheng measured syscall vs uprobe performance
> difference during LPC. iirc it was something like 3x.

Do you have a link to slides? Was it an actual uprobe vs just some fast
syscall (not doing BPF program execution) comparison? Or was it comparing
the performance of int3 handling vs equivalent syscall handling?

I suspect it's the former, and so probably not that representative.
I'm curious about the performance of going
userspace->kernel->userspace through int3 vs syscall (all other things
being equal).

> Certainly necessary to have a benchmark.
> selftests/bpf/bench has one for uprobe.
> Probably should extend with sys_bpf.
>
> Regarding:
> > replace the normal uprobe trap instruction with jump to
> user space trampoline
>
> it should probably be a call to trampoline instead of a jump.
> Unless you plan to generate a different trampoline for every location ?
>
> Also how would you pick a space for a trampoline in the target process ?
> Analyze /proc/pid/maps and look for gaps in executable sections?

The kernel already does that for uretprobes: it adds a new "[uprobes]"
memory mapping, so this part is already implemented.

>
> We can start simple with a USDT that uses nop5 instead of nop1
> and explicit single trampoline for all USDT locations
> that saves all (callee and caller saved) registers and
> then does sys_bpf with a new cmd.
>
> To replace nop5 with a call to trampoline we can use text_poke_bp
> approach: replace 1st byte with int3, replace 2-5 with target addr,
> replace 1st byte to make an actual call insn.
>
> Once patched there will be no simulation of insns or kernel traps.
> Just normal user code that calls into trampoline, that calls sys_bpf,
> and returns back.
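
For illustration, the three-step nop5 -> call sequence quoted above can
be sketched on a plain byte buffer. This is only a simulation under
stated assumptions, not kernel code: patch_nop5_to_call() and
call_rel32() are hypothetical names, the buffer stands in for the
probed instruction bytes, and the cross-CPU synchronization that a
real text_poke_bp-style patcher performs between steps is reduced to
comments. The byte encodings used (5-byte nop 0f 1f 44 00 00, int3 cc,
call rel32 e8) are the standard x86-64 ones.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { INSN_LEN = 5 };   /* nop5 and call rel32 are both 5 bytes long */
#define OP_INT3 0xCC
#define OP_CALL 0xE8

/* x86-64 5-byte nop: nopl 0x0(%rax,%rax,1) */
static const uint8_t NOP5[INSN_LEN] = { 0x0F, 0x1F, 0x44, 0x00, 0x00 };

/* rel32 is relative to the address of the instruction after the call. */
static int32_t call_rel32(uint64_t site, uint64_t target)
{
    int64_t disp = (int64_t)(target - (site + INSN_LEN));

    /* The trampoline must be within +/-2GB of the probe site. */
    assert(disp >= INT32_MIN && disp <= INT32_MAX);
    return (int32_t)disp;
}

/*
 * Hypothetical helper: turn the nop5 stored in insn[] (which lives at
 * virtual address `site`) into "call target".
 * Returns 0 on success, -1 if insn[] does not hold a nop5.
 */
static int patch_nop5_to_call(uint8_t *insn, uint64_t site, uint64_t target)
{
    uint32_t rel;

    if (memcmp(insn, NOP5, INSN_LEN) != 0)
        return -1;
    rel = (uint32_t)call_rel32(site, target);

    /* Step 1: first byte becomes int3, so a racing thread traps
     * instead of executing a half-written instruction. */
    insn[0] = OP_INT3;
    /* (a real patcher would synchronize all CPUs here) */

    /* Step 2: bytes 2-5 receive the little-endian displacement. */
    insn[1] = (uint8_t)(rel & 0xFF);
    insn[2] = (uint8_t)((rel >> 8) & 0xFF);
    insn[3] = (uint8_t)((rel >> 16) & 0xFF);
    insn[4] = (uint8_t)((rel >> 24) & 0xFF);
    /* (synchronize again) */

    /* Step 3: first byte becomes the call opcode; the insn is complete. */
    insn[0] = OP_CALL;
    return 0;
}
```

With a probe site at 0x1000 and a trampoline at 0x2000, the
displacement is 0x2000 - 0x1005 = 0xffb, so the buffer ends up as
e8 fb 0f 00 00. Because the displacement is a signed 32-bit value, the
trampoline has to be mapped within 2GB of every patched location.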