Hi!

I ran a basic experiment with bpftime that combines its user space
trampoline with a bpf_prog_test_run syscall to run the eBPF code in the
kernel. On my laptop it was about 2-3x faster than the original
trap-based uprobe.

The experiment code is at
https://github.com/eunomia-bpf/bpftime/blob/71f13ae80e93e8ff45e1b0320c25ff14cb25b4ba/runtime/src/bpftime_prog.cpp#L113
(That's just a PoC, not kernel patches.)

On Fri, Mar 1, 2024 at 5:27 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > >
> > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > > >
> > > > > One of uprobe pain points is having slow execution that involves
> > > > > two traps in worst case scenario or single trap if the original
> > > > > instruction can be emulated. For return uprobes there's one extra
> > > > > trap on top of that.
> > > > >
> > > > > My current idea on how to make this faster is to follow the optimized
> > > > > kprobes and replace the normal uprobe trap instruction with jump to
> > > > > user space trampoline that:
> > > > >
> > > > >   - executes syscall to call uprobe consumers callbacks
> > > >
> > > > Did you get a chance to measure relative performance of syscall vs
> > > > int3 interrupt handling? If not, do you think you'll be able to get
> > > > some numbers by the time the conference starts? This should inform the
> > > > decision whether it even makes sense to go through all the trouble.
> > >
> > > right, will do that
> >
> > I believe Yusheng measured syscall vs uprobe performance
> > difference during LPC. iirc it was something like 3x.
>
> Do you have a link to slides? Was it actual uprobe vs just some fast
> syscall (not doing BPF program execution) comparison? Or comparing the
> performance of int3 handling vs equivalent syscall handling.
>
> I suspect it's the former, and so probably not that representative.
> I'm curious about the performance of going
> userspace->kernel->userspace through int3 vs syscall (all other things
> being equal).
>
> > Certainly necessary to have a benchmark.
> > selftests/bpf/bench has one for uprobe.
> > Probably should extend with sys_bpf.
> >
> > Regarding:
> > > replace the normal uprobe trap instruction with jump to
> > > user space trampoline
> >
> > it should probably be a call to trampoline instead of a jump.
> > Unless you plan to generate a different trampoline for every location ?
> >
> > Also how would you pick a space for a trampoline in the target process ?
> > Analyze /proc/pid/maps and look for gaps in executable sections?
>
> kernel already does that for uretprobes, it adds a new "[uprobes]"
> memory mapping, so this part is already implemented

> >
> > We can start simple with a USDT that uses nop5 instead of nop1
> > and explicit single trampoline for all USDT locations
> > that saves all (callee and caller saved) registers and
> > then does sys_bpf with a new cmd.
> >
> > To replace nop5 with a call to trampoline we can use text_poke_bp
> > approach: replace 1st byte with int3, replace 2-5 with target addr,
> > replace 1st byte to make an actual call insn.
> >
> > Once patched there will be no simulation of insns or kernel traps.
> > Just normal user code that calls into trampoline, that calls sys_bpf,
> > and returns back.
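
For reference, the three-step nop5 -> call patching order quoted above can be sketched in plain C on an ordinary byte buffer. This is only an illustration of the byte-level ordering, not kernel code: patch_nop5_to_call() and call_target() are hypothetical helper names, the nop5 encoding shown is one common x86-64 choice, and the CPU synchronization that a real text_poke_bp-style patch needs between steps is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INSN_LEN    5     /* nop5 and "call rel32" are both 5 bytes long */
#define INT3_OPCODE 0xCC  /* one-byte breakpoint instruction */
#define CALL_OPCODE 0xE8  /* near relative call, followed by rel32 */

/* One common 5-byte NOP encoding on x86-64: nopl 0x0(%rax,%rax,1). */
static const uint8_t nop5[INSN_LEN] = { 0x0F, 0x1F, 0x44, 0x00, 0x00 };

/*
 * Hypothetical helper: rewrite the 5 bytes in `insn`, which would execute
 * at address `insn_addr`, into "call tramp_addr".
 * Returns 0 on success, or -1 (buffer untouched) if the trampoline is
 * not reachable with a signed 32-bit displacement.
 */
static int patch_nop5_to_call(uint8_t *insn, uint64_t insn_addr,
                              uint64_t tramp_addr)
{
    /* rel32 is relative to the address of the *next* instruction. */
    int64_t disp = (int64_t)(tramp_addr - (insn_addr + INSN_LEN));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;
    uint32_t rel32 = (uint32_t)(int32_t)disp;

    /* Step 1: first byte becomes int3, so a concurrent thread traps
     * instead of executing a half-written instruction. */
    insn[0] = INT3_OPCODE;
    /* (a real implementation would synchronize CPUs here) */

    /* Step 2: bytes 2-5 get the little-endian displacement. */
    for (int i = 0; i < 4; i++)
        insn[1 + i] = (uint8_t)(rel32 >> (8 * i));
    /* (a real implementation would synchronize CPUs here) */

    /* Step 3: first byte becomes the call opcode; the insn is complete. */
    insn[0] = CALL_OPCODE;
    return 0;
}

/* Hypothetical helper: decode the target of a patched call, for checking. */
static uint64_t call_target(const uint8_t *insn, uint64_t insn_addr)
{
    uint32_t rel32 = 0;
    for (int i = 0; i < 4; i++)
        rel32 |= (uint32_t)insn[1 + i] << (8 * i);
    return insn_addr + INSN_LEN + (uint64_t)(int64_t)(int32_t)rel32;
}
```

Copying nop5 into a 5-byte buffer and calling patch_nop5_to_call(buf, 0x1000, 0x2000) leaves buf holding E8 FB 0F 00 00, since 0x2000 - (0x1000 + 5) = 0xFFB. The range check also shows why the trampoline has to be mapped within +/-2 GiB of every patched site.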