On Thu, Mar 7, 2024 at 6:02 PM Kui-Feng Lee <sinquersw@xxxxxxxxx> wrote:
>
>
>
> On 3/5/24 15:53, Song Liu wrote:
> > On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> >>
> >> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote:
> >>>
> >>>
> >>>
> >>> On 2/29/24 06:39, Jiri Olsa wrote:
> >>>> One of uprobe pain points is having slow execution that involves
> >>>> two traps in worst case scenario or single trap if the original
> >>>> instruction can be emulated. For return uprobes there's one extra
> >>>> trap on top of that.
> >>>>
> >>>> My current idea on how to make this faster is to follow the optimized
> >>>> kprobes and replace the normal uprobe trap instruction with jump to
> >>>> user space trampoline that:
> >>>>
> >>>> - executes syscall to call uprobe consumers callbacks
> >>>> - executes original instructions
> >>>> - jumps back to continue with the original code
> >>>>
> >>>> There are of course corner cases where above will have trouble or
> >>>> won't work completely, like:
> >>>>
> >>>> - executing original instructions in the trampoline is tricky wrt
> >>>> rip relative addressing
> >>>>
> >>>> - some instructions we can't move to trampoline at all
> >>>>
> >>>> - the uprobe address is on page boundary so the jump instruction to
> >>>> trampoline would span across 2 pages, hence the page replace won't
> >>>> be atomic, which might cause issues
> >>>>
> >>>> - ... ? many others I'm sure
> >>>>
> >>>> Still with all the limitations I think we could be able to speed up
> >>>> some amount of the uprobes, which seems worth doing.
> >>>
> >>> Just a random idea related to this.
> >>> Could we also run jit code of bpf programs in the user space to collect
> >>> information instead of going back to the kernel every time?
> >
> > I was thinking about a similar idea. I guess these user space BPF
> > programs will have limited features that we can probably use them
> > update bpf maps.
> > For this limited scope, we still need bpf_arena.
> > Otherwise, the user space bpf program will need to update the bpf
> > maps with sys_bpf(), which adds the same overhead as triggering
>
> That is true. However, even without bpf_arena, it still works with
> some workarounds without going through sys_bpf().

Anything making uprobes faster would be very welcome for my project.
The biggest performance problem for us is the cost of
bpf_probe_read_user() relative to raw memory access. Every call to
this helper walks the process' page table to check that the access
would not cause a fault (I think); this is very slow.

I wonder if there's some other option that would keep the safety
requirement for the memory access -- I'm imagining an optimistic mode
where the raw access is performed (in the target process' memory
space) and, in the rare case when a fault happens, the kernel would
somehow recover from the fault and fail the bpf_probe_read_user()
helper. Would something like that be technically feasible / has there
been any prior interest in faster access to user memory?

A more limited option that might be helpful would be a vectorized
version of bpf_probe_read_user() that verifies many pointers at once.

>
> > the program with a syscall.
> >
> >>
> >> sorry for late reply, do you mean like ubpf? the scope of this change
> >> is to speed up the generic uprobe, ebpf is just one of the consumers
> >
> > I guess this means we need a new syscall?
> >
> > Thanks,
> > Song
>
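
To make the "optimistic mode" idea above concrete, here is a minimal
user-space analogy in C, not kernel code. It performs the raw read
first and only reports -EFAULT if the access actually faults. The
names optimistic_read(), fault_handler() and demo() are hypothetical
and exist only for this sketch; it assumes Linux/POSIX signals and a
single thread. In the kernel the recovery would presumably rely on
exception-table fixups rather than signals.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* All names here are hypothetical; this is a user-space analogy only. */
static sigjmp_buf fault_env;

/* On a fault, unwind to the recovery point instead of dying. */
static void fault_handler(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

/*
 * Copy len bytes from src to dst without validating src up front.
 * Returns 0 on success, -EFAULT if the read faulted.
 * Not thread-safe: fault_env and the handlers are process-wide.
 */
int optimistic_read(void *dst, const void *src, size_t len)
{
	struct sigaction sa, old_segv, old_bus;
	const volatile unsigned char *s = src; /* volatile: keep the loads */
	unsigned char *d = dst;
	int ret = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = fault_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, &old_segv);
	sigaction(SIGBUS, &sa, &old_bus);

	/* savemask=1 so the signal mask is restored after the jump. */
	if (sigsetjmp(fault_env, 1) == 0) {
		for (size_t i = 0; i < len; i++)
			d[i] = s[i];      /* the raw, unchecked access */
	} else {
		ret = -EFAULT;            /* rare path: the access faulted */
	}

	sigaction(SIGSEGV, &old_segv, NULL);
	sigaction(SIGBUS, &old_bus, NULL);
	return ret;
}

/* Exercise both the fast path and the fault path. */
void demo(void)
{
	char src[8] = "uprobes";
	char dst[8] = {0};
	/* A mapped but inaccessible page stands in for a bad pointer. */
	void *bad = mmap(NULL, 4096, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	assert(bad != MAP_FAILED);
	assert(optimistic_read(dst, src, sizeof(src)) == 0);
	assert(optimistic_read(dst, bad, sizeof(dst)) == -EFAULT);
	munmap(bad, 4096);
}
```

Calling demo() from main() exercises both paths. The point of the
sketch is only that the common case pays nothing beyond the copy
itself, while the fault case is turned into an error return; whether
the in-kernel helper can be made cheaper in a similar way is exactly
the open question above.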