Hi Florent, On Tue, 8 Nov 2022 23:06:50 +0100 Florent Revest <revest@xxxxxxxxxxxx> wrote: > Hi! > > With this RFC, I'd like to revive the conversation between BPF, ARM and tracing > folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on > arm64. > > Current status of BPF tracing > ============================= > > On currently supported architectures (like x86), BPF tracing programs are > called from a JITted BPF trampoline, itself called from the ftrace patch site > thanks to the ftrace "direct call" API. (or from the end of the ftrace > trampoline if a ftrace ops is also tracing that function, but this is > transparent to BPF) > > Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be > used for struct ops programs already), but Xu's attempts at getting ftrace > direct calls support [2][3] on arm64 have been unsucessful so far so we still > do not support BPF tracing programs. This prompted me to try a different > approach. I'd like to collect feedback on it here. > > Why not direct calls ? > ====================== > > Mark and Steven have not been too keen on getting direct calls on arm64 because: > - working around BL instruction's limited range introduces complexity [4] > - it's difficult to get reliable stacktraces right with direct calls [5] > - direct calls are complex to maintain on the arch/ftrace side [5] > > In the absence of ftrace direct calls support, BPF tracing programs would need > to be called from an ftrace ops instead. Note that the BPF callback signature > would have to be different, so we can't re-use trampolines (direct called > callbacks receive arguments in registers whereas ftrace ops callbacks receive > arguments in a struct ftrace_regs pointer) > > Why fprobe ? > ============ > > Ftrace ops per-se only expose an API to hook before a function. There are two > systems built on top of ftrace ops that also allow hooking the function exit: > fprobe (using rethook) and the function graph tracer. There are plans from > Masami and Steven to unify these two systems but, as they stand, only fprobe > gives enough flexibility to implement BPF tracing. > > In order not to reinvent the wheel, if direct calls aren't available on the > arch, BPF could leverage fprobe to hook before and after the traced function. > Note that return hooking is implemented a bit differently than it is in BPF > trampolines. Instead of keeping arguments on a stack frame and calling the > traced function, rethook saves arguments in a memory pool and returns to the > traced function with a hijacked return pointer that will have its ret jump back > to the rethook trampoline. Yeah, that is designed to not change the task's stack, but it makes another list-based stack. But it eventually replaced by the per-task shadow stack. > > What about performances ? > ========================= > > In its current state, a fprobe callback on arm64 is very expensive because: > 1- the ftrace trampoline saves all registers (including many unnecessary ones) > 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow > 3- the fprobe ops unconditionally hooks a rethook > 4- rethook grabs memory from a freelist which is slow under high contention > > However, all the above points are currently being addressed: > 1- by Mark's series to save argument registers only [6] > 2- by Mark's series to call single ops directly [7] > 3- by Masami's patch to skip rethooks if not needed [8] > 4- Masami said the rethook freelist would be replaced by a per-task stack as > part of its unification with the function graph tracer [9] > > I measured the costs of BPF on different approaches on my RPi4 here: [10] > tl;dr: the BPF "bench" takes a performance hit of: > - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11] > - 66.8% w/ BPF on kprobe (just for reference) > - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12] > - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13] Great! thanks for measuring that. I've checked your tree[13] and basically it is good for me. - rethook: fprobe: Use ftrace_regs instead of pt_regs This patch may break other arch which is trying to implement rethook, so can you break it down to patches for fprobe, rethook and arch specific one? > > On top of Mark's and Masami's existing and planned work, BPF tracing programs > called from a fprobe callback become much closer to the performances of direct > calls but there's still a hit. If we want to try optimizing even further, we > could potentially go even one step further by JITting the fprobe callbacks. Would this mean JITting fprobe or callbacks? > This would let us unroll the argument loop and use direct calls instead of > indirect calls. However this would introduce quite some complexity and I expect > the performance impact should be fairly minimal. (arm64 doesn't have repotlines > so it doesn't suffer too much from indirect calls anyway) > > Two other performance discrepancies I can think of, that would stay anyway are: > 1- the ftrace trampoline saves all argument registers while BPF trampolines can > get away with only saving the number of arguments that are actually used by > that function (thanks to BTF info), consequentially, we need to convert the > ftrace_regs structure into a BPF "ctx" array which inevitably takes some > cycles Have you checked the root cause by perf? I'm not sure that makes difference so much. > 2- When a fexit program is attached, going through the rethook trampoline means > that we need to save argument registers a second time while BPF trampolines > can just rely on arguments kept on the stack I think we can make a new trampoline eventually just copying some required arguments on the shadow stack. Thanks! > > How to apply this RFC ? > ======================= > > This RFC only brings up to discussion the eventual patch that would > touch the BPF subsystem because the ftrace and fprobe optimizations it > is built on are not as controversial and already on the way. However, > this patch is meant to apply on top of Mark's and Masami's work. If > you'd like to run this patch you can use my branch. [13] > > Links > ===== > > 1: https://lore.kernel.org/bpf/20220711144722.2100039-1-xukuohai@xxxxxxxxxx/ > 2: https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@xxxxxxxxxx/ > 3: https://lore.kernel.org/bpf/20220913162732.163631-1-xukuohai@xxxxxxxxxxxxxxx/ > 4: https://lore.kernel.org/bpf/Yo4xb2w+FHhUtJNw@FVFF77S0Q05N/ > 5: https://lore.kernel.org/bpf/YzR5WSLux4mmFIXg@FVFF77S0Q05N/ > 6: https://lore.kernel.org/all/20221103170520.931305-1-mark.rutland@xxxxxxx/ > 7: https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops > 8: https://lore.kernel.org/all/166792255429.919356.14116090269057513181.stgit@devnote3/T/#m9d43fbdc91f48b03d644be77ac18017963a08df5 > 9: https://lore.kernel.org/bpf/20221024220008.48780b0f58903afed2dc8d4a@xxxxxxxxxx/ > 10: https://paste.debian.net/1260011/ > 11: https://github.com/FlorentRevest/linux/commits/bpf-direct-calls > 12: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-slow > 13: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-rfc > > Florent Revest (1): > bpf: Invoke tracing progs using fprobe on archs without direct call > > include/linux/bpf.h | 5 ++ > kernel/bpf/trampoline.c | 120 ++++++++++++++++++++++++++++++++++++++-- > 2 files changed, 121 insertions(+), 4 deletions(-) > > -- > 2.38.1.431.g37b22c650d-goog > -- Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>