On Thu, Jun 9, 2022 at 6:27 AM Xu Kuohai <xukuohai@xxxxxxxxxx> wrote: > On 6/7/2022 12:35 AM, Mark Rutland wrote: > > On Thu, May 26, 2022 at 10:48:05PM +0800, Xu Kuohai wrote: > >> On 5/26/2022 6:06 PM, Mark Rutland wrote: > >>> On Thu, May 26, 2022 at 05:45:03PM +0800, Xu Kuohai wrote: > >>>> On 5/25/2022 9:38 PM, Mark Rutland wrote: > >>>>> On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote: > >>>>>> As noted in that thread, I have a few concerns which equally apply here: > >>>>> > >>>>> * Due to the limited range of BL instructions, it's not always possible to > >>>>> patch an ftrace call-site to branch to an arbitrary trampoline. The way this > >>>>> works for ftrace today relies upon knowingthe set of trampolines at > >>>>> compile-time, and allocating module PLTs for those, and that approach cannot > >>>>> work reliably for dynanically allocated trampolines. > >>>> > >>>> Currently patch 5 returns -ENOTSUPP when long jump is detected, so no > >>>> bpf trampoline is constructed for out of range patch-site: > >>>> > >>>> if (is_long_jump(orig_call, image)) > >>>> return -ENOTSUPP; > >>> > >>> Sure, my point is that in practice that means that (from the user's PoV) this > >>> may randomly fail to work, and I'd like something that we can ensure works > >>> consistently. > >>> > >> > >> OK, should I suspend this work until you finish refactoring ftrace? > > > > Yes; I'd appreciate if we could hold on this for a bit. > > > > I think with some ground work we can avoid most of the painful edge cases and > > might be able to avoid the need for custom trampolines. > > > > I'v read your WIP code, but unfortunately I didn't find any mechanism to > replace bpf trampoline in your code, sorry. > > It looks like bpf trampoline and ftrace works can be done at the same > time. I think for now we can just attach bpf trampoline to bpf prog. > Once your ftrace work is done, we can add support for attaching bpf > trampoline to regular kernel function. Is this OK? Hey Mark and Xu! :) I'm interested in this feature too and would be happy to help. I've been trying to understand what you both have in mind to figure out a way forward, please correct me if I got anything wrong! :) It looks like, currently, there are three places where an indirection to BPF is technically possible. Chronologically these are: - the function's patchsite (currently there are 2 nops, this could become 4 nops with Mark's series on per call-site ops) - the ftrace ops (currently called by iterating over a global list but could be called more directly with Mark's series on per-call-site ops or by dynamically generated branches with Wang's series on dynamic trampolines) - a ftrace trampoline tail call (currently, this is after restoring a full pt_regs but this could become an args only restoration with Mark's series on DYNAMIC_FTRACE_WITH_ARGS) If we first consider the situation when only a BPF program is attached to a kernel function: - Using the patchsite for indirection (proposed by Xu, same as on x86) Pros: - We have BPF trampolines anyway because they are required for orthogonal features such as calling BPF programs as functions, so jumping into that existing JITed code is straightforward - This has the minimum overhead (eg: these trampolines only save the actual number of args used by the function in ctx and avoid indirect calls) Cons: - If the BPF trampoline is JITed outside BL's limits, attachment can randomly fail - Using a ftrace op for indirection (proposed by Mark) Pros: - BPF doesn't need to care about BL's range, ftrace_caller will be in range Cons: - The ftrace trampoline would first save all args in an ftrace_regs only for the BPF op to then re-save them in a BPF ctx array (as per BPF calling convention) so we'd effectively have to do the work of saving args twice - BPF currently uses DYNAMIC_FTRACE_WITH_DIRECT_CALLS APIs. Either arm64 should implement DIRECT_CALLS with... an indirect call :) (that is, the arch_ftrace_set_direct_caller op would turn back its ftrace_regs into arguments for the BPF trampoline) or BPF would need to use a different ftrace API just on arm64 (to define new ops, which, unless if they would be dynamically JITed, wouldn't be as performant as the existing BPF trampolines) - Using a ftrace trampoline tail call for indirection (not discussed yet iiuc) Pros: - BPF also doesn't need to care about BL's range - This also leverages the existing BPF trampolines Cons: - This also does the work of saving/restoring arguments twice - DYNAMIC_FTRACE_WITH_DIRECT_CALLS depends on DYNAMIC_FTRACE_WITH_REGS now although in practice the registers kept by DYNAMIC_FTRACE_WITH_ARGS should be enough to call BPF trampolines If we consider the situation when both ftrace ops and BPF programs are attached to a kernel function: - Using the patchsite for indirection can't solve this - Using a ftrace op for indirection (proposed by Mark) or using a ftrace trampoline tail call as an indirection (proposed by Xu, same as on x86) have the same pros & cons as in the BPF only situation except that this time we pay the cost of registers saving twice for good reasons (we need args in both ftrace_regs and the BPF ctx array formats anyway) Unless I'm missing something, it sounds like the following approach would work: - Always patch patchsites with calls to ftrace trampolines (within BL ranges) - Always go through ops and have arch_ftrace_set_direct_caller set ftrace_regs->direct_call (instead of pt_regs->orig_x0 in this patch) - If ftrace_regs->direct_call != 0 at the end of the ftrace trampoline, tail call it Once Mark's series on DYNAMIC_FTRACE_WITH_ARGS is merged, we would need to have DYNAMIC_FTRACE_WITH_DIRECT_CALLS depend on DYNAMIC_FTRACE_WITH_REGS || DYNAMIC_FTRACE_WITH_ARGS BPF trampolines (the only users of this API now) only care about args to the attachment point anyway so I think this would work transparently ? Once Mark's series on per-callsite ops is merged, the second step (going through ops) would be significantly faster in the situation where only one program is used, therefore one arch_ftrace_set_direct_caller op. Once Wang's series on dynamic trampolines is merged, the second step (going through ops) would also be significantly faster in the case when multiple ops are attached. What are your thoughts? If this sounds somewhat sane, I'm happy to help out with the implementation as well :) Thanks! Florent