Re: [PATCH bpf-next v5 1/6] arm64: ftrace: Add ftrace direct call support

Florent Revest <revest@xxxxxxxxxxxx> · Tue, 9 Aug 2022 19:03:52 +0200

On Thu, Jun 9, 2022 at 6:27 AM Xu Kuohai <xukuohai@xxxxxxxxxx> wrote:
> On 6/7/2022 12:35 AM, Mark Rutland wrote:
> > On Thu, May 26, 2022 at 10:48:05PM +0800, Xu Kuohai wrote:
> >> On 5/26/2022 6:06 PM, Mark Rutland wrote:
> >>> On Thu, May 26, 2022 at 05:45:03PM +0800, Xu Kuohai wrote:
> >>>> On 5/25/2022 9:38 PM, Mark Rutland wrote:
> >>>>> On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote:
> >>>>>> As noted in that thread, I have a few concerns which equally apply here:
> >>>>>
> >>>>> * Due to the limited range of BL instructions, it's not always possible to
> >>>>>   patch an ftrace call-site to branch to an arbitrary trampoline. The way this
> >>>>>   works for ftrace today relies upon knowingthe set of trampolines at
> >>>>>   compile-time, and allocating module PLTs for those, and that approach cannot
> >>>>>   work reliably for dynanically allocated trampolines.
> >>>>
> >>>> Currently patch 5 returns -ENOTSUPP when long jump is detected, so no
> >>>> bpf trampoline is constructed for out of range patch-site:
> >>>>
> >>>> if (is_long_jump(orig_call, image))
> >>>>    return -ENOTSUPP;
> >>>
> >>> Sure, my point is that in practice that means that (from the user's PoV) this
> >>> may randomly fail to work, and I'd like something that we can ensure works
> >>> consistently.
> >>>
> >>
> >> OK, should I suspend this work until you finish refactoring ftrace?
> >
> > Yes; I'd appreciate if we could hold on this for a bit.
> >
> > I think with some ground work we can avoid most of the painful edge cases and
> > might be able to avoid the need for custom trampolines.
> >
>
> I'v read your WIP code, but unfortunately I didn't find any mechanism to
> replace bpf trampoline in your code, sorry.
>
> It looks like bpf trampoline and ftrace works can be done at the same
> time. I think for now we can just attach bpf trampoline to bpf prog.
> Once your ftrace work is done, we can add support for attaching bpf
> trampoline to regular kernel function. Is this OK?

Hey Mark and Xu! :)

I'm interested in this feature too and would be happy to help.

I've been trying to understand what you both have in mind to figure out a way
forward, please correct me if I got anything wrong! :)

It looks like, currently, there are three places where an indirection to BPF is
technically possible. Chronologically these are:

- the function's patchsite (currently there are 2 nops, this could become 4
  nops with Mark's series on per call-site ops)

- the ftrace ops (currently called by iterating over a global list but could be
  called more directly with Mark's series on per-call-site ops or by
  dynamically generated branches with Wang's series on dynamic trampolines)

- a ftrace trampoline tail call (currently, this is after restoring a full
  pt_regs but this could become an args only restoration with Mark's series on
  DYNAMIC_FTRACE_WITH_ARGS)

If we first consider the situation when only a BPF program is attached to a
kernel function:
- Using the patchsite for indirection (proposed by Xu, same as on x86)
   Pros:
   - We have BPF trampolines anyway because they are required for orthogonal
     features such as calling BPF programs as functions, so jumping into that
     existing JITed code is straightforward
   - This has the minimum overhead (eg: these trampolines only save the actual
     number of args used by the function in ctx and avoid indirect calls)
   Cons:
   - If the BPF trampoline is JITed outside BL's limits, attachment can
     randomly fail

- Using a ftrace op for indirection (proposed by Mark)
  Pros:
  - BPF doesn't need to care about BL's range, ftrace_caller will be in range
  Cons:
  - The ftrace trampoline would first save all args in an ftrace_regs only for
    the BPF op to then re-save them in a BPF ctx array (as per BPF calling
    convention) so we'd effectively have to do the work of saving args twice
  - BPF currently uses DYNAMIC_FTRACE_WITH_DIRECT_CALLS APIs. Either arm64
    should implement DIRECT_CALLS with... an indirect call :) (that is, the
    arch_ftrace_set_direct_caller op would turn back its ftrace_regs into
    arguments for the BPF trampoline) or BPF would need to use a different
    ftrace API just on arm64 (to define new ops, which, unless if they would be
    dynamically JITed, wouldn't be as performant as the existing BPF
    trampolines)

- Using a ftrace trampoline tail call for indirection (not discussed yet iiuc)
  Pros:
  - BPF also doesn't need to care about BL's range
  - This also leverages the existing BPF trampolines
  Cons:
  - This also does the work of saving/restoring arguments twice
  - DYNAMIC_FTRACE_WITH_DIRECT_CALLS depends on DYNAMIC_FTRACE_WITH_REGS now
    although in practice the registers kept by DYNAMIC_FTRACE_WITH_ARGS
    should be enough to call BPF trampolines

If we consider the situation when both ftrace ops and BPF programs are attached
to a kernel function:
- Using the patchsite for indirection can't solve this

- Using a ftrace op for indirection (proposed by Mark) or using a ftrace
  trampoline tail call as an indirection (proposed by Xu, same as on x86) have
  the same pros & cons as in the BPF only situation except that this time we
  pay the cost of registers saving twice for good reasons (we need args in both
  ftrace_regs and the BPF ctx array formats anyway)

Unless I'm missing something, it sounds like the following approach would work:
- Always patch patchsites with calls to ftrace trampolines (within BL ranges)
- Always go through ops and have arch_ftrace_set_direct_caller set
  ftrace_regs->direct_call (instead of pt_regs->orig_x0 in this patch)
- If ftrace_regs->direct_call != 0 at the end of the ftrace trampoline, tail
  call it

Once Mark's series on DYNAMIC_FTRACE_WITH_ARGS is merged, we would need to have
DYNAMIC_FTRACE_WITH_DIRECT_CALLS
  depend on DYNAMIC_FTRACE_WITH_REGS || DYNAMIC_FTRACE_WITH_ARGS
BPF trampolines (the only users of this API now) only care about args to the
attachment point anyway so I think this would work transparently ?

Once Mark's series on per-callsite ops is merged, the second step (going
through ops) would be significantly faster in the situation where only one
program is used, therefore one arch_ftrace_set_direct_caller op.

Once Wang's series on dynamic trampolines is merged, the second step (going
through ops) would also be significantly faster in the case when multiple ops
are attached.

What are your thoughts? If this sounds somewhat sane, I'm happy to help out
with the implementation as well :)

Thanks!
Florent