Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64

Florent Revest <revest@xxxxxxxxxxxx> · Tue, 4 Oct 2022 18:06:16 +0200

On Fri, Sep 30, 2022 at 6:07 AM Xu Kuohai <xukuohai@xxxxxxxxxx> wrote:
>
> On 9/29/2022 12:42 AM, Mark Rutland wrote:
> > On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
> >> On 9/27/2022 1:43 AM, Mark Rutland wrote:
> >>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
> >>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> >>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
> >>>>>> This series adds ftrace direct call for arm64, which is required to attach
> >>>>>> bpf trampoline to fentry.
> >>>>>>
> >>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
> >>>>>> no patch has been posted except the one I posted in [1], so this series

Hey Xu :) Sorry I wasn't more pro-active about communicating what i
was experimenting with! A lot of conversations happened off-the-list
at LPC and LSS so I was playing on the side with the ideas that got
suggested to me. I start to have a little something to share.
Hopefully if we work closer together now we can get quicker results.

> >>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
> >>>>>> direct call works regardless of the distance between the callsite and custom
> >>>>>> trampoline.
> >>>>>>
> >>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@xxxxxxxxxx/
> >>>>>>
> >>>>>> v2:
> >>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
> >>>>>>
> >>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@xxxxxxxxxxxxxxx/
> >>>>>>
> >>>>>> Xu Kuohai (4):
> >>>>>>      ftrace: Allow users to disable ftrace direct call
> >>>>>>      arm64: ftrace: Support long jump for ftrace direct call
> >>>>>>      arm64: ftrace: Add ftrace direct call support
> >>>>>>      ftrace: Fix dead loop caused by direct call in ftrace selftest
> >>>>>
> >>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> >>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
> >>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> >>>>> could work too, but I'd presume this just results in merge conflicts)?
> >>>>
> >>>> I think it makes sense for the series to go via the arm64 tree but I'd
> >>>> like Mark to have a look at the ftrace changes first.
> >>>
> >>>>  From a quick scan, I still don't think this is quite right, and as it stands I
> >>> believe this will break backtracing (as the instructions before the function
> >>> entry point will not be symbolized correctly, getting in the way of
> >>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
> >>> feedback there, as I have a mechanism in mind that wa a little simpler.
> >>
> >> Thanks for the review. I have some thoughts about reliable stacktrace.
> >>
> >> If PC is not in the range of literal_call, stacktrace works as before without
> >> changes.
> >>
> >> If PC is in the range of literal_call, for example, interrupted by an
> >> irq, I think there are 2 problems:
> >>
> >> 1. Caller LR is not pushed to the stack yet, so caller's address and name
> >>     will be missing from the backtrace.
> >>
> >> 2. Since PC is not in func's address range, no symbol name will be found, so
> >>     func name is also missing.
> >>
> >> Problem 1 is not introduced by this patchset, but the occurring probability
> >> may be increased by this patchset. I think this problem should be addressed by
> >> a reliable stacktrace scheme, such as ORC on x86.
> >
> > I agree problem 1 is not introduced by this patch set; I have plans fo how to
> > address that for reliable stacktrace based on identifying the ftrace
> > trampoline. This is one of the reasons I do not want direct calls, as
> > identifying all direct call trampolines is going to be very painful and slow,
> > whereas identifying a statically allocated ftrace trampoline is far simpler.
> >
> >> Problem 2 is indeed introduced by this patchset. I think there are at least 3
> >> ways to deal with it:
> >
> > What I would like to do here, as mentioned previously in other threads, is to
> > avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
> > each patch-site with a specific set of ops, and invoke that directly from the
> > regular ftrace trampoline.
> >
> > With that, the patch site would look like:
> >
> >       pre_func_literal:
> >               NOP             // Patched to a pointer to
> >               NOP             // ftrace_ops
> >       func:
> >               < optional BTI here >
> >               NOP             // Patched to MOV X9, LR
> >               NOP             // Patched to a BL to the ftrace trampoline
> >
> > ... then in the ftrace trampoline we can recover the ops pointer at a negative
> > offset from the LR based on the LR, and invoke the ops from there (passing a
> > struct ftrace_regs with the saved regs).
> >
> > That way the patch-site is less significantly affected, and there's no impact
> > to backtracing. That gets most of the benefit of the direct calls avoiding the
> > ftrace ops list traversal, without having to do anything special at all. That
> > should be much easier to maintain, too.
> >
> > I started implementing that before LPC (and you can find some branches on my
> > kernel.org repo), but I haven't yet had the time to rebase those and sort out
> > the remaining issues:
> >
> >    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
> >
>
> I've read this code before, but it doesn't run and since you haven't updated

I also tried to use this but indeed the "TODO: mess with protection to
set this" in  5437aa788d needs to be addressed before we can use it.

> it, I assumed you dropped it :(
>
> This approach seems appropriate for dynamic ftrace trampolines, but I think
> there are two more issues for bpf.
>
> 1. bpf trampoline was designed to be called directly from fentry (located in
>     kernel function or bpf prog). So to make it work as ftrace_op, we may end
>     up with two different bpf trampoline types on arm64, one for bpf prog and
>     the other for ftrace.
>
> 2. Performance overhead, as we always jump to a static ftrace trampoline to
>     construct execution environment for bpf trampoline, then jump to the bpf
>     trampoline to construct execution environment for bpf prog, then jump to
>     the bpf prog, so for some small bpf progs or hot functions, the calling
>     overhead may be unacceptable.

>From the conversations I've had at LPC, Steven, Mark, Jiri and Masami
(all in CC) would like to see an ftrace ops based solution (or rather,
something that doesn't require direct calls) for invoking BPF tracing
programs. I figured that the best way to move forward on the question
of whether the performance impact of that would be acceptable or not
is to just build it and measure it. I understand you're testing your
work on real hardware (I work on an emulator at the moment) , would
you be able to compare the impact of my proof of concept branch with
your direct call based approach ?

https://github.com/FlorentRevest/linux/commits/fprobe-min-args

I first tried to implement this as an ftrace op myself but realized I
was re-implementing a lot of the function graph tracer. So I then
tried to use the function graph tracer API but realized I was missing
some features which Steven had addressed in an RFC few years back. So
I rebuilt on that until I realized Masami has been upstreaming the
fprobe and rethook APIs as spiritual successors of Steven's RFC... So
I've now rebuilt yet another proof of concept based on fprobe and
rethook.

That branch is still very much WIP and there are a few things I'd like
to address before sending even an RFC (when kretprobe is built on
rethook for example, I construct pt_regs on the stack in which I copy
the content of ftrace_regs... or program linking/unlinking is racy
right now) but I think it's good enough for performance measurements
already. (fentry_fexit and lsm tests pass)

> > Note that as a prerequisite for that I also want to reduce the set of registers
> > we save/restore down to the set required by our calling convention, as the
> > existing pt_regs is both large and generally unsound (since we can not and do
> > not fill in many of the fields we only acquire at an exception boundary).
> > That'll further reduce the ftrace overhead generally, and remove the needs for
> > the two trampolines we currently have. I have a WIP at:
> >
> >    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs

Note that I integrated this work to my branch too. I extended it to
also have fprobe and rethook save and pass ftrace_regs structures to
their callbacks. Most performance improvements would come from your
arm64/ftrace/per-callsite-ops branch but we'd need to fix the above
TODO for it to work.

> > I intend to get back to both of those shortly (along with some related bits for
> > kretprobes and stacktracing); I just haven't had much time recently due to
> > other work and illness.
> >
>
> Sorry for that, hope you getting better soon.

Oh, that sucks. Get better Mark!