On Wed, Aug 23, 2023 at 5:16 PM Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx> wrote:
>
> diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
> index c60d0d9f1a95..90ad28260a9f 100644
> --- a/kernel/trace/trace_fprobe.c
> +++ b/kernel/trace/trace_fprobe.c
> +#else /* CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST */
> +
> +/* Since fprobe handlers can be nested, pt_regs buffer need to be a stack */
> +#define PERF_FPROBE_REGS_MAX 4
> +
> +struct pt_regs_stack {
> +	struct pt_regs regs[PERF_FPROBE_REGS_MAX];
> +	int idx;
> +};
> +
> +static DEFINE_PER_CPU(struct pt_regs_stack, perf_fprobe_regs);
> +
> +static __always_inline
> +struct pt_regs *perf_fprobe_partial_regs(struct ftrace_regs *fregs)
> +{
> +	struct pt_regs_stack *stack = this_cpu_ptr(&perf_fprobe_regs);
> +	struct pt_regs *regs;
> +
> +	if (stack->idx < PERF_FPROBE_REGS_MAX) {
> +		regs = stack->regs[stack->idx++];

This is missing an &:

	regs = &stack->regs[stack->idx++];

> +		return ftrace_partial_regs(fregs, regs);

I think this is incorrect on arm64, and it will likely cause very subtle
failure modes down the line on other architectures too. The problem on
arm64 is that Perf calls user_mode(regs) somewhere down the line; that
macro reads the "pstate" register, which is not populated in
ftrace_regs, so it is not copied into a "partial" pt_regs either, and
Perf can take wrong decisions based on it.
I already mentioned this problem in the past:
- in the third answer block of:
  https://lore.kernel.org/all/CABRcYmJjtVq-330ktqTAUiNO1=yG_aHd0xz=c550O5C7QP++UA@xxxxxxxxxxxxxx/
- in the fourth answer block of:
  https://lore.kernel.org/all/CABRcYm+esb8J2O1v6=C+h+HSa5NxraPUgo63w7-iZj0CXbpusg@xxxxxxxxxxxxxx/

It is quite possible that other architectures will at some point
introduce a light ftrace "args" trampoline that misses one of the
registers expected by Perf, because they don't realize that this
trampoline calls fprobe, which calls Perf, which has specific register
expectations.

We got the green light from Alexei to use ftrace_partial_regs for "BPF
multi_kprobe" because those BPF programs can gracefully deal with sparse
pt_regs, but I think a similar conversation needs to happen with the
Perf folks.

----

On a side note, a subtle difference between ftrace_partial_regs with and
without HAVE_PT_REGS_TO_FTRACE_REGS_CAST is that one does a copy and the
other does not. If a subsystem receives a partial regs under
HAVE_PT_REGS_TO_FTRACE_REGS_CAST, it can modify register fields and the
modified values will be restored by the ftrace trampoline. Without
HAVE_PT_REGS_TO_FTRACE_REGS_CAST, only the copy will be modified and
ftrace won't restore the changes.

I think the least we can do is to document thoroughly the guarantees of
the ftrace_partial_regs API:
- users shouldn't rely on modifying the resulting regs, because
  depending on the architecture this can do different things;
- users shouldn't rely on any register that isn't covered by one of the
  ftrace_regs_get_* helpers, because it can be unpopulated on some
  architectures.

I believe this is the case for BPF multi_kprobe but not for Perf.
> +	}
> +	return NULL;
> +}
> +
> +static __always_inline void perf_fprobe_return_regs(struct pt_regs *regs)
> +{
> +	struct pt_regs_stack *stack = this_cpu_ptr(&perf_fprobe_regs);
> +
> +	if (WARN_ON_ONCE(regs != stack->regs[stack->idx]))

This is missing an & too:

	if (WARN_ON_ONCE(regs != &stack->regs[stack->idx]))

> +		return;
> +
> +	--stack->idx;
> +}
> +
> +#endif /* !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS || CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST */