Sorry for the previous empty email; I clicked send by accident.

> On May 19, 2019, at 11:06 AM, Kairui Song <kasong@xxxxxxxxxx> wrote:
>
> On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>
>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>> some other bpf functions) is now broken, or rather, starting an unwind
>>> directly inside a bpf program ends up strangely. It gives the following
>>> kernel message:
>>
>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>> follow.
>
> bpf_get_stackid_tp will just use the regs passed to it from the
> tracepoint, and it will eventually call perf_get_callchain to get the
> call chain.
>
> With a tracepoint we have the fake regs, so the unwinder starts from
> where it is called, uses the fake regs as the indicator of the target
> frame it wants, and keeps unwinding until it reaches the actual call
> site.
>
> But if the stack trace is started within a bpf func call then it's
> broken...
>
> If the unwinder could trace back through the bpf func call then there
> would be no such problem.
>
> For the frame pointer unwinder, setting the indicator flag
> (X86_EFLAGS_FIXED) before the bpf call, and making sure bp is also
> dumped, could fix it (so it will start using the regs for bpf calls,
> like before commit d15d356887e7). But for ORC I don't see a clear way
> to fix the problem. A first thought is to dump some of the callee's
> regs for ORC (IP, BP, SP, DI, DX, R10, R13, anything else?) in the
> tracepoint handler, then use the flag to tell ORC to do one more
> unwind (because we can't get the caller's regs, we get the callee's
> regs instead) before actually producing output?
>
> I had a try: for the frame pointer unwinder, mark the indicator flag
> before calling the bpf functions, and dump bp as well in the
> tracepoint. Then, with frame pointers, it works and the test passes:
>
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 1392d5e6e8d6..6f1192e9776b 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
>
>  #include <asm/stacktrace.h>
>
> +#ifdef CONFIG_FRAME_POINTER
> +static inline unsigned long caller_frame_pointer(void)
> +{
> +        return (unsigned long)__builtin_frame_address(1);
> +}
> +#else
> +static inline unsigned long caller_frame_pointer(void)
> +{
> +        return 0;
> +}
> +#endif
> +
>  /*
>   * We abuse bit 3 from flags to pass exact information, see perf_misc_flags
>   * and the comment with PERF_EFLAGS_EXACT.
>   */
>  #define perf_arch_fetch_caller_regs(regs, __ip) {                   \
>          (regs)->ip = (__ip);                                        \
> +        (regs)->bp = caller_frame_pointer();                        \
>          (regs)->sp = (unsigned long)__builtin_frame_address(0);     \
>          (regs)->cs = __KERNEL_CS;                                   \
>          regs->flags = 0;                                            \
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index abbd4b3b96c2..ca7b95ee74f0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
>                                 struct task_struct *task)
>  {
>          if (bpf_prog_array_valid(call)) {
> +                regs->flags |= X86_EFLAGS_FIXED;
>                  *(struct pt_regs **)raw_data = regs;
>                  if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
>                          perf_swevent_put_recursion_context(rctx);
> @@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event,
>          int ret = 0;
>
>          ctx.regs = perf_arch_bpf_user_pt_regs(regs);
> +        ctx.regs->flags |= X86_EFLAGS_FIXED;
> +
>          preempt_disable();
>          if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
>                  goto out;
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index f92d6ad5e080..e1fa656677dc 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
>          };
>
>          perf_fetch_caller_regs(regs);
> +        regs->flags |= X86_EFLAGS_FIXED;
> +
>          perf_sample_data_init(sd, 0, 0);
>          sd->raw = &raw;
>
> @@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
>          struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
>          perf_fetch_caller_regs(regs);
> +        regs->flags |= X86_EFLAGS_FIXED;
> +
>          return ____bpf_perf_event_output(regs, map, flags, data, size);
>  }
>
> @@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args,
>          struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
>          perf_fetch_caller_regs(regs);
> +        regs->flags |= X86_EFLAGS_FIXED;
> +
>          /* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */
>          return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
>                                 flags, 0, 0);
> @@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args,
>          struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
>          perf_fetch_caller_regs(regs);
> +        regs->flags |= X86_EFLAGS_FIXED;
> +
>          return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
>                               (unsigned long) size, flags, 0);
>  }
>

I think we really cannot do something like the above, as it leaks
x86-specific code into kernel/events and kernel/trace.

> And the *_raw_tp functions fetch the regs by themselves, so that is a
> bit troublesome...
>
> ----------
>
> And another approach is to make the unwinder's direct unwinding work
> when it is called by bpf (if that is possible and reasonable). I also
> did a nasty, hacky experiment (posted below) that just forces the
> frame pointer unwinder's get_stack_info to pass for bpf; with that,
> the problem is fixed without any other workaround:
>
> diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
> index 753b8cfe8b8a..c0cfdf25f5ed 100644
> --- a/arch/x86/kernel/dumpstack_64.c
> +++ b/arch/x86/kernel/dumpstack_64.c
> @@ -166,7 +166,8 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
>          if (in_entry_stack(stack, info))
>                  goto recursion_check;
>
> -        goto unknown;
> +        goto recursion_check;
>
>  recursion_check:
>          /*

I think this one doesn't work for ORC either?
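
Coming back to the first approach: if the X86_EFLAGS_FIXED trick is
kept, one way to avoid touching regs->flags from generic code might be
an arch-overridable helper next to perf_arch_fetch_caller_regs(), with
a no-op fallback for other architectures. This is only a rough sketch,
and perf_arch_mark_exact_regs() is a made-up name, not an existing
interface:

/* arch/x86/include/asm/perf_event.h -- hypothetical x86 override:
 * mark the faked regs as an exact starting point for the unwinder. */
#define perf_arch_mark_exact_regs(regs)                         \
do {                                                            \
        (regs)->flags |= X86_EFLAGS_FIXED;                      \
} while (0)

/* include/linux/perf_event.h -- no-op fallback for other archs: */
#ifndef perf_arch_mark_exact_regs
# define perf_arch_mark_exact_regs(regs)        do { } while (0)
#endif

/* Generic callers (e.g. in kernel/trace/bpf_trace.c) would then stay
 * arch-neutral: */
        perf_fetch_caller_regs(regs);
        perf_arch_mark_exact_regs(regs);

That would mirror how perf_arch_fetch_caller_regs() itself is
overridden per architecture, though it still leaves the ORC side open.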
Thanks,
Song

> Don't know how to do it the right way, or is it even possible for all
> unwinders yet...
>
> --
> Best Regards,
> Kairui Song