On 5/23/19 9:28 AM, Daniel Borkmann wrote: > On 05/23/2019 05:58 PM, Yonghong Song wrote: >> On 5/23/19 8:41 AM, Daniel Borkmann wrote: >>> On 05/22/2019 07:39 AM, Yonghong Song wrote: >>>> This patch tries to solve the following specific use case. >>>> >>>> Currently, bpf program can already collect stack traces >>>> through kernel function get_perf_callchain() >>>> when certain events happens (e.g., cache miss counter or >>>> cpu clock counter overflows). But such stack traces are >>>> not enough for jitted programs, e.g., hhvm (jited php). >>>> To get real stack trace, jit engine internal data structures >>>> need to be traversed in order to get the real user functions. >>>> >>>> bpf program itself may not be the best place to traverse >>>> the jit engine as the traversing logic could be complex and >>>> it is not a stable interface either. >>>> >>>> Instead, hhvm implements a signal handler, >>>> e.g. for SIGALARM, and a set of program locations which >>>> it can dump stack traces. When it receives a signal, it will >>>> dump the stack in next such program location. >>>> >>>> Such a mechanism can be implemented in the following way: >>>> . a perf ring buffer is created between bpf program >>>> and tracing app. >>>> . once a particular event happens, bpf program writes >>>> to the ring buffer and the tracing app gets notified. >>>> . the tracing app sends a signal SIGALARM to the hhvm. >>>> >>>> But this method could have large delays and causing profiling >>>> results skewed. >>>> >>>> This patch implements bpf_send_signal() helper to send >>>> a signal to hhvm in real time, resulting in intended stack traces. >>>> >>>> Signed-off-by: Yonghong Song <yhs@xxxxxx> >>>> --- >>>> include/uapi/linux/bpf.h | 17 +++++++++- >>>> kernel/trace/bpf_trace.c | 67 ++++++++++++++++++++++++++++++++++++++++ >>>> 2 files changed, 83 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>>> index 63e0cf66f01a..68d4470523a0 100644 >>>> --- a/include/uapi/linux/bpf.h >>>> +++ b/include/uapi/linux/bpf.h >>>> @@ -2672,6 +2672,20 @@ union bpf_attr { >>>> * 0 on success. >>>> * >>>> * **-ENOENT** if the bpf-local-storage cannot be found. >>>> + * >>>> + * int bpf_send_signal(u32 sig) >>>> + * Description >>>> + * Send signal *sig* to the current task. >>>> + * Return >>>> + * 0 on success or successfully queued. >>>> + * >>>> + * **-EBUSY** if work queue under nmi is full. >>>> + * >>>> + * **-EINVAL** if *sig* is invalid. >>>> + * >>>> + * **-EPERM** if no permission to send the *sig*. >>>> + * >>>> + * **-EAGAIN** if bpf program can try again. >>>> */ >>>> #define __BPF_FUNC_MAPPER(FN) \ >>>> FN(unspec), \ >>>> @@ -2782,7 +2796,8 @@ union bpf_attr { >>>> FN(strtol), \ >>>> FN(strtoul), \ >>>> FN(sk_storage_get), \ >>>> - FN(sk_storage_delete), >>>> + FN(sk_storage_delete), \ >>>> + FN(send_signal), >>>> >>>> /* integer value in 'imm' field of BPF_CALL instruction selects which helper >>>> * function eBPF program intends to call >>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >>>> index f92d6ad5e080..f8cd0db7289f 100644 >>>> --- a/kernel/trace/bpf_trace.c >>>> +++ b/kernel/trace/bpf_trace.c >>>> @@ -567,6 +567,58 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = { >>>> .arg3_type = ARG_ANYTHING, >>>> }; >>>> >>>> +struct send_signal_irq_work { >>>> + struct irq_work irq_work; >>>> + u32 sig; >>>> +}; >>>> + >>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work); >>>> + >>>> +static void do_bpf_send_signal(struct irq_work *entry) >>>> +{ >>>> + struct send_signal_irq_work *work; >>>> + >>>> + work = container_of(entry, struct send_signal_irq_work, irq_work); >>>> + group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID); >>>> +} >>>> + >>>> +BPF_CALL_1(bpf_send_signal, u32, sig) >>>> +{ >>>> + struct send_signal_irq_work *work = NULL; >>>> + >>>> + /* Similar to bpf_probe_write_user, task needs to be >>>> + * in a sound condition and kernel memory access be >>>> + * permitted in order to send signal to the current >>>> + * task. >>>> + */ >>>> + if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING))) >>>> + return -EPERM; >>>> + if (unlikely(uaccess_kernel())) >>>> + return -EPERM; >>>> + if (unlikely(!nmi_uaccess_okay())) >>>> + return -EPERM; >>>> + >>>> + if (in_nmi()) { >>> >>> Hm, bit confused, can't this only be done out of process context in >>> general since only there current points to e.g. hhvm? I'm probably >>> missing something. Could you elaborate? >> >> That is true. If in nmi, it is out of process context and in nmi >> context, we use an irq_work here since group_send_sig_info() has >> spinlock inside. The bpf program (e.g., a perf_event program) needs to >> check it is with right current (e.g., by pid) before calling >> this helper. >> >> Does this address your question? Thanks, Daniel. The below are really good questions which I did not really think through with irq_work. > > Hm, but how is it guaranteed that 'current' inside the callback is still > the very same you intend to send the signal to? I went through irq_work infrastructure. It looks that "current" may change. irq_work is registered as an interrupt on x86. After nmi, some lower priority interrupts get chances to execute including irq_work. But there are some other interrupts like timer_interrupt and reschedule_interrupt may change "current". But since we are still in interrupt context, task should not be destroyed, so the task structure pointer is still valid. I will pass "current" task struct pointer to irq_work as well. This is similar to what we did in stackmap.c: work->sem = ¤t->mm->mmap_sem; irq_work_queue(&work->irq_work); At irq_work_run() time, the previous "current" in nmi should still be valid. > > What happens if you're in softirq and send SIGKILL to yourself? Is this > ignored/handled gracefully in such case? It is not ignored. It handled gracefully in this case. I tried my example to send SIGKILL. The call stack looks below. [ 24.211943] bpf_send_signal+0x9/0xb0 [ 24.212427] bpf_prog_fec6e7cc664d5b91_bpf_send_signal_test+0x228/0x1000 [ 24.213249] ? bpf_overflow_handler+0xb7/0x180 [ 24.213853] ? __perf_event_overflow+0x51/0xe0 [ 24.214385] ? perf_swevent_hrtimer+0x10a/0x160 [ 24.214965] ? __update_load_avg_cfs_rq+0x1a9/0x1c0 [ 24.215609] ? task_tick_fair+0x50/0x690 [ 24.216104] ? run_timer_softirq+0x208/0x490 [ 24.216637] ? timerqueue_del+0x1e/0x40 [ 24.217111] ? task_clock_event_del+0x10/0x10 [ 24.217658] ? __hrtimer_run_queues+0x10d/0x2c0 [ 24.218217] ? hrtimer_interrupt+0x122/0x270 [ 24.218765] ? rcu_irq_enter+0x31/0x110 [ 24.219223] ? smp_apic_timer_interrupt+0x67/0x160 [ 24.219842] ? apic_timer_interrupt+0xf/0x20 [ 24.220383] </IRQ> [ 24.220655] ? event_sched_out.isra.108+0x150/0x150 [ 24.221271] ? smp_call_function_single+0xdc/0x100 [ 24.221898] ? perf_event_sysfs_show+0x20/0x20 [ 24.222469] ? cpu_function_call+0x42/0x60 [ 24.222982] ? cpu_clock_event_read+0x10/0x10 [ 24.223525] ? event_function_call+0xe6/0xf0 [ 24.224053] ? event_sched_out.isra.108+0x150/0x150 [ 24.224657] ? perf_remove_from_context+0x20/0x70 [ 24.225262] ? perf_event_release_kernel+0x106/0x2e0 [ 24.225896] ? perf_release+0xc/0x10 [ 24.226347] ? __fput+0xc9/0x230 [ 24.226767] ? task_work_run+0x83/0xb0 [ 24.227243] ? do_exit+0x300/0xc50 [ 24.227674] ? syscall_trace_enter+0x1c9/0x2d0 [ 24.228223] ? do_group_exit+0x39/0xb0 [ 24.228695] ? __x64_sys_exit_group+0x14/0x20 [ 24.229270] ? do_syscall_64+0x49/0x130 [ 24.229762] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 I see the task is killed and other process is not impacted and no kernel crash/warning. > > I think some more elaborate comment in the code would definitely be help. Definitely will add some comments. > > Btw, you probably need to wrap it under #ifdef CONFIG_IRQ_WORK. I will check this. stackmaps.c use irq_work as well and did not have CONFIG_IRQ_WORK. Maybe we are missing there as well. > >>>> + work = this_cpu_ptr(&send_signal_work); >>>> + if (work->irq_work.flags & IRQ_WORK_BUSY) >>>> + return -EBUSY; >>>> + >>>> + work->sig = sig; >>>> + irq_work_queue(&work->irq_work); >>>> + return 0; >>>> + } >>>> + >>>> + return group_send_sig_info(sig, SEND_SIG_PRIV, current, PIDTYPE_TGID); >>>> + >>> >>> Nit: extra newline slipped in >> Thanks. Will remove this in the next revision. >>> >>>> +} >>>> + >>>> +static const struct bpf_func_proto bpf_send_signal_proto = { >>>> + .func = bpf_send_signal, >>>> + .gpl_only = false, >>>> + .ret_type = RET_INTEGER, >>>> + .arg1_type = ARG_ANYTHING, >>>> +}; >>>> + >>>> static const struct bpf_func_proto * >>>> tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) >>>> { >>>> @@ -617,6 +669,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) >>>> case BPF_FUNC_get_current_cgroup_id: >>>> return &bpf_get_current_cgroup_id_proto; >>>> #endif >>>> + case BPF_FUNC_send_signal: >>>> + return &bpf_send_signal_proto; >>>> default: >>>> return NULL; >>>> } >>>> @@ -1343,5 +1397,18 @@ static int __init bpf_event_init(void) >>>> return 0; >>>> } >>>> >>>> +static int __init send_signal_irq_work_init(void) >>>> +{ >>>> + int cpu; >>>> + struct send_signal_irq_work *work; >>>> + >>>> + for_each_possible_cpu(cpu) { >>>> + work = per_cpu_ptr(&send_signal_work, cpu); >>>> + init_irq_work(&work->irq_work, do_bpf_send_signal); >>>> + } >>>> + return 0; >>>> +} >>>> + >>>> fs_initcall(bpf_event_init); >>>> +subsys_initcall(send_signal_irq_work_init); >>>> #endif /* CONFIG_MODULES */ >>>> >>> >