On 5/24/19 3:37 PM, Daniel Borkmann wrote:
> On 05/25/2019 12:20 AM, Yonghong Song wrote:
>> On 5/24/19 2:39 PM, Daniel Borkmann wrote:
>>> On 05/23/2019 11:47 PM, Yonghong Song wrote:
>>>> This patch tries to solve the following specific use case.
>>>>
>>>> Currently, a bpf program can already collect stack traces
>>>> through the kernel function get_perf_callchain()
>>>> when certain events happen (e.g., a cache miss counter or
>>>> cpu clock counter overflows). But such stack traces are
>>>> not enough for jitted programs, e.g., hhvm (jitted php).
>>>> To get the real stack trace, the jit engine's internal data
>>>> structures need to be traversed in order to get the real user
>>>> functions.
>>>>
>>>> The bpf program itself may not be the best place to traverse
>>>> the jit engine, as the traversing logic could be complex and
>>>> it is not a stable interface either.
>>>>
>>>> Instead, hhvm implements a signal handler,
>>>> e.g. for SIGALRM, and a set of program locations at which
>>>> it can dump stack traces. When it receives a signal, it will
>>>> dump the stack at the next such program location.
>>>>
>>>> Such a mechanism can be implemented in the following way:
>>>>   . a perf ring buffer is created between the bpf program
>>>>     and the tracing app.
>>>>   . once a particular event happens, the bpf program writes
>>>>     to the ring buffer and the tracing app gets notified.
>>>>   . the tracing app sends a signal SIGALRM to hhvm.
>>>>
>>>> But this method could have large delays, causing the profiling
>>>> results to be skewed.
>>>>
>>>> This patch implements a bpf_send_signal() helper to send
>>>> a signal to hhvm in real time, resulting in the intended
>>>> stack traces.
>>>>
>>>> Acked-by: Andrii Nakryiko <andriin@xxxxxx>
>>>> Signed-off-by: Yonghong Song <yhs@xxxxxx>
>>>> ---
>>>>  include/uapi/linux/bpf.h | 17 +++++++++-
>>>>  kernel/trace/bpf_trace.c | 72 ++++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 88 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index 63e0cf66f01a..68d4470523a0 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>>   *		0 on success.
>>>>   *
>>>>   *		**-ENOENT** if the bpf-local-storage cannot be found.
>>>> + *
>>>> + * int bpf_send_signal(u32 sig)
>>>> + *	Description
>>>> + *		Send signal *sig* to the current task.
>>>> + *	Return
>>>> + *		0 on success or successfully queued.
>>>> + *
>>>> + *		**-EBUSY** if work queue under nmi is full.
>>>> + *
>>>> + *		**-EINVAL** if *sig* is invalid.
>>>> + *
>>>> + *		**-EPERM** if no permission to send the *sig*.
>>>> + *
>>>> + *		**-EAGAIN** if bpf program can try again.
>>>>   */
>>>>  #define __BPF_FUNC_MAPPER(FN)		\
>>>>  	FN(unspec),			\
>>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>>>  	FN(strtol),			\
>>>>  	FN(strtoul),			\
>>>>  	FN(sk_storage_get),		\
>>>> -	FN(sk_storage_delete),
>>>> +	FN(sk_storage_delete),		\
>>>> +	FN(send_signal),
>>>>
>>>>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>>   * function eBPF program intends to call
>>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>>> index f92d6ad5e080..70029eafc71f 100644
>>>> --- a/kernel/trace/bpf_trace.c
>>>> +++ b/kernel/trace/bpf_trace.c
>>>> @@ -567,6 +567,63 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>>>>  	.arg3_type	= ARG_ANYTHING,
>>>>  };
>>>>
>>>> +struct send_signal_irq_work {
>>>> +	struct irq_work irq_work;
>>>> +	struct task_struct *task;
>>>> +	u32 sig;
>>>> +};
>>>> +
>>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>>> +
>>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>>> +{
>>>> +	struct send_signal_irq_work *work;
>>>> +
>>>> +	work = container_of(entry, struct send_signal_irq_work, irq_work);
>>>> +	group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, PIDTYPE_TGID);
>>>> +}
>>>> +
>>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>>> +{
>>>> +	struct send_signal_irq_work *work = NULL;
>>>> +
>>>> +	/* Similar to bpf_probe_write_user, task needs to be
>>>> +	 * in a sound condition and kernel memory access be
>>>> +	 * permitted in order to send signal to the current
>>>> +	 * task.
>>>> +	 */
>>>> +	if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
>>>> +		return -EPERM;
>>>> +	if (unlikely(uaccess_kernel()))
>>>> +		return -EPERM;
>>>> +	if (unlikely(!nmi_uaccess_okay()))
>>>> +		return -EPERM;
>>>> +
>>>> +	if (in_nmi()) {
>>>> +		work = this_cpu_ptr(&send_signal_work);
>>>> +		if (work->irq_work.flags & IRQ_WORK_BUSY)
>>>
>>> Given here and in stackmap are the only two users outside of kernel/irq_work.c,
>>> it may probably be good to add a small helper to include/linux/irq_work.h and
>>> use it for both.
>>>
>>> Perhaps something like ...
>>>
>>> static inline bool irq_work_busy(struct irq_work *work)
>>> {
>>> 	return READ_ONCE(work->flags) & IRQ_WORK_BUSY;
>>> }
>>
>> Not sure whether READ_ONCE is needed here or not.
>>
>> The irq_work is a per-cpu data structure,
>>   static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>> so presumably there is no collision for the work->flags memory reference.
>
> The busy bit you're testing is cleared via cmpxchg(), kernel/irq_work.c +169:
>
>   cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);

Looks like for the bpf case we have bpf_prog_active guarding, so a nested
nmi won't trigger a nested bpf_send_signal() call; we should be fine
without READ_ONCE here.

Also,
  struct irq_work {
	unsigned long flags;
	struct llist_node llnode;
	void (*func)(struct irq_work *);
  };

-bash-4.4$ egrep -r 'work->flags' irq_work.c
	flags = work->flags & ~IRQ_WORK_PENDING;
	oflags = cmpxchg(&work->flags, flags, nflags);
	if (work->flags & IRQ_WORK_LAZY) {
		flags = work->flags & ~IRQ_WORK_PENDING;
		xchg(&work->flags, flags);
	(void)cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
	while (work->flags & IRQ_WORK_BUSY)
-bash-4.4$

From the above, `flags` is an unsigned long and irq_work.c itself reads it
without READ_ONCE, so people may already assume a plain read of it is
atomic; that is why I did not use READ_ONCE here either.
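For reference, below is a minimal sketch of how a tracing application could
use the new helper from a perf_event bpf program once this patch lands. It is
only an illustration, not part of the patch: the section name, program name,
and the choice of SIGUSR1 (signal number 10 on x86-64) are assumptions, and it
presumes bpf_send_signal() shows up in the helper declarations pulled in by
bpf_helpers.h after the corresponding uapi/selftest updates.

	/* Sketch only: perf_event program that queues a signal to the
	 * sampled task so its handler can dump the jitted stack at the
	 * next safe program location.
	 */
	#include <linux/bpf.h>
	#include <linux/bpf_perf_event.h>
	#include <bpf/bpf_helpers.h>

	SEC("perf_event")
	int profile_send_signal(struct bpf_perf_event_data *ctx)
	{
		/* Send the signal in real time; when running in NMI
		 * context the helper defers delivery via irq_work, as
		 * implemented in bpf_trace.c above.
		 */
		bpf_send_signal(10 /* SIGUSR1 on x86-64, illustrative */);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";

User space would then open a sampling perf event (e.g. cpu-clock or a cache
miss counter with a sample period) and attach this program to it, for example
with libbpf's bpf_program__attach_perf_event(), so the signal is delivered on
each sampled event instead of going through a ring buffer round trip.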