On Mon, Mar 09, 2020 at 06:40:45PM -0700, Alexei Starovoitov wrote: > On Mon, Mar 09, 2020 at 02:37:40PM -0400, Mathieu Desnoyers wrote: > > > > > > But what's relevant is the tracer overhead which is e.g. inflicted > > > with todays trace_hardirqs_off/on() implementation because that > > > unconditionally uses the rcuidle variant with the scru/rcu_irq dance > > > around every tracepoint. > > > > I think one of the big issues here is that most of the uses of > > trace_hardirqs_off() are from sites which already have RCU watching, > > so we are doing heavy-weight operations for nothing. > > I think kernel/trace/trace_preemptirq.c created too many problems for the > kernel without providing tangible benefits. My understanding no one is using it > in production. Hi Alexei, There are various people use the preempt/irq disable tracepoints for last 2 years at Google and ARM. There's also a BPF tool (in BCC) that uses those for tracing critical sections. Also Daniel Bristot's entire Preempt-IRQ formal verification stuff depends on it. > It's a tool to understand how kernel works. And such debugging > tool can and should be removed. If we go by that line of reasoning, then function tracing also should be removed from the kernel. I am glad Thomas and Peter are working on it and looking forward to seeing the patches, thanks, - Joel > One of Thomas's patches mentioned that bpf can be invoked from hardirq and > preempt tracers. This connection doesn't exist in a direct way, but > theoretically it's possible. There is no practical use though and I would be > happy to blacklist such bpf usage at a minimum. > > > We could use the approach proposed by Peterz's and Steven's patches to basically > > do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only enable > > RCU for those cases. We could then simply go back on using regular RCU like so: > > > > #define __DO_TRACE(tp, proto, args, cond, rcuidle) \ > > do { \ > > struct tracepoint_func *it_func_ptr; \ > > void *it_func; \ > > void *__data; \ > > bool exit_rcu = false; \ > > \ > > if (!(cond)) \ > > return; \ > > \ > > if (rcuidle && !rcu_is_watching()) { \ > > rcu_irq_enter_irqson(); \ > > exit_rcu = true; \ > > } \ > > preempt_disable_notrace(); \ > > it_func_ptr = rcu_dereference_raw((tp)->funcs); \ > > if (it_func_ptr) { \ > > do { \ > > it_func = (it_func_ptr)->func; \ > > __data = (it_func_ptr)->data; \ > > ((void(*)(proto))(it_func))(args); \ > > } while ((++it_func_ptr)->func); \ > > } \ > > preempt_enable_notrace(); \ > > if (exit_rcu) \ > > rcu_irq_exit_irqson(); \ > > } while (0) > > I think it's a fine approach interim. > > Long term sounds like Paul is going to provide sleepable and low overhead > rcu_read_lock_for_tracers() that will include bpf. > My understanding that this new rcu flavor won't have "idle" issues, > so rcu_is_watching() checks will not be necessary. > And if we remove trace_preemptirq.c the only thing left will be Thomas's points > 1 (low level entry) and 2 (breakpoints) that can be addressed without > creating fancy .text annotations and teach objtool about it. > > In the mean time I've benchmarked srcu for sleepable bpf and it's quite heavy. > srcu_read_lock+unlock roughly adds 10x execution cost to trivial bpf prog. > I'm proceeding with it anyway, but really hoping that > rcu_read_lock_for_tracers() will materialize soon. > > In general I'm sceptical that .text annotations will work. Let's say all of > idle is a red zone. But a ton of normal functions are called when idle. So > objtool will go and mark them as red zone too. This way large percent of the > kernel will be off limits for tracers. Which is imo not a good trade off. I > think addressing 1 and 2 with explicit notrace/nokprobe annotations will cover > all practical cases where people can shot themselves in a foot with a tracer. I > realize that there will be forever whack-a-mole game and these annotations will > never reach 100%. I think it's a fine trade off. Security is never 100% either. > Tracing is never going to be 100% safe too.