Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Tue, 2 Feb 2021 11:45:41 +0100

On Sat, Jan 30, 2021 at 07:44:10AM -0500, Steven Rostedt wrote:
> On Sat, 30 Jan 2021 09:28:32 +0100
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> 
> > On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> > > Specifically, kprobe and ftrace callbacks may have this:
> > > 
> > > 	if (in_nmi())
> > > 		return;
> > > 
> > > 	raw_spin_lock_irqsave(&lock, flags);
> > > 	[..]
> > > 	raw_spin_unlock_irqrestore(&lock, flags);
> > > 
> > > Which is totally fine to have,  
> > 
> > Why? There's a distinct lack of explaining here.
> > 
> > Note that we ripped out all such dodgy locking from kretprobes.
> 
> Actually, I think you helped explain the distinction. You mention
> "kretpobes" do you mean the infrastructure of kretprobes or all its
> users?

kretprobe infra.

> The infrastructure of ftrace and kprobes can work in any context, it
> does not mean that the callbacks must. Again, these are more like
> exceptions. Why have "in_nmi()"? If anything that can be called by an
> NMI should just work, right? That's basically your argument for having
> ftrace and kprobes set "in_nmi".
> 
> You can have locking in NMIs if the locking is *only* in NMI handlers,
> right? If that's the case, then so should ftrace and kprobe callbacks.

Which is still dodgy as heck. NMIs _can_ nest. Now mostly it doesn't
happen, and the few sites that do use spinlocks from NMI context are
sure to use it from a specific NMI 'handler' context which typically
don't nest.

But I still utterly detest the idea of using spinlocks from NMI. It's
inherently fragile.

> The stack tracer checks the size of the stack, compares it to the
> largest recorded size, and if it's bigger, it will save the stack. But
> if this happens on two CPUs at the same time, only one can do the
> recording at the same time. To synchronize this, a spin lock must be
> taken. Similar to spin locks in an NMI.

That sounds like something cmpxchg() should be able to do.

Have a per-cpu stack trace buffer and a global max one, when cpu local
exceeds previous max, cmpxchg the buffer.

> But the problem here is, the callbacks can also be done from an NMI
> context, so if we are in NMI, we don't want to take any locks, and
> simply don't record the stack traces from NMIs.

Which is obviously shit :-) The NMI might have interesting stack usage.

> The more I think about it, the more I hate the idea that ftrace
> callbacks and kprobes are considered NMIs. Simply because they are not!

Yet they happen when IRQs are off, so they are ;-)

Also, given how everything can nest, it had better all be lockless
anyway. You can get your regular function trace interrupted, which can
hit a #DB, which can function trace, which can #BP which can function
trace again which can get #NMI etc.. Many wonderfun nestings possible.

And god knows what these handlers end up calling.

The only sane approach is treating it all as NMI and having it all
lockless.