On Wed, Aug 19, 2020 at 05:32:50PM +0200, peterz@xxxxxxxxxxxxx wrote:
> On Wed, Aug 19, 2020 at 08:39:13PM +1000, Alexey Kardashevskiy wrote:
>
> > > or current upstream?
> >
> > The upstream 18445bf405cb (13 hours old) also shows the problem. Yours
> > 1/2 still fixes it.
>
> Afaict that just reduces the window.
>
> Isn't the problem that:
>
>   arch/powerpc/kernel/exceptions-64e.S
>
>     START_EXCEPTION(perfmon);
>     NORMAL_EXCEPTION_PROLOG(0x260, BOOKE_INTERRUPT_PERFORMANCE_MONITOR,
>                             PROLOG_ADDITION_NONE)
>     EXCEPTION_COMMON(0x260)
>     INTS_DISABLE
>     #  RECONCILE_IRQ_STATE
>     #    TRACE_DISABLE_INTS
>     #      TRACE_WITH_FRAME_BUFFER(trace_hardirqs_off)
>     #
>     # but we haven't done nmi_enter() yet... whoopsy
>
>     CHECK_NAPPING()
>     addi    r3,r1,STACK_FRAME_OVERHEAD
>     bl      performance_monitor_exception
>     #  perf_irq()
>     #    perf_event_interrupt
>     #      __perf_event_interrupt
>     #        nmi_enter()
>
>
>
> That is, afaict your entry code is buggered.

That is, patch 1/2 doesn't change the case:

    local_irq_enable()
      trace_hardirqs_on()
      <NMI>
        trace_hardirqs_off()
        ...
        if (regs_irqs_disabled(regs)) // false
          trace_hardirqs_on();
      </NMI>
      raw_local_irq_enable()

Where local_irq_enable() has done trace_hardirqs_on() and the NMI hits
and undoes it, but doesn't re-do it because the hardware state is still
disabled.

What's supposed to happen is:

    <NMI>
      nmi_enter()
      trace_hardirqs_off() // no-op, because in_nmi() (or previously
                           // because lockdep_off())
      ...
    </NMI>
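
To make the two sequences above concrete, here is a rough userspace toy
model of the race. It is not kernel code: every toy_* name and the
struct are made up for illustration and only mimic the functions named
above. It assumes the exit path re-traces "on" only when the interrupted
context had interrupts enabled in hardware, which is how I read the
"doesn't re-do it" remark, and that tracing is a no-op between
nmi_enter() and nmi_exit().

```c
#include <stdbool.h>

/* Toy stand-in for per-CPU state; not the kernel's data structures. */
struct toy_cpu {
	bool hw_irqs_enabled;      /* what the hardware state says */
	bool lockdep_irqs_enabled; /* what lockdep believes */
	int  nmi_depth;            /* non-zero between nmi_enter()/nmi_exit() */
};

void toy_trace_hardirqs_on(struct toy_cpu *c)
{
	if (c->nmi_depth)          /* assumed: tracing is a no-op in NMI */
		return;
	c->lockdep_irqs_enabled = true;
}

void toy_trace_hardirqs_off(struct toy_cpu *c)
{
	if (c->nmi_depth)
		return;
	c->lockdep_irqs_enabled = false;
}

void toy_nmi_enter(struct toy_cpu *c) { c->nmi_depth++; }
void toy_nmi_exit(struct toy_cpu *c)  { c->nmi_depth--; }

/* Entry order as in the quoted code: trace_hardirqs_off() runs first. */
void toy_nmi_buggy(struct toy_cpu *c)
{
	bool regs_irqs_enabled = c->hw_irqs_enabled; /* interrupted state */

	toy_trace_hardirqs_off(c);  /* clobbers lockdep's view */
	toy_nmi_enter(c);
	/* ... handler body ... */
	toy_nmi_exit(c);
	if (regs_irqs_enabled)      /* false inside the race window */
		toy_trace_hardirqs_on(c);
}

/* Intended order: nmi_enter() first, so the trace call does nothing. */
void toy_nmi_fixed(struct toy_cpu *c)
{
	toy_nmi_enter(c);
	toy_trace_hardirqs_off(c);  /* no-op, nmi_depth != 0 */
	/* ... handler body ... */
	toy_nmi_exit(c);
}

/*
 * Model local_irq_enable() with an NMI landing between its two steps.
 * Returns true if lockdep and the hardware state agree afterwards.
 */
bool toy_local_irq_enable_with_nmi(void (*nmi)(struct toy_cpu *))
{
	struct toy_cpu c = { false, false, 0 };

	toy_trace_hardirqs_on(&c);  /* lockdep now says "enabled" */
	nmi(&c);                    /* NMI hits in the window */
	c.hw_irqs_enabled = true;   /* raw_local_irq_enable() */

	return c.lockdep_irqs_enabled == c.hw_irqs_enabled;
}
```

In this model toy_local_irq_enable_with_nmi(toy_nmi_buggy) returns false
(lockdep is left believing interrupts are off while they are on), and
toy_local_irq_enable_with_nmi(toy_nmi_fixed) returns true.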