On (06/06/18 13:15), Petr Mladek wrote: > On Wed 2018-06-06 14:10:29, Sergey Senozhatsky wrote: > > On (06/05/18 14:47), Petr Mladek wrote: > > [..] > > > Grr, the ABBA deadlock is still there. NMIs are not sent to the other > > > CPUs atomically. Even if we detect that logbuf_lock is available > > > in printk_nmi_enter() on some CPUs, it might still get locked on > > > another CPU before the other CPU gets NMI. > > > > Can we do something about "B"? :) I mean - any chance we can rework > > locking in nmi_cpu_backtrace()? > > I can't think of any possibility at the moment. Well, it does not mean > that it does not exist. > > The irony is that we need the extra lock in nmi_cpu_backtrace() only > because we try to store the messages into the common log buffer. > If we always use the per-CPU buffers in NMI then the lock would > not cause any problems but it also won't be necessary. Yep. I think we can come up with something. It seems that the only problem is that we want in this particular case to avoid printk_nmi()->logbuf and instead we want to enforce per-CPU printk_nmi buffers. Then we can drop nmi_cpu_backtrace() spinlock, because the messages will be serialized by printk_safe flush spin_lock. That doesn't sound like an impossible thing to do. What am I missing? Could you please check my follow up email? > > > => I suggest to revert the commit 719f6a7040f1bdaf96fcc70 > > > "printk: Use the main logbuf in NMI when logbuf_lock is available" > > > for-4.18 and stable until we get a better solution. > > > > Just random thoughts. > > > > May be we need to revert it, but let's not "panic". I think [but don't > > insist on it] that the patch in question is *probably* "good enough". It > > addresses a bug report after all. > > It was a problem reported by me. I found it when testing other changes. > The patch improved the situation definitely. The question is if it is > enough in practice. Oh, certainly. But I was talking about 719f6a7040f1bdaf96fcc70. We introduced that change in response to a bug report from Steven. He would not be able to debug his kernel otherwise, because per-CPU printk_nmi was too limited in size. So on one hand we have the problem that you reported, which you found while you were hammering/testing NMI printk-s [a valid report on its own]; on the other hand we have the problem that Steven reported, which he triggered while he was debugging the kernel. It might be the case that Steven's problem is more likely to happen in real world. So that's why I proposed to keep 719f6a7040f1bda for the time being [until we come up with another fix]. I may be wrong. -ss