On (21/02/10 11:48), Muchun Song wrote:
> printk_safe_flush_on_panic() caused the following deadlock on our
> server:
>
> CPU0:                                         CPU1:
> panic                                         rcu_dump_cpu_stacks
>   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
>     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
>                                                     __printk_safe_flush
>                                                       raw_spin_lock_irqsave(&read_lock)
>     // send NMI to other processors
>     apic_send_IPI_allbutself(NMI_VECTOR)
>                                                         // NMI interrupt, dead loop
>                                                         crash_nmi_callback
>   printk_safe_flush_on_panic
>     printk_safe_flush
>       __printk_safe_flush
>         // deadlock
>         raw_spin_lock_irqsave(&read_lock)
>
> DEADLOCK: read_lock is taken on CPU1 and will never get released.
>
> It happens when panic() stops a CPU by NMI while it has been in
> the middle of printk_safe_flush().
>
> Handle the lock the same way as logbuf_lock. The printk_safe buffers
> are flushed only when both locks can be safely taken. It can avoid
> the deadlock _in this particular case_ at expense of losing contents
> of printk_safe buffers.
>
> Note: It would actually be safe to re-init the locks when all CPUs were
>       stopped by NMI. But it would require passing this information
>       from arch-specific code. It is not worth the complexity.
>       Especially because logbuf_lock and printk_safe buffers have been
>       obsoleted by the lockless ring buffer.
>
> Fixes: cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic")
> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> Reviewed-by: Petr Mladek <pmladek@xxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>

Acked-by: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx>

	-ss
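
For readers following the thread, the idea of "flush only when both locks can be safely taken" can be illustrated with a small userspace sketch. This is not the kernel patch: it stands in C11 atomic_flag spinlocks for logbuf_lock and read_lock, uses a try-lock where the real code may test the locks differently, and the helper names try_lock(), unlock() and flush_on_panic() are hypothetical. The only point is that the panic path never spins on a lock whose owner may have been stopped by NMI.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel's logbuf_lock and read_lock. */
static atomic_flag logbuf_lock = ATOMIC_FLAG_INIT;
static atomic_flag read_lock = ATOMIC_FLAG_INIT;

/* Try once; return true if the lock was free and is now held by the caller. */
static bool try_lock(atomic_flag *lock)
{
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

static void unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Panic-time flush (illustrative only). It never spins: if either lock is
 * held, its owner may be a stopped CPU that will never release it, so the
 * flush is skipped and the buffered messages are lost.
 * Returns true if the buffers were flushed, false if the flush was skipped.
 */
static bool flush_on_panic(void)
{
    if (!try_lock(&logbuf_lock))
        return false;

    if (!try_lock(&read_lock)) {
        unlock(&logbuf_lock);
        return false;
    }

    /* The per-CPU buffers would be copied to the main log here. */

    unlock(&read_lock);
    unlock(&logbuf_lock);
    return true;
}
```

If read_lock is already held when flush_on_panic() runs, as on CPU1 in the trace above, the function returns false instead of spinning forever, which matches the trade-off described in the commit message: no deadlock, at the expense of the printk_safe buffer contents.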