Just got back from vacation. Thanks for the continued discussion. Just so I understand the current state. Looks like we've got a pretty good explanation of what's going on (though not completely sure), and backporting Steven's patches is still the way to go? I see that Sergey had sent an RFC series for similar things. Are those trying to solve the deadlock problem in a different way?On Thu, Oct 4, 2018 at 1:55 AM Sergey Senozhatsky <sergey.senozhatsky.work@xxxxxxxxx> wrote: > > On (10/04/18 10:36), Petr Mladek wrote: > > > > This looks like a reasonable explanation of what is happening here. > > It also explains why the console owner logic helped. > > Well, I'm still a bit puzzled, frankly speaking. I've two theories. > > Theory #1 [most likely] > > Steven is a wizard and his code cures whatever problem we throw it at. > > Theory #2 > > console_sem hand over actually spreads print out, so we don't have one CPU > doing all the printing job. Instead every CPU prints its backtrace, while the > CPU which issued all_cpus_backtrace() waits for them. So all_cpus_backtrace() > still has to wait for NR_CPUS * strlen(bakctrace), which still probably > truggers NMI panic on it at some point. The panic CPU send out stop IPI, then > it waits for foreign CPUs to ACK stop IPI request - for 10 seconds. So each > CPU prints its backtrace, then ACK stop IPI. So when panic CPU proceeds with > flush_on_panic() and emergency_reboot() uart_port->lock is unlocked. Without > the patch we probably declare NMI panic on the CPU which does all the printing > work, and panic sometimes jumps in when that CPU is in busy in > serial8250_console_write(), holding the uart_port->lock. So we can't re-enter > the 8250 driver from panic CPU and we can't reboot the system. In other > words... Steven is a wizard. > > > > serial8250_console_write() > > > { > > > if (port->sysrq) > > > locked = 0; > > > else if (oops_in_progress) > > > locked = spin_trylock_irqsave(&port->lock, flags); > > > else > > > spin_lock_irqsave(&port->lock, flags); > > > > > > ... > > > uart_console_write(port, s, count, serial8250_console_putchar); > > > ... > > > > > > if (locked) > > > spin_unlock_irqrestore(&port->lock, flags); > > > } > > > > > > Now... the problem. A theory, in fact. > > > panic() sets oops_in_progress back to zero - bust_spinlocks(0) - too soon. > > > > I see your point. I am just a bit scared of this way. Ignoring locks > > is a dangerous and painful approach in general. > > Well, I agree. But 8250 is not the only console which does ignore > uart_port lock state sometimes. Otherwise sysrq would be totally unreliable, > including emergency reboot. So it's sort of how it has been for quite some > time, I guess. We are in panic(), it's over, so we probably can ignore > uart_port->lock at this point. > > -ss -- Best, Daniel
Attachment:
smime.p7s
Description: S/MIME Cryptographic Signature