On Thu, 27 Sep 2018 12:46:01 -0700
Daniel Wang <wonderfly@xxxxxxxxxx> wrote:

> Prior to this change, the combination of `softlockup_panic=1` and
> `softlockup_all_cpu_stacktrace=1` may result in a deadlock when the reboot
> path tries to grab the console lock that is held by the stack-trace printing
> path. What seems to be happening is that while there are multiple CPUs, only
> one of them is tasked with printing the backtraces of all CPUs. On a machine
> with many CPUs and a slow serial console (on Google Compute Engine, for
> example), the stack-trace printing routine hits a timeout and the reboot path
> kicks in. The latter then tries to print something else, but can't get the
> lock because it is still held by the earlier printing path. This is easily
> reproducible on a VM with 16+ vCPUs on Google Compute Engine, which is a very
> common configuration.
>
> A quick repro is available at
> https://github.com/wonderfly/printk-deadlock-repro. The system hangs 3
> seconds into executing repro.sh. Both the deadlock analysis and the repro are
> credited to Peter Feiner.
>
> Note that I have read the previous discussions on backporting this to stable
> [1]. The argument against the backport was that this is a non-trivial fix and
> is meant to prevent hypothetical soft lockups. What we are hitting, however,
> is a real deadlock, in production. Hence this request.
>
> [1] https://lore.kernel.org/lkml/20180409081535.dq7p5bfnpvd3xk3t@xxxxxxxxxxxxxxx/T/#u
>
> Serial console logs leading up to the deadlock follow. As can be seen, the
> stack trace was incomplete because the printing path hit a timeout.

I'm fine with having this backported.

-- Steve
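
For readers unfamiliar with the mechanism, here is a minimal userspace model of
the lock ordering described above. It is not kernel code: a pthread mutex stands
in for the console lock, and the thread names (backtrace_printer, reboot_path)
are hypothetical stand-ins for the two kernel paths involved.

/*
 * Userspace model of the deadlock described above; not kernel code.
 * One thread plays the all-CPU backtrace-printing path, which holds the
 * "console lock" while printing slowly; the other plays the
 * softlockup_panic reboot path, which tries to print and blocks forever.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;

static void *backtrace_printer(void *arg)
{
	pthread_mutex_lock(&console_lock);	/* grabbed for the all-CPU dump */
	for (;;) {				/* slow serial console: never finishes in time */
		printf("dumping backtrace...\n");
		sleep(1);
	}
	return NULL;				/* unlock is never reached */
}

static void *reboot_path(void *arg)
{
	sleep(3);				/* panic timeout fires while the dump is running */
	pthread_mutex_lock(&console_lock);	/* deadlock: lock is still held above */
	printf("rebooting...\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, backtrace_printer, NULL);
	pthread_create(&b, NULL, reboot_path, NULL);
	pthread_join(b, NULL);			/* never returns: the model hangs here */
	return 0;
}

Built with `gcc -pthread`, the program keeps printing for about three seconds
and then hangs, analogous to the behavior repro.sh triggers on a real VM.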