On Fri, Jun 26, 2020 at 7:52 PM Steve McIntyre <steve@xxxxxxxxxx> wrote:
> On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote:
> >On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
> >>On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> >>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
> >
> >...
> >
> >>> > Considering I'm running strace build tests to provoke this bug,
> >>> > finding the failure in a commit talking about ptrace changes does look
> >>> > very suspicious...!
> >>> >
> >>> > Annoyingly, I can't reproduce this on my disparate other machines
> >>> > here, suggesting it's maybe(?) timing related.
> >>
> >>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
> >>prints a warning to dmesg? If so, can you share that warning?
> >
> >I mean the machine locks hard - X stops updating, the mouse/keyboard
> >stop responding. No pings, etc. When I reboot, there's nothing in the
> >logs.
> >
> >>If you don't have any way to see console output, and you don't have a
> >>working serial console setup or such, you may want to try re-running
> >>those tests while the kernel is booted with netconsole enabled to log
> >>to a different machine over UDP (see
> >>https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
> >
> >ACK, will try that now for you.
> >
> >>You may want to try setting the sysctl kernel.sysrq=1 , then when the
> >>system has locked up, press ALT+PRINT+L (to generate stack traces for
> >>all active CPUs from NMI context), and maybe also ALT+PRINT+T and
> >>ALT+PRINT+W (to collect more information about active tasks).
> >
> >Nod.
> >
> >>(If you share stack traces from these things with us, it would be
> >>helpful if you could run them through scripts/decode_stacktrace.pl
> >>from the kernel tree first, to add line number information.)
> >
> >ACK.
>
> Output passed through scripts/decode_stacktrace.sh attached.
>
> Just about to try John's suggestion next.

Okay, so this is some sort of deadlock...

Looking at the NMI backtraces, all the CPUs are blocked on spinlocks:

CPU 3 is blocked on current->sighand->siglock, in tty_open_proc_set_tty()
CPU 1 is blocked on... I'm not sure which lock, somewhere in do_wait()
CPU 2 is blocked on something, somewhere in ptrace_stop()
CPU 0 is stuck on a lock in do_exit()

So I think it's probably something like a classic deadlock, or a
sleeping-in-atomic issue, or a lock-balancing issue (or memory
corruption, that can cause all kinds of weird errors)?

If it really is a classic deadlock, CONFIG_PROVE_LOCKING=y should be
able to pinpoint the issue. If it is a sleeping-in-atomic issue,
CONFIG_DEBUG_ATOMIC_SLEEP=y should help. If it is memory corruption,
CONFIG_KASAN=y should discover it... but that might majorly mess up
the timing, so if this really is a race, that might not work.

Maybe flip all of those on, and if it doesn't reproduce anymore, turn
off CONFIG_KASAN and try again?
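
For reference, the three options named above could be collected in a
small config fragment along these lines. This is only a sketch and not
a tested configuration: the file name debug.config is a placeholder,
and each option has its own Kconfig dependencies (CONFIG_KASAN, for
instance, needs architecture support), so some of them may be dropped
silently when the config is regenerated.

```
# debug.config (hypothetical file name)
# Lock dependency checking, for the classic-deadlock theory.
CONFIG_PROVE_LOCKING=y
# Warn when code sleeps in atomic context.
CONFIG_DEBUG_ATOMIC_SLEEP=y
# Memory corruption detection; may change timing enough to hide a race.
# Comment this line out for the second attempt if the hang stops reproducing.
CONFIG_KASAN=y
```

One way to apply such a fragment, assuming an existing .config in the
kernel tree, is scripts/kconfig/merge_config.sh .config debug.config,
and then to check the resulting .config to confirm that all three
options actually ended up enabled.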