Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

Jann Horn <jannh@xxxxxxxxxx> · Fri, 26 Jun 2020 20:27:22 +0200

On Fri, Jun 26, 2020 at 7:52 PM Steve McIntyre <steve@xxxxxxxxxx> wrote:
> On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote:
> >On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
> >>On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> >>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
> >
> >...
> >
> >>> > Considering I'm running strace build tests to provoke this bug,
> >>> > finding the failure in a commit talking about ptrace changes does look
> >>> > very suspicious...!
> >>> >
> >>> > Annoyingly, I can't reproduce this on my disparate other machines
> >>> > here, suggesting it's maybe(?) timing related.
> >>
> >>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
> >>prints a warning to dmesg? If so, can you share that warning?
> >
> >I mean the machine locks hard - X stops updating, the mouse/keyboard
> >stop responding. No pings, etc. When I reboot, there's nothing in the
> >logs.
> >
> >>If you don't have any way to see console output, and you don't have a
> >>working serial console setup or such, you may want to try re-running
> >>those tests while the kernel is booted with netconsole enabled to log
> >>to a different machine over UDP (see
> >>https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
> >
> >ACK, will try that now for you.
> >
> >>You may want to try setting the sysctl kernel.sysrq=1 , then when the
> >>system has locked up, press ALT+PRINT+L (to generate stack traces for
> >>all active CPUs from NMI context), and maybe also ALT+PRINT+T and
> >>ALT+PRINT+W (to collect more information about active tasks).
> >
> >Nod.
> >
> >>(If you share stack traces from these things with us, it would be
> >>helpful if you could run them through scripts/decode_stacktrace.sh
> >>from the kernel tree first, to add line number information.)
> >
> >ACK.
>
> Output passed through scripts/decode_stacktrace.sh attached.
>
> Just about to try John's suggestion next.

Okay, so this is some sort of deadlock...

Looking at the NMI backtraces, all the CPUs are blocked on spinlocks:
CPU 3 is blocked on current->sighand->siglock, in tty_open_proc_set_tty()
CPU 1 is blocked on... I'm not sure which lock, somewhere in do_wait()
CPU 2 is blocked on something, somewhere in ptrace_stop()
CPU 0 is stuck on a lock in do_exit()

So I think it's probably something like a classic deadlock, or a
sleeping-in-atomic issue, or a lock-balancing issue (or memory
corruption, which can cause all kinds of weird errors).
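
To illustrate what "classic deadlock" usually means here: two code
paths take the same pair of locks in opposite order (ABBA). The sketch
below is plain Python, not kernel code, and not how the kernel's lock
validator is implemented; it only shows the general idea of recording
"held -> acquired" orderings and looking for a cycle. The lock names
lock_a and lock_b are hypothetical placeholders and are not taken from
the backtraces.

```python
from collections import defaultdict


class LockOrderGraph:
    """Records 'held -> acquired' edges and reports ordering cycles."""

    def __init__(self):
        self.edges = defaultdict(set)

    def acquire(self, held, new):
        """Note that `new` was taken while every lock in `held` was held."""
        for h in held:
            self.edges[h].add(new)

    def find_cycle(self):
        """Return one ordering cycle as a list of lock names, or None."""
        WHITE, GREY, BLACK = 0, 1, 2
        colour = defaultdict(int)   # every node starts WHITE (unvisited)
        path = []                   # current depth-first search path

        def visit(node):
            colour[node] = GREY
            path.append(node)
            for nxt in sorted(self.edges.get(node, ())):
                if colour[nxt] == GREY:
                    # Back edge: nxt is already on the current path.
                    return path[path.index(nxt):] + [nxt]
                if colour[nxt] == WHITE:
                    found = visit(nxt)
                    if found:
                        return found
            path.pop()
            colour[node] = BLACK
            return None

        for node in sorted(self.edges):
            if colour[node] == WHITE:
                found = visit(node)
                if found:
                    return found
        return None


# Hypothetical example: one path takes A then B, another takes B then A.
graph = LockOrderGraph()
graph.acquire(held=["lock_a"], new="lock_b")
graph.acquire(held=["lock_b"], new="lock_a")
cycle = graph.find_cycle()
print(cycle)  # prints ['lock_a', 'lock_b', 'lock_a']
```

A cycle in this graph means the two orderings can deadlock if they
race, even if no run has actually hung yet.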

If it really is a classic deadlock, CONFIG_PROVE_LOCKING=y should be
able to pinpoint the issue.
If it is a sleeping-in-atomic issue, CONFIG_DEBUG_ATOMIC_SLEEP=y should help.
If it is memory corruption, CONFIG_KASAN=y should discover it... but
KASAN's overhead might change the timing a lot, so if this really is a
race, the bug might stop reproducing with it enabled.

Maybe flip all of those on, and if it doesn't reproduce anymore, turn
off CONFIG_KASAN and try again?
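
Before re-running the tests it can be worth confirming that the
rebuilt kernel's config really has these options on. A minimal Python
sketch, assuming the usual .config text format ("CONFIG_FOO=y" for
built-in options, "# CONFIG_FOO is not set" for disabled ones); the
function name enabled_options and the sample text are made up for
illustration and are not anyone's actual config.

```python
import re

# The three debug options mentioned above.
WANTED = ("CONFIG_PROVE_LOCKING", "CONFIG_DEBUG_ATOMIC_SLEEP", "CONFIG_KASAN")


def enabled_options(config_text, wanted=WANTED):
    """Map each wanted option to True if the text has a line 'OPTION=y'."""
    state = {name: False for name in wanted}
    for line in config_text.splitlines():
        m = re.fullmatch(r"(CONFIG_\w+)=y", line.strip())
        if m and m.group(1) in state:
            state[m.group(1)] = True
    return state


# Illustrative sample only: the second attempt, with KASAN turned off.
sample = """\
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_KASAN is not set
"""

status = enabled_options(sample)
for name, on in status.items():
    print(name, "enabled" if on else "NOT enabled")
```

To check a real build, pass in the contents of the build tree's
.config instead of the sample string.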