Hi Jann, On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote: >On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote: >> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: ... >> > Considering I'm running strace build tests to provoke this bug, >> > finding the failure in a commit talking about ptrace changes does look >> > very suspicious...! >> > >> > Annoyingly, I can't reproduce this on my disparate other machines >> > here, suggesting it's maybe(?) timing related. > >Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure >prints a warning to dmesg? If so, can you share that warning? I mean the machine locks hard - X stops updating, the mouse/keyboard stop responding. No pings, etc. When I reboot, there's nothing in the logs. >If you don't have any way to see console output, and you don't have a >working serial console setup or such, you may want to try re-running >those tests while the kernel is booted with netconsole enabled to log >to a different machine over UDP (see >https://www.kernel.org/doc/Documentation/networking/netconsole.txt). ACK, will try that now for you. >You may want to try setting the sysctl kernel.sysrq=1 , then when the >system has locked up, press ALT+PRINT+L (to generate stack traces for >all active CPUs from NMI context), and maybe also ALT+PRINT+T and >ALT+PRINT+W (to collect more information about active tasks). Nod. >(If you share stack traces from these things with us, it would be >helpful if you could run them through scripts/decode_stacktrace.pl >from the kernel tree first, to add line number information.) ACK. >Trying to isolate the problem: > >__end_current_label_crit_section and end_current_label_crit_section >are aliases of each other (via #define), so that line change can't >have done anything. > >That leaves two possibilities AFAICS: > - the might_sleep() call by itself is causing issues for one of the >remaining users of begin_current_label_crit_section() (because it >causes preemption to happen more often where it didn't happen on >PREEMPT_VOLUNTARY before, or because it's trying to print a warning >message with the runqueue lock held, or something like that) > - the lack of "if (aa_replace_current_label(label) == 0) >aa_put_label(label);" in __begin_current_label_crit_section() is >somehow causing issues > >You could try to see whether just adding the might_sleep(), or just >replacing the begin_current_label_crit_section() call with >__begin_current_label_crit_section(), triggers the bug. > > >If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if >that isn't already set in your kernel config -, that might help track >down the problem, unless it magically makes the problem stop >triggering (which I guess would be conceivable if this indeed is a >race). OK, will try that second... -- Steve McIntyre, Cambridge, UK. steve@xxxxxxxxxx "I can't ever sleep on planes ... call it irrational if you like, but I'm afraid I'll miss my stop" -- Vivek Das Mohapatra