On 6/26/20 9:50 AM, Steve McIntyre wrote:
> Hi Jann,
>
> On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
>> On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
>
> ...
>
>>>> Considering I'm running strace build tests to provoke this bug,
>>>> finding the failure in a commit talking about ptrace changes does look
>>>> very suspicious...!
>>>>
>>>> Annoyingly, I can't reproduce this on my disparate other machines
>>>> here, suggesting it's maybe(?) timing related.
>>
>> Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
>> prints a warning to dmesg? If so, can you share that warning?
>
> I mean the machine locks hard - X stops updating, the mouse/keyboard
> stop responding. No pings, etc. When I reboot, there's nothing in the
> logs.
>
>> If you don't have any way to see console output, and you don't have a
>> working serial console setup or such, you may want to try re-running
>> those tests while the kernel is booted with netconsole enabled to log
>> to a different machine over UDP (see
>> https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
>
> ACK, will try that now for you.
>
>> You may want to try setting the sysctl kernel.sysrq=1 , then when the
>> system has locked up, press ALT+PRINT+L (to generate stack traces for
>> all active CPUs from NMI context), and maybe also ALT+PRINT+T and
>> ALT+PRINT+W (to collect more information about active tasks).
>
> Nod.
>
>> (If you share stack traces from these things with us, it would be
>> helpful if you could run them through scripts/decode_stacktrace.pl
>> from the kernel tree first, to add line number information.)
>
> ACK.
>
>> Trying to isolate the problem:
>>
>> __end_current_label_crit_section and end_current_label_crit_section
>> are aliases of each other (via #define), so that line change can't
>> have done anything.
>>
>> That leaves two possibilities AFAICS:
>> - the might_sleep() call by itself is causing issues for one of the
>>   remaining users of begin_current_label_crit_section() (because it
>>   causes preemption to happen more often where it didn't happen on
>>   PREEMPT_VOLUNTARY before, or because it's trying to print a warning
>>   message with the runqueue lock held, or something like that)
>> - the lack of "if (aa_replace_current_label(label) == 0)
>>   aa_put_label(label);" in __begin_current_label_crit_section() is
>>   somehow causing issues
>>
>> You could try to see whether just adding the might_sleep(), or just
>> replacing the begin_current_label_crit_section() call with
>> __begin_current_label_crit_section(), triggers the bug.
>>
>>
>> If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if
>> that isn't already set in your kernel config -, that might help track
>> down the problem, unless it magically makes the problem stop
>> triggering (which I guess would be conceivable if this indeed is a
>> race).
>
> OK, will try that second...
>

I have not been able to reproduce this, but looking at linux-4.19.y it
looks like

  1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check

was picked without

  ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check

Both of them are marked as

  Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")

so I would expect them to be picked together.

ptraceme is potentially updating the task's cred while the access check
is running.

Try building after picking

  ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
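
For reference, a rough sketch of how one might check for the missing pick and apply it. This assumes a linux-stable checkout that also has the mainline objects fetched, so that ca3fde5214e1 resolves locally. The helper name has_backport is made up for illustration, and the check is only a heuristic: it greps commit messages for the upstream id, which stable backports usually mention.

```shell
#!/bin/sh
# Hypothetical helper (not part of the kernel tree): succeed if some commit
# reachable from ref $2 mentions upstream commit id $1 in its message.
# Stable backports normally carry a "commit <sha> upstream." line, so a
# match suggests (but does not prove) that the fix was already picked.
has_backport() {
    [ -n "$(git log -1 --format=%h --grep="$1" "$2" 2>/dev/null)" ]
}

# Example use inside the checkout (assumed layout, shown as comments so the
# script does nothing when run outside a kernel tree):
#   git checkout linux-4.19.y
#   has_backport ca3fde5214e1 HEAD || git cherry-pick -x ca3fde5214e1
```

The cherry-pick may not apply cleanly on 4.19.y; if it conflicts, the conflict needs resolving by hand before rebuilding.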