On Thu, Jan 31, 2019 at 12:26:48AM +0000, Tom Putzeys wrote:
> Hi all,
>
> I am trying to debug a series of random system freezes / lockups by making use of the kernel lockup detector / NMI watchdog to trigger a kernel crash dump when a system lockup occurs.
>
> We are running the 4.14.93-rt kernel on a quad-core x86_64 Intel Atom SMP machine.
>
> The lockup detector is fully enabled and configured to trigger a panic when a hard or soft lockup occurs:
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
> CONFIG_LOCKUP_DETECTOR=y
> CONFIG_SOFTLOCKUP_DETECTOR=y
> CONFIG_HARDLOCKUP_DETECTOR_PERF=y
> CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
> CONFIG_HARDLOCKUP_DETECTOR=y
> CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
> CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1
> CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
> CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1
>
> I also set the correct sysctl variables:
> kernel.panic = 1
> kernel.panic_on_oops = 1
> kernel.unknown_nmi_panic = 1
> kernel.panic_on_unrecovered_nmi = 1
> kernel.panic_on_io_nmi = 1
> kernel.softlockup_panic = 1
> kernel.hung_task_panic = 1
>
> I also enabled the NMI watchdog via the kernel cmdline (nmi_watchdog=1).
>
> I configured my system to generate a kernel crash dump using kdump /
> kexec when a panic occurs. When I trigger a manual kernel panic via
> /proc/sysrq-trigger, the crash dump mechanism works perfectly. I see a
> switch to my dump-capture kernel and ramdisk.
>
> The problem: when a real-life lockup or system freeze occurs, the system
> just reboots without generating a crash dump. There is no switch to the
> dump-capture kernel. AFAIK, there is no panic. I find nothing in the logs
> and nothing appears on the console.
>
> To replicate the problem: I wrote a small program that runs an infinite
> nop while loop. When running this program on all 4 cores with max.
> real-time priority (SCHED_FIFO) to hog the CPU, I get a complete system
> lockup (no keyboard input, no serial console, no ping reply).
> This freeze then triggers a reboot (I guess when the watchdog kicks in) but no
> crash dump or no visible kernel panic.
>
> I find it strange that the RT throttling mechanism does not prevent a
> freeze in this case (we did not disable it), but apart from that, I guess
> my hog application should be detected as a hung task and cause a panic.

With regards to RT throttling, you probably need to run this command line:

    # echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features

In short, a CPU about to hit the point where the RT throttling mechanism
would kick in can borrow "RT time" from other CPUs that are not running
RT tasks. That is the default behavior and the command line above disables
that feature.

You could also, for testing purposes, increase the slice of CPU time
reserved for non-rt tasks. You could try 10% instead of the usual 5%:

    # echo 1000000 > /proc/sys/kernel/sched_rt_period_us
    # echo 900000 > /proc/sys/kernel/sched_rt_runtime_us

Luis
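
For reference, the arithmetic behind the 5% and 10% figures can be sketched
as below. This is only an illustration: non_rt_percent is a hypothetical
helper name, not an existing tool, and it assumes the share left for non-rt
tasks is simply (period - runtime) / period, with a runtime of -1 meaning
that throttling is disabled.

```shell
#!/bin/sh
# Hypothetical helper (not part of the kernel or any package): prints the
# percentage of each RT period that remains available to non-rt tasks.
# Arguments: sched_rt_period_us sched_rt_runtime_us
non_rt_percent() {
    period_us=$1
    runtime_us=$2
    # A negative runtime (-1) disables RT throttling, so nothing is reserved.
    if [ "$runtime_us" -lt 0 ]; then
        echo 0
        return
    fi
    # Integer arithmetic: share of the period not granted to RT tasks.
    echo $(( (period_us - runtime_us) * 100 / period_us ))
}

non_rt_percent 1000000 950000   # the usual defaults
non_rt_percent 1000000 900000   # the values suggested above
```

On a live system the two arguments could be taken from
/proc/sys/kernel/sched_rt_period_us and /proc/sys/kernel/sched_rt_runtime_us
to check what is currently in effect before and after changing them.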