On 7/25/19 11:38 AM, Tetsuo Handa wrote:
> On 2019/07/25 2:02, Dmitry Safonov wrote:
>> Hung task detector has one timeout and has two associated actions on it:
>> - issuing warnings with names and stacks of blocked tasks
>> - panic()
>>
>> We want switches to panic (and reboot) if there's a task
>> in uninterruptible sleep for some minutes - at that moment something
>> ugly has happened and the box needs a reboot.
>> But we also want to detect conditions that are "out of range"
>> or approaching the point of failure. Under such conditions we want
>> to issue an "early warning" of an impending failure, minutes before
>> the switch is going to panic.
>
> Can't we do it by extending sysctl_hung_task_panic to accept values larger
> than 1, and decrease by one when at least one thread was reported by each
> check_hung_uninterruptible_tasks() check, and call panic() when
> sysctl_hung_task_panic reached to 0 (or maybe 1 is simpler) ?
>
> Hmm, might have the same problem regarding how/when to reset the counter.
> If some userspace process can reset the counter, such process can trigger
> SysRq-c when some period expired...

Yes, and current distributions already use the existing counter to print
warnings a limited number of times and then silently ignore further hangs.
E.g., on my Arch Linux setup:
hung_task_warnings:10

>> It seems rather easy to add printing tasks and their stacks for
>> notification and debugging purposes into hung task detector without
>> complicating the code or major cost (prints are with KERN_INFO loglevel
>> and so don't go on console, only into dmesg log).
>
> Well, I don't think so. Might be noisy for systems without "quiet" kernel
> command line option, and we can't pass KERN_DEBUG to e.g. sched_show_task()...

Yes, that's why it's disabled by default (=0).
I tend to agree that printing with KERN_DEBUG may be better, but from my
point of view this patch alone isn't enough justification for patching
sched_show_task() and show_stack().

Thanks,
          Dmitry
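
For illustration only, the countdown idea quoted above could be modelled in
user space roughly as below. This is a sketch under assumptions, not kernel
code and not part of the patch: struct hung_task_model and model_check() are
hypothetical names, the "panic when the value is 1" variant is picked
arbitrarily from the two options mentioned, and 0 is assumed to keep its
current meaning of "never panic". It also shows the open question: nothing
in this model ever resets the counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; the real detector lives in the kernel. */
struct hung_task_model {
	/* Stands in for an extended sysctl_hung_task_panic (assumption). */
	int panic_countdown;
};

/*
 * Model one periodic check that found `hung_tasks_found` blocked tasks.
 * Returns true when this check would escalate to panic().
 */
static bool model_check(struct hung_task_model *m, int hung_tasks_found)
{
	assert(m->panic_countdown >= 0);

	/* 0 keeps today's meaning: panic is disabled. */
	if (m->panic_countdown == 0)
		return false;

	/* Nothing was reported by this check, so the counter is untouched. */
	if (hung_tasks_found <= 0)
		return false;

	/* Last allowed report: 1 behaves like the existing boolean setting. */
	if (m->panic_countdown == 1)
		return true;

	/* Earlier reports only warn and consume one unit of the budget. */
	m->panic_countdown--;
	return false;
}
```

With the value set to 3, the first two checks that report a hung task would
only warn, and the third would panic. A check that reports nothing leaves the
value where it is, so how and when to reset it remains undecided here.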