On Thu, 24 Oct 2024 11:28:01 +0800 Lance Yang <ioworker0@xxxxxxxxx> wrote:

> Hi Andrew,
>
> Thanks a lot for paying attention!
>
> On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton
> <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@xxxxxxxxx> wrote:
> >
> > > Hi all,
> > >
> > > This patchset adds a counter, hung_task_detect_count, to track the number of
> > > times hung tasks are detected. This counter provides a straightforward way
> > > to monitor hung task events without manually checking dmesg logs.
> > >
> > > With this counter in place, system issues can be spotted quickly, allowing
> > > admins to step in promptly before system load spikes occur, even if the
> > > hung_task_warnings value has been decreased to 0 well before.
> > >
> > > Recently, we encountered a situation where warnings about hung tasks were
> > > buried in dmesg logs during load spikes. Introducing this counter could
> > > have helped us detect such issues earlier and improve our analysis efficiency.
> > >
> >
> > Isn't the answer to this problem "write a better parser"? I mean,
>
> Yeah, I certainly agree that having a good parser is important, and I'm
> working on that as well ;)
>
> > we're providing userspace with information which is already available.
>
> IHMO, there are two reasons why this counter remains valuable:
>
> 1) It allows us to easily detect hung tasks in time before load spikes occur,
> using simple and common monitoring tools like Prometheus.

But the new sysctl_hung_task_detect_count counter gets incremented a
microsecond before the printk comes out. I don't understand the difference.

> 2) It ensures that we remain aware of hung tasks even when the
> hung_task_warnings value has already been decreased to 0 well before.

That makes sense, I guess. But fleshing this out with a real operational
scenario would help persuade reviewers of the benefit of this change.
So please describe the utility with full details - sell it to us!