Re: [PATCH 0/2] hung_task: add detect count for hung tasks

Lance Yang <ioworker0@xxxxxxxxx> · Thu, 24 Oct 2024 16:48:45 +0800

On Thu, Oct 24, 2024 at 12:28 PM Andrew Morton
<akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, 24 Oct 2024 11:28:01 +0800 Lance Yang <ioworker0@xxxxxxxxx> wrote:
>
> > Hi Andrew,
> >
> > Thanks a lot for paying attention!
> >
> > On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton
> > <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@xxxxxxxxx> wrote:
> > >
> > > > Hi all,
> > > >
> > > > This patchset adds a counter, hung_task_detect_count, to track the number of
> > > > times hung tasks are detected. This counter provides a straightforward way
> > > > to monitor hung task events without manually checking dmesg logs.
> > > >
> > > > With this counter in place, system issues can be spotted quickly, allowing
> > > > admins to step in promptly before system load spikes occur, even if the
> > > > hung_task_warnings value has been decreased to 0 well before.
> > > >
> > > > Recently, we encountered a situation where warnings about hung tasks were
> > > > buried in dmesg logs during load spikes. Introducing this counter could
> > > > have helped us detect such issues earlier and improve our analysis efficiency.
> > > >
> > >
> > > Isn't the answer to this problem "write a better parser"?  I mean,
> >
> > Yeah, I certainly agree that having a good parser is important, and I'm
> > working on that as well ;)
> >
> > > we're providing userspace with information which is already available.
> >
> > IHMO, there are two reasons why this counter remains valuable:
> >
> > 1) It allows us to easily detect hung tasks in time before load spikes occur,
> > using simple and common monitoring tools like Prometheus.
>
> But the new sysctl_hung_task_detect_count counter gets incremented a
> microsecond before the printk comes out.  I don't understand the
> difference.
>
> > 2) It ensures that we remain aware of hung tasks even when the
> > hung_task_warnings value has already been decreased to 0 well before.
>
> That makes sense, I guess.  But fleshing this out with a real
> operational scenario would help persuade reviewers of the benefit of
> this change.
>
> So please describe the utility with full details - sell it to us!

Thanks, the suggestion is very helpful!

IMHO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg, which is less convenient than reading a
counter.

Sometimes, a short-lived issue with the NIC or hard drive can quickly
drive hung_task_warnings down to zero. Once the warnings stop, we have
to log in to the node directly to confirm that no more tasks are hanging
and that the system has recovered. After all, load alone cannot provide
a clear picture.
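
As a rough illustration of the monitoring side, here is a minimal sketch
of a poller that reads the counter and reports how many detections
happened since the previous sample. The procfs path is an assumption
(the existing hung_task_* sysctls live under /proc/sys/kernel/), and the
helper names are hypothetical:

```python
from pathlib import Path

# Assumed location: the existing hung_task_* sysctls live under
# /proc/sys/kernel/, so the new counter is expected there as well.
COUNTER_PATH = Path("/proc/sys/kernel/hung_task_detect_count")


def parse_count(text: str) -> int:
    """Parse the sysctl file contents (a decimal number plus newline)."""
    return int(text.strip())


def read_count(path: Path = COUNTER_PATH) -> int:
    """Read the current detection count from procfs."""
    return parse_count(path.read_text())


def new_detections(prev: int, curr: int) -> int:
    """Detections since the previous sample.

    A smaller current value is treated as a counter reset (e.g. a reboot),
    in which case everything counted so far is new.
    """
    return curr - prev if curr >= prev else curr
```

A nonzero result from new_detections() means tasks are still being
reported as hung, independent of whether hung_task_warnings has already
reached zero.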

Once this counter is in place, we plan to lower hung_task_timeout_secs on
high-density deployments to improve stability, even though this might
produce false positives. We can then set a time-based threshold: if hung
tasks persist beyond that duration, we will automatically migrate
containers to other nodes. Based on past experience, this approach
could help avoid many production disruptions.
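
The time-based threshold could look roughly like the sketch below. It is
only one possible policy, not something the patchset provides: the class
name, threshold_secs and quiet_secs are all hypothetical, and it assumes
the caller feeds in periodic samples of the counter. quiet_secs should
be longer than the interval at which the kernel checks for hung tasks,
otherwise an ongoing hang would look like a recovery.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HungTaskEpisode:
    """Tracks how long the detection counter has kept increasing."""

    threshold_secs: float  # hypothetical: how long hangs may persist before acting
    quiet_secs: float      # hypothetical: no new detections for this long ends the episode
    last_count: Optional[int] = None
    started_at: Optional[float] = None
    last_increase_at: Optional[float] = None

    def observe(self, now: float, count: int) -> bool:
        """Feed one counter sample; return True if migration should be triggered."""
        # The first sample only establishes a baseline.
        increased = self.last_count is not None and count > self.last_count
        self.last_count = count
        if increased:
            if self.started_at is None:
                self.started_at = now  # a new episode begins
            self.last_increase_at = now
        elif (self.started_at is not None
              and now - self.last_increase_at >= self.quiet_secs):
            # No new detections for a while: treat the node as recovered.
            self.started_at = None
            self.last_increase_at = None
        return (self.started_at is not None
                and now - self.started_at >= self.threshold_secs)
```

With this shape, a short-lived hiccup that stops producing detections
within quiet_secs never reaches the threshold, so false positives from a
lower hung_task_timeout_secs do not trigger a migration by themselves.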

Moreover, other important events such as OOM already have counters, so
a dedicated counter for hung tasks makes sense as well ;)

Thanks,
Lance