Re: [PATCH 1/2] hung_task: Show the blocker task if the task is hung on mutex

On 2/19/25 11:23 AM, Steven Rostedt wrote:
On Wed, 19 Feb 2025 22:00:49 +0900
"Masami Hiramatsu (Google)" <mhiramat@xxxxxxxxxx> wrote:

From: Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>

The "hung_task" detector reports a task that has slept uninterruptibly
for a long time, but most often that task is blocked on a mutex held by
another task. Without dumping that other task, investigating the root
cause of the hung task is very difficult.

Fortunately CONFIG_DEBUG_MUTEXES=y allows us to identify the mutex
blocking the task. And the mutex has "owner" information, which can
be used to find the owner task and dump it with hung tasks.

With this change, the hung task shows blocker task's info like below;

We've hit bugs like this in the field a few times, and it was very
difficult to debug. Something like this would have made our lives much
easier!

I agree that it will be a useful feature.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
---
  kernel/hung_task.c           |   38 ++++++++++++++++++++++++++++++++++++++
  kernel/locking/mutex-debug.c |    1 +
  kernel/locking/mutex.c       |    9 +++++++++
  kernel/locking/mutex.h       |    6 ++++++
  4 files changed, 54 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 04efa7a6e69b..d1ce69504090 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -25,6 +25,8 @@
  #include <trace/events/sched.h>
+#include "locking/mutex.h"
+
  /*
   * The number of tasks checked:
   */
@@ -93,6 +95,41 @@ static struct notifier_block panic_block = {
  	.notifier_call = hung_task_panic,
  };
+
+#ifdef CONFIG_DEBUG_MUTEXES
+static void debug_show_blocker(struct task_struct *task)
+{
+	struct task_struct *g, *t;
+	unsigned long owner;
+	struct mutex *lock;
+
+	if (!task->blocked_on)
+		return;
+
+	lock = task->blocked_on->mutex;

This is a catch-22. To look at the task's blocked_on, we need
lock->wait_lock held, otherwise this can race. But to get that lock, we
need to look at the task's blocked_on field!

Another thing is that the waiter is on the task's stack. Perhaps we need to
move this into sched/core.c and be able to lock the task's rq. Because even
something like:

	waiter = READ_ONCE(task->blocked_on);

May be garbage if the task were to suddenly wake up and run.

Now if we were able to lock the task's rq, which would prevent it from
being woken up, then the blocked_on field would not be at risk of being
corrupted.

It is tricky to access the mutex_waiter structure, which is allocated on
the stack. So another way to work around this issue is to add a new
blocked_on_mutex field in task_struct that points directly to the
relevant mutex. Yes, that increases the size of task_struct by 8 bytes,
but it is a pretty large structure anyway. By using
READ_ONCE()/WRITE_ONCE() to access this field, we don't need to take a
lock, though taking the wait_lock may still be needed to examine other
information inside the mutex.
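
To make the idea concrete, here is a minimal userspace sketch, not kernel
code. It assumes a hypothetical blocked_on_mutex field and helper names
(task_set_blocked_on, task_clear_blocked_on, find_blocker) that are not
in the patch. It models READ_ONCE()/WRITE_ONCE() with relaxed C11
atomics and simplifies the mutex owner to a plain task pointer, whereas
the kernel's owner word also carries flag bits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace model only: these structs stand in for the kernel's. */
struct task;

struct mutex {
	/* Simplified: a bare pointer to the owning task, no flag bits. */
	_Atomic(struct task *) owner;
};

struct task {
	const char *comm;
	/* Hypothetical new field: the mutex this task is sleeping on. */
	_Atomic(struct mutex *) blocked_on_mutex;
};

/* Waiter side: publish the mutex before sleeping (models WRITE_ONCE). */
static void task_set_blocked_on(struct task *t, struct mutex *lock)
{
	atomic_store_explicit(&t->blocked_on_mutex, lock,
			      memory_order_relaxed);
}

/* Waiter side: clear the field once the task no longer waits on @lock. */
static void task_clear_blocked_on(struct task *t, struct mutex *lock)
{
	/* Sanity check: the task must have been waiting on this mutex. */
	assert(atomic_load_explicit(&t->blocked_on_mutex,
				    memory_order_relaxed) == lock);
	atomic_store_explicit(&t->blocked_on_mutex, NULL,
			      memory_order_relaxed);
}

/*
 * Detector side: take one lockless snapshot of the field (models
 * READ_ONCE) and return the owner of that mutex, or NULL if the task
 * is not blocked on a mutex or the mutex has no owner.
 */
static struct task *find_blocker(struct task *t)
{
	struct mutex *lock = atomic_load_explicit(&t->blocked_on_mutex,
						  memory_order_relaxed);

	if (!lock)
		return NULL;
	return atomic_load_explicit(&lock->owner, memory_order_relaxed);
}
```

The hung task detector would call something like find_blocker() on each
hung task without taking wait_lock. The sketch does not cover the
lifetime of the mutex or of the owner task; both would still need
protection before they are dereferenced.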

Cheers,
Longman




