Re: NOHZ tick-stop error: local softirq work is pending, handler #08!!! on Dell XPS 13 9360

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Dear Frederic,


Thank you very much for your help.

Am 11.02.25 um 16:02 schrieb Frederic Weisbecker:
Le Tue, Feb 11, 2025 at 12:57:33PM +0100, Paul Menzel a écrit :

Am 10.02.25 um 14:26 schrieb Frederic Weisbecker:
Le Mon, Feb 10, 2025 at 12:59:42PM +0100, Paul Menzel a écrit :

Am 10.02.25 um 12:45 schrieb Michał Pecio:

On Dell XPS 13 9360/0596KF, BIOS 2.21.0 06/02/2022, with Linux 6.9-rc2+

Just for the record, I am still seeing this with 6.14.0-rc1

Is this a regression? If so, which versions were not affected?

Unfortunately, I do not know. Right now, my logs go back until September
2024.

      Sep 22 13:08:04 abreu kernel: Linux version 6.11.0-07273-g1e7530883cd2 (build@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc (Debian 14.2.0-5) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #12 SMP PREEMPT_DYNAMIC Sun Sep 22 09:57:36 CEST 2024

How hard to reproduce? Wasn't it during resume from hibernation?

It’s not easy to reproduce, and I believe it’s not related with resuming
from hibernation (which I do not use) or ACPI S3 suspend. I think, I can
force it more, when having the USB-C adapter with only the network cable
plugged into it, and then running `sudo powertop --auto-tune`. But sometimes
it seems unrelated.

IRQ isuses may be a red herring, this code here is a busy wait under
spinlock. There are a few of those, they cause various problems.

                  if (xhci_handshake(&xhci->op_regs->status,
                                 STS_RESTORE, 0, 100 * 1000)) {
                           xhci_warn(xhci, "WARN: xHC restore state timeout\n");
			spin_unlock_irq(&xhci->lock);
                           return -ETIMEDOUT;
                  }

This thing timing out may be close to the root cause of everything.

Interesting. Hopefully the USB folks have an idea.

Handler #08 is NET_RX. So something raised the NET_RX on some non-appropriate
place, perhaps...

Can I ask you one more trace dump?

I need:

echo 1 > /sys/kernel/tracing/events/irq/softirq_raise/enable
echo 1 > /sys/kernel/tracing/options/stacktrace

Unfortunately this will also involve a small patch:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fa058510af9c..accd2eb8c927 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1159,6 +1159,9 @@ static bool report_idle_softirq(void)
   	if (local_bh_blocked())
   		return false;
+	trace_printk("STOP\n");
+	trace_dump_stack(0);
+	tracing_off();
   	pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n",
   		pending);
   	ratelimit++;

Thank you for your help. I applied the patch on top of 6.14-rc2, and was
able to reproduce the issue. Please find the Linux messages attached, and
the trace can be downloaded [1].

So here is the offender:

  => __raise_softirq_irqoff
  => __napi_schedule
  => rtl8152_runtime_resume.isra.0
  => rtl8152_resume
  => usb_resume_interface.isra.0
  => usb_resume_both
  => __rpm_callback
  => rpm_callback
  => rpm_resume
  => __pm_runtime_resume
  => usb_autoresume_device
  => usb_remote_wakeup
  => hub_event
  => process_one_work
  => worker_thread
  => kthread
  => ret_from_fork
  => ret_from_fork_asm

It is calling napi_schedule() from a non-interrupt. And since
____napi_schedule() assumes to be called from an interrupt, it
raises the softirq accordingly without waking up ksoftirqd.

Can you try the following fix (untested, sorry...) ?

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 468c73974046..8f6ea4e7685c 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -8537,8 +8537,11 @@ static int rtl8152_runtime_resume(struct r8152 *tp)
  		clear_bit(SELECTIVE_SUSPEND, &tp->flags);
  		smp_mb__after_atomic();
- if (!list_empty(&tp->rx_done))
+		if (!list_empty(&tp->rx_done)) {
+			local_bh_disable();
  			napi_schedule(&tp->napi);
+			local_bh_enable();
+		}
usb_submit_urb(tp->intr_urb, GFP_NOIO);
  	} else {
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 67964dc4db95..1bd730b881f0 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -619,6 +619,17 @@ do {									\
  		     (!in_softirq() || in_irq() || in_nmi()));		\
  } while (0)
+/*
+ * Assert to be either in hardirq or in serving softirq or with
+ * softirqs disabled. Verifies a safe context to queue a softirq
+ * with __raise_softirq_irqoff().
+ */
+#define lockdep_assert_in_interrupt()				\
+do {								\
+	WARN_ON_ONCE(__lockdep_enabled && !in_interrupt());	\
+} while (0)
+
+
  extern void lockdep_assert_in_softirq_func(void);
#else
@@ -634,6 +645,7 @@ extern void lockdep_assert_in_softirq_func(void);
  # define lockdep_assert_preemption_enabled() do { } while (0)
  # define lockdep_assert_preemption_disabled() do { } while (0)
  # define lockdep_assert_in_softirq() do { } while (0)
+# define lockdep_assert_in_interrupt() do { } while (0)
  # define lockdep_assert_in_softirq_func() do { } while (0)
  #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index c0021cbd28fc..80e415ccf2c8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4666,6 +4666,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
  	struct task_struct *thread;
lockdep_assert_irqs_disabled();
+	lockdep_assert_in_interrupt();
if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
  		/* Paired with smp_mb__before_atomic() in

With this diff applied, I wasn’t able to reproduce the issue. Looks promising. Thank you very much.

Tested-by: Paul Menzel <pmenzel@xxxxxxxxxxxxx>


Kind regards,

Paul




[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux