lgoncalv@xxxxxxxxxx, jwyatt@xxxxxxxxxx, aubaker@xxxxxxxxxx
Bcc:
Subject: watchdog BUG: soft lockup - CPU#x stuck for 78s
Reply-To:
Hi,
The realtime team at Red Hat has recently backported the latest
printk changes
present in 6.6-rt stable (HEAD at 20fd4439f644 printk: nbcon: move
locked_port flag to
struct uart_port) to CentOS Stream 9 for performance improvements and
printk-related bugfixes.
Since merging this said code, we've hit an interesting bug during
testing,
specifically, on larger systems, a softlockup may be reported by the
watchdog
when there is a heavy amount of printing to tty devices (whether it
be through
printk, /dev/kmsg, etc).
We have a modicum of reasonable suspicion to believe that
nbcon_reacquire, or
some other nbcon mechanism *may* be causing such behavior.
Since we've succesfully reproduced this in the Kernel-ARK/Fedora-ELN
(osbuild-rt), and linux-rt-devel 6.10.rc4-rt6, we are reporting this
bug
upstream.
Anyway, here is a more in-depth description, along with some call
traces.
Description:
On x86 systems with a large amount of logical cores (nproc ~> 60), a
softlockup can
be observed with accompanying call trace when a large amount of
"printing"
activity is taking place.
As shown in the call traces appended below, during some kind of numa
balancing/numa_migration after a task_numa_fault --where a set of
processess are being migrated/swapped
between two CPUs-- there is a busy thread that is being waited on
(in the order
of seconds), causing a livelock. Additional investigation of
collected vmcores
by toggling panic on softlockup shows that the waiting thread may be
waiting for
a thread looping with nbcon_reacquire.
I suspect that some logic within nbcon_context_try_acquire may be a
good place
to start. My understanding of the code becomes a bit fuzzy here, so
apologies
in advance for any erroneous statements. As I see it, there may be
something
happening during migration (or under heavy load) in which
nbcon_reacquire() is in
a non-migratable or non-preemtible state as it is attempting to
regain access to
a lost/taken console. It could very well be a situation in which
context
was forcefully taken from the printing thread.
Alternatively, Andrew Halaney <ahalaney@xxxxxxxxxx> suspects that it
is the loop within
nbcon_kthread_func() -- since there is nothing that would yield the
task in said
loop (cant_migrate()), the migrate code would be essentially waiting
forever for
the aforementioned loop to "finish". I believe in PREEMPT_RT, there
would be a
preemption point here. Furthermore, in his investigation, there were
signs that the
loop was just iterating up until the crash, leaving reason to
believe that task would be the culprit.
In fact, with the following diff, we noticed this output:
```
ahalaney@x1gen2nano ~/git/linux-rt-devel
(git)-[tags/v6.10-rc4-rt6-rebase] % git diff | cat
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index bb9689f94d30..d716b72bf2f8 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -1075,6 +1075,7 @@ static int nbcon_kthread_func(void *__console)
*
* This pairs with rcuwait_has_sleeper:A and
nbcon_kthread_wake:A