Hey Song,

I'll answer that bit as those were my hacks :P

I _thought_ that doing ftrace_dump_on_oops + softlockup_panic on the command
line (or similar at runtime) would dump the logs, but that wasn't working for
me and I was in a bit of a rush (and already hacking the kernel up anyway).
So I ended up doing a sequence like tracing_off(); ftrace_dump(DUMP_ALL); in
the softlockup code when it was detected (a rough sketch is appended at the
bottom of this mail).

Ideally I wanted to look at the vmcore and inspect the ftrace data in there
(since debugging printk by using printk is a little confusing), but there was
a makedumpfile bug I hit... so I went with the hacky route to prove to myself
what was going on. I think that has since been resolved.

Hope that helps!

Thanks,
Andrew

On Tue, Jun 25, 2024 at 09:36:54AM GMT, Song Chen wrote:
> Hi Derek,
>
> I am working on a hung task case. I saw your trace_printk logs go to the
> same output as the softlockup warning; it's a smart way to easily build a
> timeline for analysis.
>
> As far as I know, trace_printk goes to the ftrace ring buffer and
> softlockup info goes to dmesg. Could you please let me know how you did
> that? I marked the part I'm interested in below. Thanks a lot.
>
> BR
>
> Song
>
>
> On 2024/6/18 23:20, Derek Barbosa wrote:
> > lgoncalv@xxxxxxxxxx, jwyatt@xxxxxxxxxx, aubaker@xxxxxxxxxx
> > Bcc:
> > Subject: watchdog BUG: soft lockup - CPU#x stuck for 78s
> > Reply-To:
> >
> > Hi,
> >
> > The realtime team at Red Hat has recently backported the latest printk
> > changes present in 6.6-rt stable (HEAD at 20fd4439f644 printk: nbcon:
> > move locked_port flag to struct uart_port) to CentOS Stream 9 for
> > performance improvements and printk-related bugfixes.
> >
> > Since merging said code, we've hit an interesting bug during testing:
> > specifically, on larger systems, a softlockup may be reported by the
> > watchdog when there is a heavy amount of printing to tty devices
> > (whether it be through printk, /dev/kmsg, etc.).
> >
> > We have a modicum of reasonable suspicion to believe that
> > nbcon_reacquire, or some other nbcon mechanism, *may* be causing such
> > behavior.
> >
> > Since we've successfully reproduced this in the Kernel-ARK/Fedora-ELN
> > (osbuild-rt), and linux-rt-devel 6.10.rc4-rt6, we are reporting this
> > bug upstream.
> >
> > Anyway, here is a more in-depth description, along with some call traces.
> >
> > Description:
> >
> > On x86 systems with a large number of logical cores (nproc ~> 60), a
> > softlockup can be observed, with an accompanying call trace, when a
> > large amount of "printing" activity is taking place.
> >
> > As shown in the call traces appended below, during some kind of numa
> > balancing/numa_migration after a task_numa_fault -- where a set of
> > processes is being migrated/swapped between two CPUs -- there is a busy
> > thread that is being waited on (on the order of seconds), causing a
> > livelock. Additional investigation of collected vmcores by toggling
> > panic on softlockup shows that the waiting thread may be waiting for a
> > thread looping with nbcon_reacquire.
> >
> > I suspect that some logic within nbcon_context_try_acquire may be a
> > good place to start. My understanding of the code becomes a bit fuzzy
> > here, so apologies in advance for any erroneous statements. As I see
> > it, there may be something happening during migration (or under heavy
> > load) in which nbcon_reacquire() is in a non-migratable or
> > non-preemptible state as it is attempting to regain access to a
> > lost/taken console.
> > It could very well be a situation in which context was forcefully
> > taken from the printing thread.
> >
> > Alternatively, Andrew Halaney <ahalaney@xxxxxxxxxx> suspects that it is
> > the loop within nbcon_kthread_func() -- since there is nothing that
> > would yield the task in said loop (cant_migrate()), the migrate code
> > would essentially be waiting forever for the aforementioned loop to
> > "finish". I believe in PREEMPT_RT there would be a preemption point
> > here. Furthermore, in his investigation, there were signs that the loop
> > was just iterating up until the crash, leaving reason to believe that
> > task would be the culprit.
> >
> > In fact, with the following diff, we noticed this output:
> >
> > ```
> > ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % git diff | cat
> > diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> > index bb9689f94d30..d716b72bf2f8 100644
> > --- a/kernel/printk/nbcon.c
> > +++ b/kernel/printk/nbcon.c
> > @@ -1075,6 +1075,7 @@ static int nbcon_kthread_func(void *__console)
> >   *
> >   * This pairs with rcuwait_has_sleeper:A and nbcon_kthread_wake:A
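(For completeness, since Song asked: the softlockup-time dump hack I
mentioned at the top looked roughly like the sketch below. This is from
memory and purely illustrative -- the hunk line numbers are omitted, the
exact spot inside watchdog_timer_fn() in kernel/watchdog.c differs between
kernel versions, and depending on the tree you may need an explicit
#include <linux/ftrace.h> for ftrace_dump().)

```
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		/* ... soft lockup has just been detected on this CPU ... */
+		/*
+		 * Freeze the ftrace ring buffer so the history leading up to
+		 * the lockup isn't overwritten, then dump every CPU's buffer
+		 * to the console before the lockup report goes out.
+		 */
+		tracing_off();
+		ftrace_dump(DUMP_ALL);
 		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
```

The tracing_off() first is just so new events stop flowing into the buffer
while ftrace_dump() walks it, i.e. the dump reflects what happened right
before the watchdog fired rather than the dump itself.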