On Tue, Jun 18, 2024 at 09:03:00PM GMT, John Ogness wrote:
> Hi Derek,
>
> On 2024-06-18, Derek Barbosa <debarbos@xxxxxxxxxx> wrote:
> > The realtime team at Red Hat has recently backported the latest printk
> > changes present in 6.6-rt stable (HEAD at 20fd4439f644 printk: nbcon:
> > move locked_port flag to struct uart_port) to CentOS Stream 9 for
> > performance improvements and printk-related bugfixes.
> >
> > Since merging this code, we've hit an interesting bug during testing:
> > on larger systems, a softlockup may be reported by the watchdog when
> > there is a heavy amount of printing to tty devices (whether through
> > printk, /dev/kmsg, etc.).
> >
> > We have reasonable suspicion that nbcon_reacquire(), or some other
> > nbcon mechanism, *may* be causing this behavior.
> >
> > Since we've successfully reproduced this in the Kernel-ARK/Fedora-ELN
> > (osbuild-rt) and linux-rt-devel 6.10.rc4-rt6, we are reporting this
> > bug upstream.
> >
> > Anyway, here is a more in-depth description, along with some call
> > traces.
> >
> > Description:
> >
> > On x86 systems with a large number of logical cores (nproc ~> 60), a
> > softlockup can be observed with an accompanying call trace when a
> > large amount of "printing" activity is taking place.
> >
> > As shown in the call traces appended below, during some kind of NUMA
> > balancing/numa_migration after a task_numa_fault -- where a set of
> > processes are being migrated/swapped between two CPUs -- there is a
> > busy thread that is being waited on (on the order of seconds), causing
> > a livelock. Additional investigation of collected vmcores (by toggling
> > panic on softlockup) shows that the waiting thread may be waiting for
> > a thread looping with nbcon_reacquire().
> >
> > I suspect that some logic within nbcon_context_try_acquire() may be a
> > good place to start. My understanding of the code becomes a bit fuzzy
> > here, so apologies in advance for any erroneous statements. As I see
> > it, there may be something happening during migration (or under heavy
> > load) in which nbcon_reacquire() is in a non-migratable or
> > non-preemptible state as it is attempting to regain access to a
> > lost/taken console. It could very well be a situation in which the
> > context was forcefully taken from the printing thread.
> >
> > Alternatively, Andrew Halaney <ahalaney@xxxxxxxxxx> suspects that it
> > is the loop within nbcon_kthread_func() -- since there is nothing that
> > would yield the task in said loop (cant_migrate()), the migrate code
> > would essentially be waiting forever for the aforementioned loop to
> > "finish". I believe that in PREEMPT_RT there would be a preemption
> > point here. Furthermore, in his investigation, there were signs that
> > the loop was just iterating up until the crash, leaving reason to
> > believe that this task is the culprit.
> >
> > In fact, with the following diff, we noticed this output:
> >
> > ```
> > ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % git diff | cat
> > diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> > index bb9689f94d30..d716b72bf2f8 100644
> > --- a/kernel/printk/nbcon.c
> > +++ b/kernel/printk/nbcon.c
> > @@ -1075,6 +1075,7 @@ static int nbcon_kthread_func(void *__console)
> >          *
> >          * This pairs with rcuwait_has_sleeper:A and nbcon_kthread_wake:A.
> >          */
> > +       trace_printk("Before rcuwait_wait_event()\n");
> >         ret = rcuwait_wait_event(&con->rcuwait,
> >                                  nbcon_kthread_should_wakeup(con, ctxt),
> >                                  TASK_INTERRUPTIBLE); /* LMM(nbcon_kthread_func:A) */
> > @@ -1086,7 +1087,10 @@ static int nbcon_kthread_func(void *__console)
> >         if (ret)
> >                 goto wait_for_event;
> >
> > +       trace_printk("Starting do while(backlog) loop\n");
> > +       unsigned long loop = 0;
> >         do {
> > +               trace_printk("On loop %lu\n", ++loop);
> >                 backlog = false;
> >
> >                 cookie = console_srcu_read_lock();
> > @@ -1121,6 +1125,7 @@ static int nbcon_kthread_func(void *__console)
> >                 console_srcu_read_unlock(cookie);
> >
> >         } while (backlog);
> > +       trace_printk("End of while(backlog) loop, looped %lu times\n", loop);
> >
> >         goto wait_for_event;
> > ```
> >
> > The output:
> >
> > ```
> > [ 1681.309720] pr/ttyS0-18 21.N... 893365994us : nbcon_kthread_func: On loop 2117
> > [ 1681.309727] pr/ttyS0-18 21.N... 893374418us : nbcon_kthread_func: On loop 2118
> > [ 1681.309734] pr/ttyS0-18 21.N... 893382860us : nbcon_kthread_func: On loop 2119
> > (...)
> > [ 1681.396193] pr/ttyS0-18 21.N... 954571399us : nbcon_kthread_func: On loop 14025
> > [ 1681.396200] pr/ttyS0-18 21.N... 954576525us : nbcon_kthread_func: On loop 14026
> > [ 1681.396207] pr/ttyS0-18 21.N... 954581561us : nbcon_kthread_func: On loop 14027
> > [ 1681.396213] pr/ttyS0-18 21.N... 954584954us : nbcon_kthread_func: On loop 14028
> > [ 1681.396220] pr/ttyS0-18 21.N... 954590111us : nbcon_kthread_func: On loop 14029
> > [ 1681.396223] ---------------------------------
> > [ 1681.396230] Kernel panic - not syncing: softlockup: hung tasks
> > (...)
> > ```
> >
> > This demonstrates that the nbcon kthread function is invoked
> > continuously up until the point of panic. I do believe that this
> > approach is more sound than my initial hints (or it could be more
> > than a handful of threads). Some traces around
> > serial8250_console_write_thread() also show continuous calls without
> > any holdups.
> >
> > As a sidenote, we are also seeing the softlockup with
> > !CONFIG_PREEMPT_RT.
>
> This trace shows that the thread is successfully printing lines, so I
> doubt nbcon_reacquire() is involved here.
>
> Assuming the ringbuffer is getting filled as fast as or faster than
> the thread can print, then we effectively have this:
>
> DEFINE_STATIC_SRCU(test_srcu);
> static DEFINE_SPINLOCK(test_lock);
>
> static int kthread_func(void *arg)
> {
>         unsigned long flags;
>
>         do {
>                 int cookie = srcu_read_lock_nmisafe(&test_srcu);
>
>                 spin_lock_irqsave(&test_lock, flags);
>                 udelay(5000); // print a line to serial
>                 spin_unlock_irqrestore(&test_lock, flags);
>                 srcu_read_unlock_nmisafe(&test_srcu, cookie);
>         } while (true);
>
>         return 0;
> }
>
> And since the thread has a nice value of -20, it will get a lot of CPU
> time allocated to it. Is that a problem? Shouldn't the scheduler
> eventually kick the task off the CPU after its timeslice is up?

I trust you better than myself about this, but this is being reproduced
with a CONFIG_PREEMPT_DYNAMIC=y + CONFIG_PREEMPT_VOLUNTARY=y setup (so
essentially the current mode is VOLUNTARY). Does that actually work that
way for a kthread in that mode? I've been trying to reason with myself
about when the scheduler will actually get involved and stop the above
kthread_func() to run something else.
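To make that question concrete, here is the kind of explicit preemption
point I have in mind. This is only a sketch reusing the names from your
snippet above, and the cond_resched() placement is hypothetical -- it
illustrates what the loop currently lacks, not a proposed fix:

#include <linux/delay.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(test_srcu);
static DEFINE_SPINLOCK(test_lock);

/*
 * Under CONFIG_PREEMPT_VOLUNTARY, a kernel thread that never blocks is
 * (as I understand it) only rescheduled at explicit preemption points.
 * Without the cond_resched() below, this loop can hold the CPU
 * indefinitely.
 */
static int kthread_func(void *arg)
{
        unsigned long flags;

        do {
                int cookie = srcu_read_lock_nmisafe(&test_srcu);

                spin_lock_irqsave(&test_lock, flags);
                udelay(5000); /* stand-in for emitting one record to serial */
                spin_unlock_irqrestore(&test_lock, flags);
                srcu_read_unlock_nmisafe(&test_srcu, cookie);

                cond_resched(); /* hypothetical voluntary preemption point */
        } while (true);

        return 0;
}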
> > Some questions arise from this, as we've never encountered this in
> > our testing with John Ogness' console_blast (kudos to
> > <jwyatt@xxxxxxxxxx>) and other printk torture tests that have been
> > compiled here: https://gitlab.com/debarbos/printktorture
>
> Yes, that is odd, since those tests ensure that the printing thread
> never exits its printing loop because it never catches up. So it
> should be the same situation.
>
> > We are curious to understand why it is that the printing thread is
> > chosen by the NUMA balancer for migration/swap, and how that
> > interaction is handled within the code (in other words, how else
> > would nbcon handle a migrated printing thread?).
>
> The nbcon console can only be owned when migration is disabled. In the
> case of the printing thread for serial, this is under the
> spin_lock_irqsave(). The NUMA balancer would only be able to migrate
> the thread outside of the spin_lock critical section. And outside of
> the spin_lock critical section, the thread does not own/hold any
> resources at all. So it should be no problem to migrate it.
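For my own understanding, here is how I picture that window. This is
only a sketch of the pattern you describe, not the actual nbcon/serial
code -- the function and its parameters are made up for illustration,
but the migrate_disable()/migrate_enable() behavior of a non-raw
spinlock_t on PREEMPT_RT is the real mechanism:

#include <linux/serial_core.h>
#include <linux/spinlock.h>

/*
 * On PREEMPT_RT, taking a (non-raw) spinlock_t implies migrate_disable(),
 * so the task is pinned to its CPU while a record is being emitted. The
 * NUMA balancer can only migrate/swap the task in the gap between
 * records.
 */
static void emit_records_sketch(struct uart_port *port, bool *backlog)
{
        unsigned long flags;

        while (*backlog) {
                spin_lock_irqsave(&port->lock, flags); /* migrate_disable() on RT */
                /* console owned: write one record to the UART */
                spin_unlock_irqrestore(&port->lock, flags); /* migrate_enable() */
                /* only here, between records, can migrate_swap() move us */
        }
}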
> > Our next round of tests aims to disable NUMA balancing on large
> > multi-core systems to determine whether the NUMA mechanisms + nbcon
> > interactions are at fault here.
>
> I am curious if starting a kthread using the code I wrote above (with
> nice=-20) would cause similar issues.

Just in case I did something dumb, here's the module I wrote up:

ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % cat kernel/printk/test_thread.c
/*
 * Test making a kthread similar to nbcon's (under load)
 * to see if it also has issues with migrate_swap()
 */
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/module.h>
#include <linux/nmi.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(test_srcu);
static DEFINE_SPINLOCK(test_lock);
static struct task_struct *kt;
static bool dont_stop = true;

static int test_thread_func(void *unused)
{
        unsigned long flags;

        pr_info("Starting the while true loop\n");
        do {
                int cookie = srcu_read_lock_nmisafe(&test_srcu);

                spin_lock_irqsave(&test_lock, flags);
                touch_nmi_watchdog();
                udelay(5000); // print a line to serial
                spin_unlock_irqrestore(&test_lock, flags);
                srcu_read_unlock_nmisafe(&test_srcu, cookie);
        } while (dont_stop);

        return 0;
}

static int __init test_thread_init(void)
{
        pr_info("Creating test_thread at -20 nice level\n");
        kt = kthread_run(test_thread_func, NULL, "test_thread");
        if (IS_ERR(kt)) {
                pr_err("Failed to make test_thread\n");
                return PTR_ERR(kt);
        }
        sched_set_normal(kt, -20);
        return 0;
}

static void __exit test_thread_exit(void)
{
        dont_stop = false;
        kthread_stop(kt);
}

module_init(test_thread_init);
module_exit(test_thread_exit);

MODULE_LICENSE("GPL");
ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] %

That shows a softlockup quite quickly on the CPU that thread is running
on (as opposed to what Derek reports, where migrate_swap() is going on
and the softlockup is reported on the other CPU involved in the task
swap). I guess that's because of the touch_nmi_watchdog() happening in
serial8250_console_write_thread(). The below is without the
touch_nmi_watchdog() in the above snippet (just to show what happens as
written in your reply).

[ 72.018480] Creating test_thread at -20 nice level
[ 72.018632] Starting the while true loop
[ 99.673116] watchdog: BUG: soft lockup - CPU#53 stuck for 26s! [test_thread:2628]
[ 99.673119] Modules linked in: test_thread rfkill intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel mei_me ipmi_si kvm mei acpi_power_meter i2c_i801 mgag200 iTCO_wdt rapl acpi_ipmi i2c_algo_bit iTCO_vendor_support mxm_wmi pcspkr i2c_smbus joydev lpc_ich ioatdma intel_cstate intel_uncore ipmi_devintf ipmi_msghandler acpi_pad fuse xfs sd_mod t10_pi sg ahci crct10dif_pclmul libahci crc32_pclmul ixgbe crc32c_intel libata megaraid_sas ghash_clmulni_intel mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod
[ 99.673159] CPU: 53 PID: 2628 Comm: test_thread Kdump: loaded Not tainted 6.10.0-rc4-rt6+ #7
[ 99.673162] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0018.072020161249 07/20/2016
[ 99.673163] RIP: 0010:_raw_spin_unlock_irqrestore+0x1f/0x40
[ 99.673179] Code: 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 01 fb 65 ff 0d 09 17 9c 4a <74> 06 c3 cc cc cc cc cc 0f 1f 44 00 00 c3 cc cc cc cc cc 66 66 66
[ 99.673180] RSP: 0018:ffffb10887dafed0 EFLAGS: 00000246
[ 99.673182] RAX: 00000000e3ddc104 RBX: 0000000000000000 RCX: 0000000000001035
[ 99.673183] RDX: 0000000000af11a8 RSI: 0000000000000286 RDI: ffffffffc0e1b700
[ 99.673184] RBP: ffffb1088a4c77b0 R08: 0000000000000035 R09: 0000000000000035
[ 99.673185] R10: 0000000000017ffd R11: ffffffffb5649760 R12: ffff8b5f0caa4f00
[ 99.673186] R13: ffff8b4f87c04e80 R14: 0000000000000286 R15: ffff8b5f200e3380
[ 99.673187] FS: 0000000000000000(0000) GS:ffff8b6ebf880000(0000) knlGS:0000000000000000
[ 99.673189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.673190] CR2: 000055a7bd4f3b68 CR3: 0000001f35820003 CR4: 00000000001706f0
[ 99.673191] Call Trace:
[ 99.673192] <IRQ>
[ 99.673193] ? watchdog_timer_fn+0x21f/0x2a0
[ 99.673197] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 99.673198] ? __hrtimer_run_queues+0xfa/0x270
[ 99.673202] ? hrtimer_interrupt+0xf4/0x390
[ 99.673205] ? __sysvec_apic_timer_interrupt+0x52/0x160
[ 99.673208] ? sysvec_apic_timer_interrupt+0x6f/0x80
[ 99.673210] </IRQ>
[ 99.673211] <TASK>
[ 99.673212] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 99.673215] ? __pfx_delay_tsc+0x10/0x10
[ 99.673221] ? _raw_spin_unlock_irqrestore+0x1f/0x40
[ 99.673222] test_thread_func+0x5e/0xff0 [test_thread]
[ 99.673225] ? __pfx_test_thread_func+0x10/0x10 [test_thread]
[ 99.673226] kthread+0xec/0x110
[ 99.673230] ? __pfx_kthread+0x10/0x10
[ 99.673232] ret_from_fork+0x3a/0x50
[ 99.673235] ? __pfx_kthread+0x10/0x10
[ 99.673237] ret_from_fork_asm+0x1a/0x30
[ 99.673239] </TASK>

If you mimic that touch_nmi_watchdog() above the udelay() in the
function above, then you have to get the system to decide to migrate.
Something like:

    stress-ng --timeout 60000s --numa 64

seems to help tickle problems out. With that I got a slightly different
backtrace than what Derek reported (or the one above), but I'd guess
it's a similar root cause:

[ 258.175904] Creating test_thread at -20 nice level
[ 258.176237] Starting the while true loop
[ 495.910816] INFO: task khugepaged:483 blocked for more than 122 seconds.
[ 495.910824] Not tainted 6.10.0-rc4-rt6+ #7
[ 495.910827] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 495.910829] task:khugepaged state:D stack:0 pid:483 tgid:483 ppid:2 flags:0x00004000
[ 495.910834] Call Trace:
[ 495.910836] <TASK>
[ 495.910838] __schedule+0x5e9/0x1420
[ 495.910845] ? update_load_avg+0x1fb/0x860
[ 495.910851] ? __update_curr+0x15d/0x3a0
[ 495.910854] schedule+0x6d/0xf0
[ 495.910857] schedule_timeout+0x32/0x1b0
[ 495.910860] ? sched_clock+0x10/0x30
[ 495.910866] wait_for_common+0xfe/0x1c0
[ 495.910869] __flush_work+0x258/0x2d0
[ 495.910875] ? __pfx_wq_barrier_func+0x10/0x10
[ 495.910878] __lru_add_drain_all+0x19d/0x1e0
[ 495.910882] khugepaged+0x160/0xa90
[ 495.910887] ? __pfx_autoremove_wake_function+0x10/0x10
[ 495.910892] ? __pfx_khugepaged+0x10/0x10
[ 495.910894] kthread+0xec/0x110
[ 495.910898] ? __pfx_kthread+0x10/0x10
[ 495.910900] ret_from_fork+0x3a/0x50
[ 495.910902] ? __pfx_kthread+0x10/0x10
[ 495.910904] ret_from_fork_asm+0x1a/0x30
[ 495.910907] </TASK>

I got that once this afternoon, haven't gotten anything since (was
hoping to see the exact same thing as Derek with that, but oh well).

Thanks,
Andrew