On Tue 2024-06-18 17:52:19, Andrew Halaney wrote: > On Tue, Jun 18, 2024 at 09:03:00PM GMT, John Ogness wrote: > > Hi Derek, > > > > On 2024-06-18, Derek Barbosa <debarbos@xxxxxxxxxx> wrote: > > > The realtime team at Red Hat has recently backported the latest printk > > > changes present in 6.6-rt stable (HEAD at 20fd4439f644 printk: nbcon: > > > move locked_port flag to struct uart_port) to CentOS Stream 9 for > > > performance improvements and printk-related bugfixes. > > > > > > Since merging this said code, we've hit an interesting bug during > > > testing, specifically, on larger systems, a softlockup may be reported > > > by the watchdog when there is a heavy amount of printing to tty > > > devices (whether it be through printk, /dev/kmsg, etc). > > > > > > We have a modicum of reasonable suspicion to believe that > > > nbcon_reacquire, or some other nbcon mechanism *may* be causing such > > > behavior. > > > > > > Since we've succesfully reproduced this in the Kernel-ARK/Fedora-ELN > > > (osbuild-rt), and linux-rt-devel 6.10.rc4-rt6, we are reporting this > > > bug upstream. > > > > > > Anyway, here is a more in-depth description, along with some call traces. > > > > > > Description: > > > > > > On x86 systems with a large amount of logical cores (nproc ~> 60), a > > > softlockup can be observed with accompanying call trace when a large > > > amount of "printing" activity is taking place. > > > > > > As shown in the call traces appended below, during some kind of numa > > > balancing/numa_migration after a task_numa_fault --where a set of > > > processess are being migrated/swapped between two CPUs-- there is a > > > busy thread that is being waited on (in the order of seconds), causing > > > a livelock. Additional investigation of collected vmcores by toggling > > > panic on softlockup shows that the waiting thread may be waiting for a > > > thread looping with nbcon_reacquire. > > > > > > I suspect that some logic within nbcon_context_try_acquire may be a > > > good place to start. My understanding of the code becomes a bit fuzzy > > > here, so apologies in advance for any erroneous statements. As I see > > > it, there may be something happening during migration (or under heavy > > > load) in which nbcon_reacquire() is in a non-migratable or > > > non-preemtible state as it is attempting to regain access to a > > > lost/taken console. It could very well be a situation in which context > > > was forcefully taken from the printing thread. > > > > > > Alternatively, Andrew Halaney <ahalaney@xxxxxxxxxx> suspects that it > > > is the loop within nbcon_kthread_func() -- since there is nothing that > > > would yield the task in said loop (cant_migrate()), the migrate code > > > would be essentially waiting forever for the aforementioned loop to > > > "finish". I believe in PREEMPT_RT, there would be a preemption point > > > here. Furthermore, in his investigation, there were signs that the > > > loop was just iterating up until the crash, leaving reason to believe > > > that task would be the culprit. > > > > > > In fact, with the following diff, we noticed this output: > > > > > > ``` > > > ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % git diff | cat > > > diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c > > > index bb9689f94d30..d716b72bf2f8 100644 > > > --- a/kernel/printk/nbcon.c > > > +++ b/kernel/printk/nbcon.c > > > @@ -1075,6 +1075,7 @@ static int nbcon_kthread_func(void *__console) > > > * > > > * This pairs with rcuwait_has_sleeper:A and nbcon_kthread_wake:A. > > > */ > > > + trace_printk("Before rcuwait_wait_event()\n"); > > > ret = rcuwait_wait_event(&con->rcuwait, > > > nbcon_kthread_should_wakeup(con, ctxt), > > > TASK_INTERRUPTIBLE); /* LMM(nbcon_kthread_func:A) */ > > > @@ -1086,7 +1087,10 @@ static int nbcon_kthread_func(void *__console) > > > if (ret) > > > goto wait_for_event; > > > > > > + trace_printk("Starting do while(backlog) loop \n"); > > > + unsigned long int loop = 0; > > > do { > > > + trace_printk("On loop %lu\n", ++loop); > > > backlog = false; > > > > > > cookie = console_srcu_read_lock(); > > > @@ -1121,6 +1125,7 @@ static int nbcon_kthread_func(void *__console) > > > console_srcu_read_unlock(cookie); > > > > > > } while (backlog); > > > + trace_printk("End of while(backlog) loop, looped %lu times \n", loop); > > > > > > goto wait_for_event; > > > ``` > > > > > > The output: > > > > > > ``` > > > [ 1681.309720] pr/ttyS0-18 21.N... 893365994us : nbcon_kthread_func: On loop 2117 > > > [ 1681.309727] pr/ttyS0-18 21.N... 893374418us : nbcon_kthread_func: On loop 2118 > > > [ 1681.309734] pr/ttyS0-18 21.N... 893382860us : nbcon_kthread_func: On loop 2119 > > > (...) > > > [ 1681.396193] pr/ttyS0-18 21.N... 954571399us : nbcon_kthread_func: On loop 14025 > > > [ 1681.396200] pr/ttyS0-18 21.N... 954576525us : nbcon_kthread_func: On loop 14026 > > > [ 1681.396207] pr/ttyS0-18 21.N... 954581561us : nbcon_kthread_func: On loop 14027 > > > [ 1681.396213] pr/ttyS0-18 21.N... 954584954us : nbcon_kthread_func: On loop 14028 > > > [ 1681.396220] pr/ttyS0-18 21.N... 954590111us : nbcon_kthread_func: On loop 14029 > > > [ 1681.396223] --------------------------------- > > > [ 1681.396230] Kernel panic - not syncing: softlockup: hung tasks > > > (...) > > > ``` > > > > > > Demonstrating evidence that the nbcon kthread function is invoked > > > continuously up until the point of panic. I do believe that this > > > approach is more sound than my initial hints (or it could be more than > > > a handful of threads). Some traces around > > > serial8250_console_write_thread() also denote continuous calls without > > > any holdups. > > > > > > As a sidenote, we are seeing the softlockup with !CONFIG_PREEMPT_RT > > > > This trace shows that the thread is successfully printing > > lines. So I doubt nbcon_reacquire() is involved here. > > > > Assuming the ringbuffer is getting filled as fast or faster than the > > thread can print, then we effectively have this: > > > > DEFINE_STATIC_SRCU(test_srcu); > > static DEFINE_SPINLOCK(test_lock); > > > > static int kthread_func(void *arg) > > { > > unsigned long flags; > > > > do { > > srcu_read_lock_nmisafe(&test_srcu); > > spin_lock_irqsave(&test_lock, flags); > > udelay(5000); // print a line to serial > > spin_unlock_irqrestore(&test_lock, flags); > > srcu_read_unlock_nmisafe(&test_srcu); > > } while (true); > > > > return 0; > > } > > > > And since the thread has a nice value of -20, it will get a lot of CPU > > time allocated to it. Is that a problem? Shouldn't the scheduler > > eventually kick the task off the CPU after its timeslice is up? > > I trust you better than myself about this, but this is being reproduced > with a CONFIG_PREEMPT_DYNAMIC=y + CONFIG_PREEMPT_VOLUNTARY=y setup (so > essentially the current mode is VOLUNTARY). Does that actually work that > way for a kthread in that mode? I've been trying to reason with myself > on when the scheduler actually will get involved and stop the above > kthread_func() to run something else. > > > > > > Some questions arise from this, as we've never encountered this in our > > > testing with John Ogness' console_blast (kudos to <jwyatt@xxxxxxxxxx>) > > > and other printk torture tests that have been compiled > > > [here](https://gitlab.com/debarbos/printktorture). > > > > Yes, that is odd since those tests will ensure that the printing thread > > never exits its printing loop because it will never catch up. So it > > should be the same situation. > > > > > We are curious to understand why is it that the printing thread is > > > chosen by the NUMA balancer for migration/swap, and how that > > > interaction is handled within the code (in other words, how else would > > > nbcon handle a migrated printing thread?). > > > > The nbcon console can only be owned when migration is disabled. In the > > case of the printing thread for serial, this is under the > > spin_lock_irqsave(). The NUMA balancer would only be able to migrate the > > thread outside of the spin_lock critical section. And outside of the > > spin_lock critical section, the thread does not own/hold any resources > > at all. So it should be no problem to migrate it. > > > > > Our next round of tests aim to disable numa balancing on > > > large-multi-core-systems to determine whether it is the NUMA > > > mechanisms + nbcon interactions are at fault here. > > > > I am curious if starting a kthread using the code I wrote above (with > > nice=-20) would cause similar issues. > > > > Just in case I did something dumb, here's the module I wrote up: > > ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % cat kernel/printk/test_thread.c :( > /* > * Test making a kthread similar to nbcon's (under load) > * to see if it also has issues with migrate_swap() > */ > #include "linux/nmi.h" > #include <asm-generic/delay.h> > #include <linux/kthread.h> > #include <linux/module.h> > #include <linux/sched.h> > > DEFINE_STATIC_SRCU(test_srcu); > static DEFINE_SPINLOCK(test_lock); > static struct task_struct *kt; > static bool dont_stop = true; > > static int test_thread_func(void *unused) { > unsigned long flags; > > pr_info("Starting the while true loop\n"); > do { > int cookie = srcu_read_lock_nmisafe(&test_srcu); > spin_lock_irqsave(&test_lock, flags); > touch_nmi_watchdog(); > udelay(5000); // print a line to serial > spin_unlock_irqrestore(&test_lock, flags); > srcu_read_unlock_nmisafe(&test_srcu, cookie); Does it help to add here? cond_resched(); > } while (dont_stop); > > return 0; > } > > static int __init test_thread_init(void) { > > pr_info("Creating test_thread at -20 nice level\n"); > kt = kthread_run(test_thread_func, NULL, "test_thread"); > if (IS_ERR(kt)) { > pr_err("Failed to make test_thread\n"); > return PTR_ERR(kt); > } > sched_set_normal(kt, -20); > > return 0; > } > > static void __exit test_thread_exit(void) { > dont_stop = false; > kthread_stop(kt); > } > > module_init(test_thread_init); > module_exit(test_thread_exit); > MODULE_LICENSE("GPL"); > ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % > > That shows a softlockup quite quickly on the CPU that thread is > running on (as opposed to what Derek reports, where migrate_swap() is > going on and the softlockup reports on the other CPU in the swapping of > tasks). I guess that's because of the touch_nmi_watchdog() happening > in serial8250_console_write_thread(). > > The below is without the touch_nmi_watchdog() in the above snippet > (just to show what happens as written in your reply). > > [ 72.018480] Creating test_thread at -20 nice level > [ 72.018632] Starting the while true loop > [ 99.673116] watchdog: BUG: soft lockup - CPU#53 stuck for 26s! [test_thread:2628] > [ 99.673119] Modules linked in: test_thread rfkill intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel mei_me ipmi_si kvm mei acpi_power_meter i2c_i801 mgag200 iTCO_wdt rapl acpi_ipmi i2c_algo_bit iTCO_vendor_support mxm_wmi pcspkr i2c_smbus joydev lpc_ich ioatdma intel_cstate intel_uncore ipmi_devintf ipmi_msghandler acpi_pad fuse xfs sd_mod t10_pi sg ahci crct10dif_pclmul libahci crc32_pclmul ixgbe crc32c_intel libata megaraid_sas ghash_clmulni_intel mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod > [ 99.673159] CPU: 53 PID: 2628 Comm: test_thread Kdump: loaded Not tainted 6.10.0-rc4-rt6+ #7 > [ 99.673162] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0018.072020161249 07/20/2016 > [ 99.673163] RIP: 0010:_raw_spin_unlock_irqrestore+0x1f/0x40 This is the point where the interrupts were enabled. It allowed to handle pending timers, including watchdog_timer_fn(). It printed this softlockup report. It means that test_thread() never scheduled in the do while() loop. I believe that the cond_resched() would cure the problem. > [ 99.673179] Code: 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 01 fb 65 ff 0d 09 17 9c 4a <74> 06 c3 cc cc cc cc cc 0f 1f 44 00 00 c3 cc cc cc cc cc 66 66 66 > [ 99.673180] RSP: 0018:ffffb10887dafed0 EFLAGS: 00000246 > [ 99.673182] RAX: 00000000e3ddc104 RBX: 0000000000000000 RCX: 0000000000001035 > [ 99.673183] RDX: 0000000000af11a8 RSI: 0000000000000286 RDI: ffffffffc0e1b700 > [ 99.673184] RBP: ffffb1088a4c77b0 R08: 0000000000000035 R09: 0000000000000035 > [ 99.673185] R10: 0000000000017ffd R11: ffffffffb5649760 R12: ffff8b5f0caa4f00 > [ 99.673186] R13: ffff8b4f87c04e80 R14: 0000000000000286 R15: ffff8b5f200e3380 > [ 99.673187] FS: 0000000000000000(0000) GS:ffff8b6ebf880000(0000) knlGS:0000000000000000 > [ 99.673189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 99.673190] CR2: 000055a7bd4f3b68 CR3: 0000001f35820003 CR4: 00000000001706f0 > [ 99.673191] Call Trace: > [ 99.673192] <IRQ> > [ 99.673193] ? watchdog_timer_fn+0x21f/0x2a0 > [ 99.673197] ? __pfx_watchdog_timer_fn+0x10/0x10 > [ 99.673198] ? __hrtimer_run_queues+0xfa/0x270 > [ 99.673202] ? hrtimer_interrupt+0xf4/0x390 > [ 99.673205] ? __sysvec_apic_timer_interrupt+0x52/0x160 > [ 99.673208] ? sysvec_apic_timer_interrupt+0x6f/0x80 > [ 99.673210] </IRQ> > [ 99.673211] <TASK> > [ 99.673212] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20 > [ 99.673215] ? __pfx_delay_tsc+0x10/0x10 > [ 99.673221] ? _raw_spin_unlock_irqrestore+0x1f/0x40 > [ 99.673222] test_thread_func+0x5e/0xff0 [test_thread] > [ 99.673225] ? __pfx_test_thread_func+0x10/0x10 [test_thread] > [ 99.673226] kthread+0xec/0x110 > [ 99.673230] ? __pfx_kthread+0x10/0x10 > [ 99.673232] ret_from_fork+0x3a/0x50 > [ 99.673235] ? __pfx_kthread+0x10/0x10 > [ 99.673237] ret_from_fork_asm+0x1a/0x30 > [ 99.673239] </TASK> > > > If you mimic that touch_nmi_watchdog() above the udelay() in the function above, The touch_nmi_watchdog() caused that watchdog_timer_fn() did not see that "test_thread" kthread did not schedule. By other words, it did hide the problem. > then you have to get the system to decide to migrate. Something like: > > stress-ng --timeout 60000s --numa 64 > > seems to help tickle problems out. With that I got a bit different of a > backtrace than what Derek reported or above, but I'd guess its a similar > root cause: > > [ 258.175904] Creating test_thread at -20 nice level > [ 258.176237] Starting the while true loop > [ 495.910816] INFO: task khugepaged:483 blocked for more than 122 seconds. > [ 495.910824] Not tainted 6.10.0-rc4-rt6+ #7 This is from check_hung_task(). It means that the task TASK_UNINTERRUPTIBLE sleep for more than 122 second. > [ 495.910827] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 495.910829] task:khugepaged state:D stack:0 pid:483 tgid:483 ppid:2 flags:0x00004000 > [ 495.910834] Call Trace: > [ 495.910836] <TASK> > [ 495.910838] __schedule+0x5e9/0x1420 > [ 495.910845] ? update_load_avg+0x1fb/0x860 > [ 495.910851] ? __update_curr+0x15d/0x3a0 > [ 495.910854] schedule+0x6d/0xf0 > [ 495.910857] schedule_timeout+0x32/0x1b0 > [ 495.910860] ? sched_clock+0x10/0x30 > [ 495.910866] wait_for_common+0xfe/0x1c0 > [ 495.910869] __flush_work+0x258/0x2d0 I guess that it is the wait_for_completion() called from __flush_work(). > [ 495.910875] ? __pfx_wq_barrier_func+0x10/0x10 > [ 495.910878] __lru_add_drain_all+0x19d/0x1e0 __lru_add_drain_all() does for_each_cpu(cpu, &has_work) flush_work(&per_cpu(lru_add_drain_work, cpu)); It is most likely blocked when waiting on the CPU which is occupied by the "test_thread" kthread. IMHO, it is not related to a process migration. It is just yet another way how to get blocked by a task which is not scheduling. > [ 495.910882] khugepaged+0x160/0xa90 > [ 495.910887] ? __pfx_autoremove_wake_function+0x10/0x10 > [ 495.910892] ? __pfx_khugepaged+0x10/0x10 > [ 495.910894] kthread+0xec/0x110 > [ 495.910898] ? __pfx_kthread+0x10/0x10 > [ 495.910900] ret_from_fork+0x3a/0x50 > [ 495.910902] ? __pfx_kthread+0x10/0x10 > [ 495.910904] ret_from_fork_asm+0x1a/0x30 > [ 495.910907] </TASK> > > I got that once this afternoon, haven't gotten anything since (was > hoping to see the exact same thing as Derek with that, but oh well). Thanks a lot for proving these detailed debugging information. I think that we need to add the cond_resched() into the do-while() loop in nbcon_kthread_func(). Otherwise, the kthread would not schedule in this loop with a voluntary preemption. And it is a great catch! I haven't realized this when reviewing the code. Best Regards, Petr