Re: [BUG] printk/nbcon.c: watchdog BUG: softlockup - CPU#x stuck for 78s

John Ogness <john.ogness@xxxxxxxxxxxxx> · Wed, 19 Jun 2024 07:15:31 +0206

[ Explicitly added tglx, hoping he can chime in here. ]

On 2024-06-18, Andrew Halaney <ahalaney@xxxxxxxxxx> wrote:
>> Shouldn't the scheduler eventually kick the task off the CPU after
>> its timeslice is up?
>
> I trust you better than myself about this, but this is being
> reproduced with a CONFIG_PREEMPT_DYNAMIC=y +
> CONFIG_PREEMPT_VOLUNTARY=y setup (so essentially the current mode is
> VOLUNTARY). Does that actually work that way for a kthread in that
> mode?

It would be good not to trust me better than yourself. I actually have
very little experience with the non-RT preemption models. I will need to
investigate this further.

> Just in case I did something dumb, here's the module I wrote up:
>
> ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % cat kernel/printk/test_thread.c
> /*
>  * Test making a kthread similar to nbcon's (under load)
>  * to see if it also has issues with migrate_swap()
>  */
> #include <linux/delay.h>
> #include <linux/kthread.h>
> #include <linux/module.h>
> #include <linux/nmi.h>
> #include <linux/sched.h>
> #include <linux/spinlock.h>
> #include <linux/srcu.h>
>
> DEFINE_STATIC_SRCU(test_srcu);
> static DEFINE_SPINLOCK(test_lock);
> static struct task_struct *kt;
> static bool dont_stop = true;
>
> static int test_thread_func(void *unused) {
> 	unsigned long flags;
>
> 	pr_info("Starting the while true loop\n");
> 	do {
> 		int cookie = srcu_read_lock_nmisafe(&test_srcu);
> 		spin_lock_irqsave(&test_lock, flags);
> 		touch_nmi_watchdog();
> 		udelay(5000);  /* stand-in for printing one line to serial */
> 		spin_unlock_irqrestore(&test_lock, flags);
> 		srcu_read_unlock_nmisafe(&test_srcu, cookie);
> 	} while (dont_stop);
>
> 	return 0;
> }
>
> static int __init test_thread_init(void) {
>
> 	pr_info("Creating test_thread at -20 nice level\n");
> 	kt = kthread_run(test_thread_func, NULL, "test_thread");
> 	if (IS_ERR(kt)) {
> 		pr_err("Failed to make test_thread\n");
> 		return PTR_ERR(kt);
> 	}
> 	sched_set_normal(kt, -20);
>
> 	return 0;
> }
>
> static void __exit test_thread_exit(void) {
> 	dont_stop = false;
> 	kthread_stop(kt);
> }
>
> module_init(test_thread_init);
> module_exit(test_thread_exit);
> MODULE_LICENSE("GPL");

Thanks for the functional test! This should quite accurately reproduce
the situation where the printing thread is unable to keep up with the
volume of incoming messages.

An explicit call into the scheduler may be needed, for example
cond_resched() placed outside the critical section, before the loop
repeats. We would like to remove such explicit preemption points from
the kernel code, but perhaps one is necessary for the VOLUNTARY
preemption scheme.
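
As a minimal, untested sketch of that idea: the thread function from the
test module above, unchanged except for a cond_resched() call at the end
of each iteration. The placement is my assumption of what "outside the
critical section" should mean here (spinlock dropped, interrupts
restored, SRCU read side released). Whether this is actually sufficient
under VOLUNTARY still needs to be verified.

```c
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/nmi.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(test_srcu);
static DEFINE_SPINLOCK(test_lock);
static bool dont_stop = true;

static int test_thread_func(void *unused)
{
	unsigned long flags;

	pr_info("Starting the while true loop\n");
	do {
		int cookie = srcu_read_lock_nmisafe(&test_srcu);

		spin_lock_irqsave(&test_lock, flags);
		touch_nmi_watchdog();
		udelay(5000);	/* stand-in for printing one line to serial */
		spin_unlock_irqrestore(&test_lock, flags);
		srcu_read_unlock_nmisafe(&test_srcu, cookie);

		/*
		 * Explicit preemption point (assumed placement): no lock
		 * is held, interrupts are enabled again and the SRCU read
		 * side has been released, so the scheduler may switch to
		 * another task here if a reschedule is pending.
		 */
		cond_resched();
	} while (dont_stop);

	return 0;
}
```

The rest of the module (init/exit and the -20 nice level) would stay as
it is, so any change in behavior can be attributed to the added
preemption point alone.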

John