Hi Derek,

On 2024-06-18, Derek Barbosa <debarbos@xxxxxxxxxx> wrote:
> The realtime team at Red Hat has recently backported the latest printk
> changes present in 6.6-rt stable (HEAD at 20fd4439f644 printk: nbcon:
> move locked_port flag to struct uart_port) to CentOS Stream 9 for
> performance improvements and printk-related bugfixes.
>
> Since merging this code, we've hit an interesting bug during testing.
> Specifically, on larger systems, a softlockup may be reported by the
> watchdog when there is a heavy amount of printing to tty devices
> (whether it be through printk, /dev/kmsg, etc).
>
> We have reasonable suspicion that nbcon_reacquire, or some other nbcon
> mechanism, *may* be causing such behavior.
>
> Since we've successfully reproduced this in the Kernel-ARK/Fedora-ELN
> (osbuild-rt) and linux-rt-devel 6.10.rc4-rt6, we are reporting this
> bug upstream.
>
> Anyway, here is a more in-depth description, along with some call
> traces.
>
> Description:
>
> On x86 systems with a large number of logical cores (nproc ~> 60), a
> softlockup can be observed, with an accompanying call trace, when a
> large amount of "printing" activity is taking place.
>
> As shown in the call traces appended below, during some kind of NUMA
> balancing/numa_migration after a task_numa_fault -- where a set of
> processes is being migrated/swapped between two CPUs -- there is a
> busy thread that is being waited on (in the order of seconds), causing
> a livelock. Additional investigation of collected vmcores by toggling
> panic on softlockup shows that the waiting thread may be waiting for a
> thread looping in nbcon_reacquire.
>
> I suspect that some logic within nbcon_context_try_acquire may be a
> good place to start. My understanding of the code becomes a bit fuzzy
> here, so apologies in advance for any erroneous statements. As I see
> it, there may be something happening during migration (or under heavy
> load) in which nbcon_reacquire() is in a non-migratable or
> non-preemptible state as it is attempting to regain access to a
> lost/taken console. It could very well be a situation in which context
> was forcefully taken from the printing thread.
>
> Alternatively, Andrew Halaney <ahalaney@xxxxxxxxxx> suspects that it
> is the loop within nbcon_kthread_func() -- since there is nothing that
> would yield the task in said loop (cant_migrate()), the migrate code
> would essentially be waiting forever for the aforementioned loop to
> "finish". I believe that under PREEMPT_RT there would be a preemption
> point here. Furthermore, in his investigation, there were signs that
> the loop was just iterating up until the crash, leaving reason to
> believe that task would be the culprit.
>
> In fact, with the following diff, we noticed this output:
>
> ```
> ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % git diff | cat
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index bb9689f94d30..d716b72bf2f8 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -1075,6 +1075,7 @@ static int nbcon_kthread_func(void *__console)
>           *
>           * This pairs with rcuwait_has_sleeper:A and nbcon_kthread_wake:A.
>           */
> +        trace_printk("Before rcuwait_wait_event()\n");
>          ret = rcuwait_wait_event(&con->rcuwait,
>                          nbcon_kthread_should_wakeup(con, ctxt),
>                          TASK_INTERRUPTIBLE); /* LMM(nbcon_kthread_func:A) */
> @@ -1086,7 +1087,10 @@ static int nbcon_kthread_func(void *__console)
>          if (ret)
>                  goto wait_for_event;
>
> +        trace_printk("Starting do while(backlog) loop \n");
> +        unsigned long int loop = 0;
>          do {
> +                trace_printk("On loop %lu\n", ++loop);
>                  backlog = false;
>
>                  cookie = console_srcu_read_lock();
> @@ -1121,6 +1125,7 @@ static int nbcon_kthread_func(void *__console)
>                  console_srcu_read_unlock(cookie);
>
>          } while (backlog);
> +        trace_printk("End of while(backlog) loop, looped %lu times \n", loop);
>
>          goto wait_for_event;
> ```
>
> The output:
>
> ```
> [ 1681.309720] pr/ttyS0-18  21.N... 893365994us : nbcon_kthread_func: On loop 2117
> [ 1681.309727] pr/ttyS0-18  21.N... 893374418us : nbcon_kthread_func: On loop 2118
> [ 1681.309734] pr/ttyS0-18  21.N... 893382860us : nbcon_kthread_func: On loop 2119
> (...)
> [ 1681.396193] pr/ttyS0-18  21.N... 954571399us : nbcon_kthread_func: On loop 14025
> [ 1681.396200] pr/ttyS0-18  21.N... 954576525us : nbcon_kthread_func: On loop 14026
> [ 1681.396207] pr/ttyS0-18  21.N... 954581561us : nbcon_kthread_func: On loop 14027
> [ 1681.396213] pr/ttyS0-18  21.N... 954584954us : nbcon_kthread_func: On loop 14028
> [ 1681.396220] pr/ttyS0-18  21.N... 954590111us : nbcon_kthread_func: On loop 14029
> [ 1681.396223] ---------------------------------
> [ 1681.396230] Kernel panic - not syncing: softlockup: hung tasks
> (...)
> ```
>
> This demonstrates that the nbcon kthread function is invoked
> continuously up until the point of panic. I do believe that this
> approach is more sound than my initial hints (or it could be more than
> a handful of threads). Some traces around
> serial8250_console_write_thread() also show continuous calls without
> any holdups.
>
> As a side note, we are seeing the softlockup with !CONFIG_PREEMPT_RT.

This trace shows that the thread is successfully printing lines. So I
doubt nbcon_reacquire() is involved here.

Assuming the ringbuffer is getting filled as fast as or faster than the
thread can print, then we effectively have this:

DEFINE_STATIC_SRCU(test_srcu);
static DEFINE_SPINLOCK(test_lock);

static int kthread_func(void *arg)
{
	unsigned long flags;
	int idx;

	do {
		idx = srcu_read_lock_nmisafe(&test_srcu);
		spin_lock_irqsave(&test_lock, flags);
		udelay(5000); /* print a line to serial */
		spin_unlock_irqrestore(&test_lock, flags);
		srcu_read_unlock_nmisafe(&test_srcu, idx);
	} while (true);

	return 0;
}

And since the thread has a nice value of -20, it will get a lot of CPU
time allocated to it. Is that a problem? Shouldn't the scheduler
eventually kick the task off the CPU after its timeslice is up?
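For completeness, here is a minimal sketch of how such a test thread
could be started with the same nice value as the nbcon printer threads.
kthread_func() is the function above; test_kthread_start and the thread
name are made up for illustration:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *test_task;

/* Hypothetical helper: run kthread_func() in its own kthread and
 * mimic the printer threads' nice value of -20. */
static int test_kthread_start(void)
{
	test_task = kthread_run(kthread_func, NULL, "nbcon_test");
	if (IS_ERR(test_task))
		return PTR_ERR(test_task);

	set_user_nice(test_task, -20);
	return 0;
}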
> Some questions arise from this, as we've never encountered this in our
> testing with John Ogness' console_blast (kudos to <jwyatt@xxxxxxxxxx>)
> and other printk torture tests that have been compiled
> [here](https://gitlab.com/debarbos/printktorture).

Yes, that is odd since those tests will ensure that the printing thread
never exits its printing loop because it will never catch up. So it
should be the same situation.

> We are curious to understand why the printing thread is chosen by the
> NUMA balancer for migration/swap, and how that interaction is handled
> within the code (in other words, how else would nbcon handle a
> migrated printing thread?).

The nbcon console can only be owned when migration is disabled. In the
case of the printing thread for serial, this is under the
spin_lock_irqsave(). The NUMA balancer would only be able to migrate
the thread outside of the spin_lock critical section. And outside of
the spin_lock critical section, the thread does not own/hold any
resources at all. So it should be no problem to migrate it.
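To illustrate the point, a rough sketch of that pattern (a hypothetical
helper, not the actual serial8250 callback): everything the thread owns
is taken and released inside the locked section, and on PREEMPT_RT the
spin_lock also implies migrate_disable():

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(port_lock);	/* stand-in for the port lock */

/* Hypothetical helper showing the ownership window described above:
 * nbcon ownership exists only inside the locked section, so migration
 * can only happen between calls, where the thread holds nothing. */
static bool emit_one_record(void)
{
	unsigned long flags;
	bool more;

	spin_lock_irqsave(&port_lock, flags);	/* migrate_disable() on RT */
	/* ... acquire nbcon ownership, emit one record, release it ... */
	more = true;				/* backlog still pending? */
	spin_unlock_irqrestore(&port_lock, flags);

	return more;
}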
> Our next round of tests aims to disable NUMA balancing on large
> multi-core systems to determine whether the NUMA mechanisms + nbcon
> interactions are at fault here.

I am curious if starting a kthread using the code I wrote above (with
nice=-20) would cause similar issues.

> How to reproduce:

I am currently traveling, so my time/resources are limited. But next
week I will be able to look into this in detail.

Thanks for this work!

John Ogness