[BUG] workqueues and printk not playing nice since next-20240130

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Fri, 2 Feb 2024 05:04:45 -0800

Hello!

Starting with next-20240130 (and perhaps a bit earlier), rcutorture gets
what initially looked like early-boot hangs, but only when running on
dual-socket x86 systems [1], as it works just fine on my x86 laptop [2].
But when running on dual-socket systems, this happens all the time,
perhaps because rcutorture works hard to split each guest OS across a
socket boundary.

This is the reproducer:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 1m --configs "10*TREE01" --trust-make

By "looked like early-boot hangs" I mean that qemu was quite happy,
but there was absolutely no console output.

Bisection identified this commit:

5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")

Reverting this commit made the problem go away.  Except that it is really
hard to imagine this commit having any effect whatsoever on early boot
execution.  Of course, this might be a failure of imagination on my part,
so I enlisted the aid of gdb:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 1m --configs "TREE01" --trust-make --gdb

After following the resulting gdb startup instructions and waiting for
about ten seconds, I hit control-C on the gdb window and then:

	(gdb) bt
	#0  default_idle () at arch/x86/kernel/process.c:743
	#1  0xffffffff81e94d34 in default_idle_call () at kernel/sched/idle.c:97
	#2  0xffffffff810d746d in cpuidle_idle_call () at kernel/sched/idle.c:170
	#3  do_idle () at kernel/sched/idle.c:312
	#4  0xffffffff810d76a4 in cpu_startup_entry (state=state@entry=CPUHP_ONLINE)
	    at kernel/sched/idle.c:410
	#5  0xffffffff81e95417 in rest_init () at init/main.c:730
	#6  0xffffffff8329adf2 in start_kernel () at init/main.c:1067
	#7  0xffffffff832a5038 in x86_64_start_reservations (
	    real_mode_data=real_mode_data@entry=0x13d50 <exception_stacks+32080> <error: Cannot access memory at address 0x13d50>) at arch/x86/kernel/head64.c:555
	#8  0xffffffff832a513c in x86_64_start_kernel (
	    real_mode_data=0x13d50 <exception_stacks+32080> <error: Cannot access memory at address 0x13d50>) at arch/x86/kernel/head64.c:536
	#9  0xffffffff810001d2 in secondary_startup_64 ()
	    at arch/x86/kernel/head_64.S:461
	#10 0x0000000000000000 in ?? ()
	(gdb) print jiffies
	$1 = 4294676330
	(gdb) print system_state
	$2 = SYSTEM_RUNNING

In other words, the system really has booted, and at least one CPU is
happily idling in the idle loop.  And another CPU is (maybe not quite
so happily) running rcutorture:

	(gdb) thread 6
	[Switching to thread 6 (Thread 1.6)]
	#0  0xffffffff8111160b in rcu_torture_one_read (
	    trsp=trsp@entry=0xffffc900004abe90, myid=myid@entry=4)
	    at kernel/rcu/rcutorture.c:2003
	2003            completed = cur_ops->get_gp_seq();
	(gdb) bt
	#0  0xffffffff8111160b in rcu_torture_one_read (
	    trsp=trsp@entry=0xffffc900004abe90, myid=myid@entry=4)
	    at kernel/rcu/rcutorture.c:2003
	#1  0xffffffff81111bef in rcu_torture_reader (arg=0x4 <fixed_percpu_data+4>)
	    at kernel/rcu/rcutorture.c:2097
	#2  0xffffffff810af3e0 in kthread (_create=0xffff8880047aa480)
	    at kernel/kthread.c:388
	#3  0xffffffff8103af1f in ret_from_fork (prev=<optimized out>,
	    regs=0xffffc900004abf58, fn=0xffffffff810af300 <kthread>,
	    fn_arg=0xffff8880047aa480) at arch/x86/kernel/process.c:147
	#4  0xffffffff8100247a in ret_from_fork_asm () at arch/x86/entry/entry_64.S:242
	#5  0x0000000000000000 in ?? ()

So the system really did boot and is running just fine.  It is just that
there is no console output.  Details, details!
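
As a sanity check on the jiffies value printed above, here is a minimal
sketch that converts it to time since boot.  It assumes HZ=1000, which
is a guess on my part because the report does not state CONFIG_HZ, and
it relies on the kernel starting the jiffies counter 300 seconds before
32-bit wraparound (INITIAL_JIFFIES in include/linux/jiffies.h).  The
function name uptime_seconds is purely illustrative.

```python
# Assumption: HZ=1000.  CONFIG_HZ is not given in the report, so adjust as needed.
HZ = 1000

# The kernel initializes jiffies to (unsigned int)(-300 * HZ), that is,
# 300 seconds short of 32-bit wraparound (see INITIAL_JIFFIES).
INITIAL_JIFFIES = (-300 * HZ) & 0xFFFFFFFF


def uptime_seconds(jiffies, hz=HZ):
    """Convert a raw 64-bit jiffies reading to seconds since the counter started."""
    initial = (-300 * hz) & 0xFFFFFFFF
    return (jiffies - initial) / hz


if __name__ == "__main__":
    # Value taken from the "print jiffies" output in the gdb session.
    print(uptime_seconds(4294676330))  # prints 9.034
```

Under that assumption, the reading corresponds to roughly nine seconds
of ticks, which lines up with the ten-second wait before hitting
control-C.  With a different HZ the number changes accordingly.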

Is there anything I can do to workqueues, to printk, or to some
combination of the two to help debug this?  Or to anything else, for
that matter, as I am not feeling all that picky.  ;-)

							Thanx, Paul

[1] The dual-socket system is an 80-hardware-thread (20 cores per socket)
    system with model name Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz.
    I get the same results when using either of these compilers:
    gcc version 8.5.0 20210514 (Red Hat 8.5.0-21) (GCC)
    gcc version 11.4.1 20230605 (Red Hat 11.4.1-2) (GCC)

[2] My laptop is a 16-hardware-thread (8 cores) single-socket system with
    model name "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" and
    gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04).