Re: Kernel Oops on alpha with kernel version >=6.9.x

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Sat, 30 Nov 2024 20:31:13 -0800

On Sat, Nov 30, 2024 at 11:22:45PM +0100, Magnus Lindholm wrote:
> Hi,
> 
> 
> First some background:
> I've been trying to boot recent kernels on my alpha machines. Anything
> after linux-6.8.12 gives me trouble. After doing a kernel bisect, I
> found that commit 9187210eee7d87eea37b45ea93454a88681894a4
> (net-next-6.9) is where my troubles begin. The problem is that the
> boot process gets stuck when trying to set parameters for
> network interfaces. The bad commit does make a lot of updates to the
> network code.
> 
> When booting the system with kernel 6.12.0 I'm able to boot into
> single-user mode, but when starting system services one by one I
> trigger a kernel Oops when the network interface is renamed (see stack
> dump below). Looking at the changes made by the bad commit, it seems
> to (among other things) be replacing the locking mechanism (RCU
> instead of rtnl_lock). The stack dump from the kernel Oops suggests
> that something is happening in the RCU locking code. I'm no expert on
> RCU, but I read somewhere that it is done by volatile access on
> all systems other than DEC Alpha, where a memory barrier instruction
> is required. Does this mean that the change could affect the Alpha
> architecture differently? Inspecting the changes to networking code in
> the bad commit, particularly the changes made to net/core/dev.c, I put
> together the patch below. This patch reverts one of the lines changed
> in the "bad commit" for net/core/dev.c. After reverting the change on
> just this line, I'm able to boot kernel 6.12.0 on my Alpha ES40 to
> full multi-user mode again. I've tested this on an Alpha ES40 and a
> UP2000+, and the problem is 100% reproducible on both machines.
> 
> The patch might not be a real solution to the problem but could be a good
> place to start looking when figuring out what's really going on. The feedback
> I've gotten so far (forums and the netdev mailing list) is that the
> RCU implementation on alpha is probably where things go wrong.

Does booting with the "rcupdate.rcu_normal=1" kernel boot parameter
also suppress the problem?
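
For context, here is a rough userspace model of why that parameter is
relevant. It assumes the 6.12 behavior of synchronize_net(), which picks
an expedited grace period when RTNL is held (as it is in
dev_change_name()), and it assumes that rcupdate.rcu_normal=1 makes
expedited requests fall back to normal grace periods. The enum, the
model_* function names, and the boolean parameters are hypothetical
stand-ins for kernel state, not kernel APIs, and the model ignores
rcupdate.rcu_expedited.

```c
#include <stdbool.h>

/* Which kind of RCU grace period the caller ends up waiting for. */
enum gp_kind { GP_NORMAL, GP_EXPEDITED };

/*
 * Userspace model only, not kernel code.
 * rcu_normal stands in for the rcupdate.rcu_normal boot parameter.
 * Assumption: with rcu_normal set, an expedited request is served by
 * a normal grace period instead.
 */
static enum gp_kind model_synchronize_rcu_expedited(bool rcu_normal)
{
	return rcu_normal ? GP_NORMAL : GP_EXPEDITED;
}

/* Models synchronize_rcu() with rcupdate.rcu_expedited left unset. */
static enum gp_kind model_synchronize_rcu(void)
{
	return GP_NORMAL;
}

/*
 * Models synchronize_net(): expedited if RTNL is held, normal otherwise.
 * rtnl_held stands in for rtnl_is_locked().
 */
static enum gp_kind model_synchronize_net(bool rtnl_held, bool rcu_normal)
{
	if (rtnl_held)
		return model_synchronize_rcu_expedited(rcu_normal);
	return model_synchronize_rcu();
}
```

Under these assumptions, model_synchronize_net(true, false) takes the
expedited path, which would be consistent with the rcu_exp_gp_kthr task
in the trace, while model_synchronize_net(true, true) takes the same
normal path that the quoted patch forces by calling synchronize_rcu()
directly.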

That "pc =" down below is the program counter?  If so, I am at a loss
as to what RCU could do to make it be zero.

							Thanx, Paul

> ------------------------------------
> Patch to "fix" the problem:
> -----------------------------------
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 13d00fc10f55..26fda14367e5 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1261,7 +1261,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
> 
>         netdev_name_node_del(dev->name_node);
> 
> -       synchronize_net();
> +       synchronize_rcu();
> 
>         netdev_name_node_add(net, dev->name_node);
> 
> 
> --------------------------
> dmesg/kernel log:
> -------------------------
> 
> [   93.431592] tulip 0000:01:02.0 enp1s2: renamed from eth0
> 
> [   93.436475] Unable to handle kernel paging request at virtual address 0000000000000000
> [   93.436475] CPU 1
> [   93.436475] rcu_exp_gp_kthr(17): Oops -1
> [   93.436475] pc = [<0000000000000000>]  ra = [<0000000000000000>]  ps = 0000    Not tainted
> [   93.436475] pc is at 0x0
> [   93.436475] ra is at 0x0
> [   93.436475] v0 = 0000000000000007  t0 = fffffc0000e62440  t1 = 0000000000000001
> [   93.436475] t2 = 0000000000000000  t3 = 0000000000000001  t4 = 0000000000000001
> [   93.436475] t5 = 0000000000000001  t6 = 0000000000000001  t7 = fffffc0003138000
> [   93.436475] s0 = fffffc0000e62440  s1 = fffffc0000ec3a10  s2 = fffffc0000ec3a10
> [   93.436475] s3 = fffffc0000ec3a10  s4 = fffffc00003a90f0  s5 = fffffc0000e62440
> [   93.436475] s6 = 0000000000000000
> [   93.436475] a0 = 0000000000000000  a1 = 0000000000000000  a2 = 0000000000000000
> [   93.436475] a3 = 0000000000000000  a4 = 0000000000000001  a5 = fffffc0000517744
> [   93.436475] t8 = 0000000000000001  t9 = 0000000000000001  t10= fffffc0000e3d320
> [   93.436475] t11= fffffc0000220240  pv = fffffc0000b73210  at = 0000000000000000
> [   93.436475] gp = fffffc0000eb3a10  sp = 00000000ea2ea184
> [   93.436475] Disabling lock debugging due to kernel taint
> [   93.436475] Trace:
> [   93.436475] [<fffffc00003aee60>] wait_rcu_exp_gp+0x30/0xa0
> [   93.436475] [<fffffc0000b6c200>] __cond_resched+0x30/0x90
> [   93.436475] [<fffffc00003569b8>] kthread_worker_fn+0xc8/0x1f0
> [   93.436475] [<fffffc000035863c>] kthread+0x17c/0x1c0
> [   93.436475] [<fffffc00003568f0>] kthread_worker_fn+0x0/0x1f0
> [   93.436475] [<fffffc0000311128>] ret_from_kernel_thread+0x18/0x20
> 
> [   93.436475] Code:
> [   93.436475]  00000000
> [   93.436475]  00000000
> [   93.436475]  00063301
> [   93.436475]  0000077c
> [   93.436475]  00001111
> [   93.436475]  000022a2