On Tue Mar 5, 2024 at 5:57 AM UTC, Max Boone wrote: > On Tue Mar 5, 2024 at 12:32 AM UTC, Joel Fernandes wrote: > > Have you tried to reproduce the issue with CONFIG_RSEQ=n and see if it happens? > > Will build a new kernel today with that flag, and report back. With CONFIG_RSEQ=n the stalls happen a lot less often, the system is way more workable and when it does freeze up it recovered on its own once, and I was able to get the full kernel messages for this: ``` [ 675.812339] rcu: INFO: rcu_sched self-detected stall on CPU [ 675.814587] rcu: 3-....: (14893 ticks this GP) idle=762c/1/0x4000000000000000 softirq=6920/6920 fqs=6610 [ 675.815606] rcu: (t=15001 jiffies g=50497 q=1304 ncpus=8) [ 675.816520] CPU: 3 PID: 232 Comm: snapfuse Not tainted 6.7.7-WSL2-STABLE+ #2 [ 675.816550] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 675.816553] pc : __arch_copy_to_user+0x1a0/0x240 [ 675.817689] lr : _copy_to_iter+0xf0/0x560 [ 675.818069] sp : ffff800082ceba80 [ 675.818070] x29: ffff800082cebac0 x28: 0000000001b2c000 x27: 0000000000000005 [ 675.818074] x26: 0000000000000000 x25: ffff00004c491000 x24: 0000000000000000 [ 675.818076] x23: 0000000000001000 x22: 0000040000000000 x21: ffff800082cebd30 [ 675.818079] x20: ffff800082cebd30 x19: 0000000000001000 x18: 0000000000000000 [ 675.818081] x17: 0000000000000000 x16: 0000000000000000 x15: ffff00004c491000 [ 675.818083] x14: 9887db4ae914c054 x13: 6bcd444ce14effe5 x12: 0b22b481c6001041 [ 675.818086] x11: 7513c0250d7df247 x10: b85affa4063b12c7 x9 : 368beb85bc648557 [ 675.818088] x8 : 217c88df9795370e x7 : a16d77942052b4ab x6 : 0000aaf844516fff [ 675.818090] x5 : 0000aaf844517e2f x4 : 0000000000000000 x3 : 0000000000003daf [ 675.818092] x2 : 0000000000000dc0 x1 : ffff00004c491210 x0 : 0000aaf844516e2f [ 675.818096] Call trace: [ 675.818143] __arch_copy_to_user+0x1a0/0x240 [ 675.818147] copy_page_to_iter+0xbc/0x140 [ 675.818150] filemap_read+0x1b0/0x398 [ 675.818427] generic_file_read_iter+0x48/0x168 [ 675.818429] ext4_file_read_iter+0x58/0x288 [ 675.818681] vfs_read+0x1e8/0x280 [ 675.818804] ksys_pread64+0x90/0xf0 [ 675.818806] __arm64_sys_pread64+0x24/0x48 [ 675.818807] invoke_syscall.constprop.0+0x54/0x128 [ 675.818912] do_el0_svc+0x44/0xf0 [ 675.818914] el0_svc+0x24/0xb0 [ 675.819041] el0t_64_sync_handler+0x138/0x148 [ 675.819043] el0t_64_sync+0x14c/0x150 [ 681.501178] block sda: the capability attribute has been deprecated. [ 741.700330] rcu: INFO: rcu_sched self-detected stall on CPU [ 741.701707] rcu: 4-....: (14940 ticks this GP) idle=8074/1/0x4000000000000000 softirq=13021/13037 fqs=6400 [ 741.703152] rcu: (t=15001 jiffies g=50713 q=5093 ncpus=8) [ 741.704017] CPU: 4 PID: 194 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #2 [ 741.704047] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 741.704050] pc : __arch_copy_to_user+0x190/0x240 [ 741.704424] lr : _copy_to_iter+0xf0/0x560 [ 741.704565] sp : ffff800082cfb870 [ 741.704566] x29: ffff800082cfb8b0 x28: ffff00000fe96b18 x27: 0000000000000085 [ 741.704569] x26: 0000000000000000 x25: ffff0000061e1c00 x24: 0000000000000000 [ 741.704571] x23: 0000000000000085 x22: ffff000117d2a600 x21: ffff800082cfbd90 [ 741.704574] x20: ffff0000061e1c00 x19: 0000000000000085 x18: 0000000000000000 [ 741.704608] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000061e1c00 [ 741.704610] x14: 62616c6961766120 x13: 7365746164707520 x12: 6f6e207361682070 [ 741.704612] x11: 616e73203a687365 x10: 7266657220746f6e x9 : 6e6163203a313937 [ 741.704614] x8 : 3a6f672e73726570 x7 : 6c656865726f7473 x6 : 0000ab58c96fd6f0 [ 741.704617] x5 : 0000ab58c96fd775 x4 : 0000000000000000 x3 : 0000000000000000 [ 741.704619] x2 : 0000000000000005 x1 : ffff0000061e1c40 x0 : 0000ab58c96fd6f0 [ 741.704621] Call trace: [ 741.704647] __arch_copy_to_user+0x190/0x240 [ 741.704651] simple_copy_to_iter+0x48/0x98 [ 741.704939] __skb_datagram_iter+0x7c/0x280 [ 741.704941] skb_copy_datagram_iter+0x48/0xc8 [ 741.704943] unix_stream_read_actor+0x30/0x68 [ 741.705137] unix_stream_read_generic+0x304/0xb70 [ 741.705139] unix_stream_recvmsg+0xc0/0xd0 [ 741.705140] sock_recvmsg+0x88/0x108 [ 741.705170] ____sys_recvmsg+0x78/0x198 [ 741.705171] ___sys_recvmsg+0x80/0xf0 [ 741.705173] __sys_recvmsg+0x5c/0xd0 [ 741.705175] __arm64_sys_recvmsg+0x28/0x50 [ 741.705177] invoke_syscall.constprop.0+0x54/0x128 [ 741.705316] do_el0_svc+0xcc/0xf0 [ 741.705317] el0_svc+0x24/0xb0 [ 741.705369] el0t_64_sync_handler+0x138/0x148 [ 741.705371] el0t_64_sync+0x14c/0x150 [ 743.232431] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 4-.... } 15347 jiffies s: 721 root: 0x10/. [ 743.234297] rcu: blocking rcu_node structures (internal RCU debug): [ 743.235477] Sending NMI from CPU 1 to CPUs 4: [ 743.235491] NMI backtrace for cpu 4 [ 743.235531] CPU: 4 PID: 194 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #2 [ 743.235535] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 743.235537] pc : __arch_copy_to_user+0x190/0x240 [ 743.235598] lr : _copy_to_iter+0xf0/0x560 [ 743.235603] sp : ffff800082cfb870 [ 743.235604] x29: ffff800082cfb8b0 x28: ffff00000fe96b18 x27: 0000000000000085 [ 743.235607] x26: 0000000000000000 x25: ffff0000061e1c00 x24: 0000000000000000 [ 743.235610] x23: 0000000000000085 x22: ffff000117d2a600 x21: ffff800082cfbd90 [ 743.235612] x20: ffff0000061e1c00 x19: 0000000000000085 x18: 0000000000000000 [ 743.235614] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000061e1c00 [ 743.235617] x14: 62616c6961766120 x13: 7365746164707520 x12: 6f6e207361682070 [ 743.235619] x11: 616e73203a687365 x10: 7266657220746f6e x9 : 6e6163203a313937 [ 743.235621] x8 : 3a6f672e73726570 x7 : 6c656865726f7473 x6 : 0000ab58c96fd6f0 [ 743.235623] x5 : 0000ab58c96fd775 x4 : 0000000000000000 x3 : 0000000000000000 [ 743.235626] x2 : 0000000000000005 x1 : ffff0000061e1c40 x0 : 0000ab58c96fd6f0 [ 743.235628] Call trace: [ 743.235630] __arch_copy_to_user+0x190/0x240 [ 743.235632] simple_copy_to_iter+0x48/0x98 [ 743.235636] __skb_datagram_iter+0x7c/0x280 [ 743.235639] skb_copy_datagram_iter+0x48/0xc8 [ 743.235641] unix_stream_read_actor+0x30/0x68 [ 743.235644] unix_stream_read_generic+0x304/0xb70 [ 743.235646] unix_stream_recvmsg+0xc0/0xd0 [ 743.235647] sock_recvmsg+0x88/0x108 [ 743.235650] ____sys_recvmsg+0x78/0x198 [ 743.235651] ___sys_recvmsg+0x80/0xf0 [ 743.235653] __sys_recvmsg+0x5c/0xd0 [ 743.235655] __arm64_sys_recvmsg+0x28/0x50 [ 743.235657] invoke_syscall.constprop.0+0x54/0x128 [ 743.235661] do_el0_svc+0xcc/0xf0 [ 743.235663] el0_svc+0x24/0xb0 [ 743.235667] el0t_64_sync_handler+0x138/0x148 [ 743.235668] el0t_64_sync+0x14c/0x150 ``` Another time it didn't recover, and I was only able to get whatever was printed to console: ``` [ 1559.425979] rcu: 7-....: (14977 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263 [ 1559.431367] rcu: (t=15002 jiffies g=67965 q=36939 ncpus=8) [ 1559.432083] rcu: rcu_sched kthread starved for 2866 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0 [ 1559.433645] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. [ 1559.434604] rcu: RCU grace-period kthread stack dump: [ 1559.435549] rcu: Stack dump where RCU GP kthread last ran: [ 1739.451891] rcu: INFO: rcu_sched self-detected stall on CPU [ 1739.452511] rcu: 7-....: (59616 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263 [ 1739.453498] rcu: (t=60008 jiffies g=67965 q=36939 ncpus=8) [ 1739.454053] rcu: rcu_sched kthread starved for 47871 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0 [ 1739.455135] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. [ 1739.456110] rcu: RCU grace-period kthread stack dump: [ 1739.456687] rcu: Stack dump where RCU GP kthread last ran: [ 1919.467822] rcu: INFO: rcu_sched self-detected stall on CPU [ 1919.468776] rcu: 7-....: (104010 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263 [ 1919.470405] rcu: (t=105012 jiffies g=67965 q=36941 ncpus=8) [ 1919.472599] rcu: rcu_sched kthread starved for 92875 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0 [ 1919.474013] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. [ 1919.474977] rcu: RCU grace-period kthread stack dump: [ 1919.475641] rcu: Stack dump where RCU GP kthread last ran: ``` Unfortunately the traces are not provided there.