On Mon, Jun 27, 2022 at 03:22:24PM -0400, Alex Xu (Hello71) wrote: > Hi, > > Since Linux 5.19-ish, I consistently get these types of errors when > resuming from S3: > > [15652.909157] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 11-... } 7 jiffies s: 9981 root: 0x800/. > [15652.909162] rcu: blocking rcu_node structures (internal RCU debug): > [15652.909163] Task dump for CPU 11: > [15652.909164] task:kworker/u24:65 state:R running task stack: 0 pid:210218 ppid: 2 flags:0x00004008 > [15652.909167] Workqueue: events_unbound async_run_entry_fn > [15652.909172] Call Trace: > [15652.909173] <TASK> > [15652.909174] ? atom_get_src_int+0x38e/0x680 > [15652.909179] ? atom_op_test+0x67/0x190 > [15652.909181] ? amdgpu_atom_execute_table_locked+0x19a/0x300 > [15652.909184] ? atom_op_calltable+0xb1/0x110 > [15652.909186] ? amdgpu_atom_execute_table_locked+0x19a/0x300 > [15652.909189] ? atom_op_calltable+0xb1/0x110 > [15652.909191] ? amdgpu_atom_execute_table_locked+0x19a/0x300 > [15652.909193] ? __switch_to+0x137/0x440 > [15652.909195] ? amdgpu_atom_asic_init+0xe0/0x100 > [15652.909198] ? pci_bus_read_config_dword+0x36/0x50 > [15652.909201] ? amdgpu_device_resume+0x10b/0x3e0 > [15652.909203] ? amdgpu_pmops_resume+0x32/0x60 > [15652.909204] ? pci_pm_suspend+0x2b0/0x2b0 > [15652.909206] ? dpm_run_callback+0x35/0x1f0 > [15652.909209] ? device_resume+0x1ca/0x220 > [15652.909211] ? async_resume+0x19/0xe0 > [15652.909213] ? async_run_entry_fn+0x33/0x120 > [15652.909215] ? process_one_work+0x1d6/0x350 > [15652.909218] ? worker_thread+0x24d/0x480 > [15652.909220] ? kthread+0x137/0x150 > [15652.909221] ? worker_clr_flags+0x40/0x40 > [15652.909224] ? kthread_blkcg+0x30/0x30 > [15652.909226] ? ret_from_fork+0x22/0x30 > [15652.909227] </TASK> > [15653.015808] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 11-... } 7 jiffies s: 9985 root: 0x800/. > [15653.015812] rcu: blocking rcu_node structures (internal RCU debug): > [15653.015813] Task dump for CPU 11: > [15653.015813] task:kworker/u24:65 state:R running task stack: 0 pid:210218 ppid: 2 flags:0x00004008 > [15653.015816] Workqueue: events_unbound async_run_entry_fn > [15653.015820] Call Trace: > [15653.015820] <TASK> > [15653.015821] ? amdgpu_cgs_read_register+0x10/0x10 > [15653.015825] ? smu7_copy_bytes_to_smc+0xd4/0x200 > [15653.015828] ? polaris10_program_memory_timing_parameters+0x195/0x1b0 > [15653.015831] ? sysvec_apic_timer_interrupt+0xa/0x80 > [15653.015834] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > [15653.015836] ? amdgpu_cgs_destroy_device+0x10/0x10 > [15653.015839] ? sysvec_apic_timer_interrupt+0xa/0x80 > [15653.015841] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > [15653.015843] ? amdgpu_cgs_destroy_device+0x10/0x10 > [15653.015846] ? amdgpu_device_rreg+0x8f/0xd0 > [15653.015847] ? phm_wait_for_register_unequal+0x99/0xd0 > [15653.015850] ? smu7_send_msg_to_smc+0x95/0x130 > [15653.015853] ? smum_send_msg_to_smc+0x5d/0xa0 > [15653.015854] ? amdgpu_cgs_read_ind_register+0xa0/0xa0 > [15653.015857] ? smu7_enable_dpm_tasks+0x241f/0x28c0 > [15653.015859] ? hwmgr_resume+0x31/0x70 > [15653.015861] ? amdgpu_device_resume+0x1fa/0x3e0 > [15653.015863] ? amdgpu_pmops_resume+0x32/0x60 > [15653.015864] ? pci_pm_suspend+0x2b0/0x2b0 > [15653.015866] ? dpm_run_callback+0x35/0x1f0 > [15653.015868] ? device_resume+0x1ca/0x220 > [15653.015870] ? async_resume+0x19/0xe0 > [15653.015872] ? async_run_entry_fn+0x33/0x120 > [15653.015874] ? process_one_work+0x1d6/0x350 > [15653.015877] ? worker_thread+0x24d/0x480 > [15653.015878] ? kthread+0x137/0x150 > [15653.015880] ? worker_clr_flags+0x40/0x40 > [15653.015882] ? kthread_blkcg+0x30/0x30 > [15653.015884] ? ret_from_fork+0x22/0x30 > [15653.015886] </TASK> > > I have not noticed any resulting problems. I am reporting this in the > hope that it is easy to fix the issue and remove the error messages > which may obscure some future problem. The usual way this happens is for a task to be spinning. In this case, that might be due to excessive lock contention on async_lock within the async_run_entry_fn() function. Or perhaps the problem is in one of the functions preceded by "?" above. One way to debug this is to place trace_printk()s or similar in the functions called out above to track it down. I bet that the reason this showed up in v5.19 is this guy: 28b3ae426598 ("rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT") So do you have CONFIG_RCU_EXP_CPU_STALL_TIMEOUT set to some small number of milliseconds? If so, you can override this by adjusting the CONFIG_RCU_EXP_CPU_STALL_TIMEOUT Kconfig option or by booting with a longer timeout via the rcupdate.rcu_exp_cpu_stall_timeout= kernel boot parameter. But if you are running on an Android platform, Uladzislau will be interested in addressing the underlying issue, so I have added him on CC. Thanx, Paul