[AMD Official Use Only - General] Hi Felix, Yes, will correct the description. Will add another patch to handle the return for sync wait. Emily Deng Best Wishes >-----Original Message----- >From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >Sent: Friday, October 20, 2023 12:05 AM >To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Subject: Re: [PATCH] drm/amdgpu: Add timeout for sync wait > >On 2023-10-19 05:31, Emily Deng wrote: >> Issue: Dead heappen during gpu recover >> >> [56433.829492] amdgpu 0000:04:00.0: amdgpu: GPU reset begin! >> [56550.499625] INFO: task kworker/u80:0:10 blocked for more than 120 >seconds. >> [56550.520215] Tainted: G OE 6.2.0-34-generic #34~22.04.1- >Ubuntu >> [56550.542883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >disables this message. >> [56550.566313] task:kworker/u80:0 state:D stack:0 pid:10 ppid:2 >flags:0x00004000 >> [56550.591318] Workqueue: kfd_restore_wq restore_process_worker >> [amdgpu] [56550.611391] Call Trace: >> [56550.618698] <TASK> >> [56550.624968] __schedule+0x2b7/0x5f0 [56550.635416] >> schedule+0x68/0x110 [56550.645090] schedule_timeout+0x151/0x160 >> [56550.657096] ? amdgpu_vm_bo_update+0x46e/0x660 [amdgpu] >> [56550.673245] dma_fence_default_wait+0x1a2/0x1e0 >> [56550.686818] ? __pfx_dma_fence_default_wait_cb+0x10/0x10 >> [56550.702728] dma_fence_wait_timeout+0x117/0x140 >> [56550.716301] amdgpu_sync_wait+0x62/0xa0 [amdgpu] [56550.730654] >> amdgpu_amdkfd_gpuvm_restore_process_bos+0x59e/0x770 [amdgpu] >> [56550.751668] ? newidle_balance+0x298/0x490 [56550.763936] >> restore_process_worker+0x42/0x270 [amdgpu] [56550.780183] >> process_one_work+0x21f/0x440 [56550.792193] worker_thread+0x50/0x3f0 >> [56550.803165] ? __pfx_worker_thread+0x10/0x10 [56550.815934] >> kthread+0xee/0x120 [56550.825342] ? __pfx_kthread+0x10/0x10 >> [56550.836548] ret_from_fork+0x2c/0x50 [56550.847262] </TASK> [ >> 1935.215502] Call Trace: >> [ 1935.222827] <TASK> >> [ 1935.229121] __schedule+0x23d/0x5a0 [ 1935.239583] >> schedule+0x4e/0xc0 [ 1935.248983] schedule_timeout+0x103/0x140 [ >> 1935.261002] __wait_for_common+0xae/0x150 [ 1935.273008] ? >> usleep_range_state+0x90/0x90 [ 1935.285546] >> wait_for_completion+0x24/0x30 [ 1935.297813] >> __flush_work.isra.0+0x175/0x280 [ 1935.310611] ? >> worker_detach_from_pool+0xc0/0xc0 [ 1935.324436] >> flush_delayed_work+0x31/0x50 [ 1935.336455] >> kfd_suspend_all_processes+0x96/0x150 [amdgpu] [ 1935.353429] >> kgd2kfd_suspend+0xb8/0xe0 [amdgpu] [ 1935.367469] >> kgd2kfd_pre_reset+0x81/0xf0 [amdgpu] [ 1935.382036] >> amdgpu_amdkfd_pre_reset+0x1a/0x30 [amdgpu] [ 1935.398156] >> amdgpu_device_gpu_recover.cold+0x210/0xcf2 [amdgpu] [ 1935.416722] >> amdgpu_job_timedout+0x19f/0x1e0 [amdgpu] [ 1935.432367] >> drm_sched_job_timedout+0x6f/0x120 [amd_sched] [ 1935.448792] >> process_one_work+0x22b/0x3d0 [ 1935.460806] >worker_thread+0x53/0x420 >> [ 1935.471777] ? process_one_work+0x3d0/0x3d0 [ 1935.484307] >> kthread+0x12a/0x150 [ 1935.493993] ? set_kthread_struct+0x50/0x50 [ >> 1935.506513] ret_from_fork+0x22/0x30 > >Looking at the time stamps, this seems to be a mash-up of two different logs. I >think you're trying to show how a restore_processes worker is stuck on a fence, >and that's causing kgd2kfd_pre_reset to hang when it tries to flush the work. > >The fence it's hanging on is probably something related to a page table update >that got caught up in the GPU hang. Adding a timeout here seems reasonable. >There may be another problem, because >amdgpu_amdkfd_gpuvm_restore_process_bos ignores the return value of >amdgpu_sync_wait. We shouldi probably handle the timeout gracefully with a >"goto validate_map_fail". > >Regards, > Felix > > >> >> It is because the amdgpu_sync_wait is waiting for the bad job's fence, >> and never return, so the recover couldn't continue. >> >> >> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 16 +++++++++++++--- >> 1 file changed, 13 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c >> index dcd8c066bc1f..c922867c5675 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c >> @@ -406,9 +406,19 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, >bool intr) >> int i, r; >> >> hash_for_each_safe(sync->fences, i, tmp, e, node) { >> - r = dma_fence_wait(e->fence, intr); >> - if (r) >> - return r; >> + struct drm_sched_fence *s_fence = to_drm_sched_fence(e- >>fence); >> + >> + if (s_fence) { >> + r = dma_fence_wait_timeout(e->fence, intr, s_fence- >>sched->timeout); >> + if (r == 0) >> + r = -ETIMEDOUT; >> + if (r < 0) >> + return r; >> + } else { >> + r = dma_fence_wait(e->fence, intr); >> + if (r) >> + return r; >> + } >> >> amdgpu_sync_entry_free(e); >> }