I think it's a neat approach, thanks ! /Monk -----Original Message----- From: Koenig, Christian Sent: Friday, May 10, 2019 8:07 PM To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Subject: Re: [PATCH] drm/amdgpu: avoid duplicated tmo report on same job Yeah, that's indeed a bit problematic. How about calling drm_sched_suspend_timeout() then? On the other hand just suppressing the logging is fine with me as well. Christian. Am 10.05.19 um 12:01 schrieb Liu, Monk: > Christian, > > Would your approach leave the global queue (which holds TDR work) stuck there and other work item never get handled ? > > /Monk > > -----Original Message----- > From: Koenig, Christian > Sent: Friday, May 10, 2019 4:58 PM > To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: Re: [PATCH] drm/amdgpu: avoid duplicated tmo report on same > job > > Hi Monk, > > yeah, that's much closer to what I had in mind. But your comments got me thinking more about this. > > What do you think about changing amdgpu_job_timedout() like this: > if (amdgpu_device_should_recover_gpu(ring->adev)) > amdgpu_device_gpu_recover(ring->adev, job); > + else > + dma_fence_wait(s_job->s_fence->parent); > > > This way we would block the timeout worker until we are either manually resetting using debugfs or unloading the driver. > > Regards, > Christian. > > Am 10.05.19 um 09:19 schrieb Liu, Monk: >> Hi Christian, >> >> What about this one, no timer logic change on scheduler part, only >> the duplicated tmo report is muted >> >> /Monk >> >> -----Original Message----- >> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of >> Monk Liu >> Sent: Friday, May 10, 2019 3:18 PM >> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> Cc: Liu, Monk <Monk.Liu@xxxxxxx> >> Subject: [PATCH] drm/amdgpu: avoid duplicated tmo report on same job >> >> [CAUTION: External Email] >> >> Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +-- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 +++++++++++- >> include/drm/gpu_scheduler.h | 1 + >> 3 files changed, 13 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> index d6286ed..f1dc0ba 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> @@ -3356,8 +3356,7 @@ bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev) >> return true; >> >> disabled: >> - DRM_INFO("GPU recovery disabled.\n"); >> - return false; >> + return false; >> } >> >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> index 1397942..ca62253 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> @@ -33,6 +33,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job) >> struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched); >> struct amdgpu_job *job = to_amdgpu_job(s_job); >> struct amdgpu_task_info ti; >> + bool recover; >> >> memset(&ti, 0, sizeof(struct amdgpu_task_info)); >> >> @@ -42,6 +43,11 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job) >> return; >> } >> >> + recover = amdgpu_device_should_recover_gpu(ring->adev); >> + if (s_job->sched->last_tmo_id == s_job->id) >> + goto skip_report; >> + >> + s_job->sched->last_tmo_id = s_job->id; >> amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti); >> DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n", >> job->base.sched->name, >> atomic_read(&ring->fence_drv.last_seq), >> @@ -49,7 +55,11 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job) >> DRM_ERROR("Process information: process %s pid %d thread %s pid %d\n", >> ti.process_name, ti.tgid, ti.task_name, ti.pid); >> >> - if (amdgpu_device_should_recover_gpu(ring->adev)) >> + if (!recover) >> + DRM_INFO("GPU recovery disabled.\n"); >> + >> +skip_report: >> + if (recover) >> amdgpu_device_gpu_recover(ring->adev, job); } >> >> diff --git a/include/drm/gpu_scheduler.h >> b/include/drm/gpu_scheduler.h index 9c2a957..1944d27 100644 >> --- a/include/drm/gpu_scheduler.h >> +++ b/include/drm/gpu_scheduler.h >> @@ -282,6 +282,7 @@ struct drm_gpu_scheduler { >> int hang_limit; >> atomic_t num_jobs; >> bool ready; >> + uint64_t last_tmo_id; >> }; >> >> int drm_sched_init(struct drm_gpu_scheduler *sched, >> -- >> 2.7.4 >> >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx