When two rings hit a timeout at the same time, each one triggers
job_timedout separately. Each job_timedout calls gpu_recover, but one
gpu_recover call blocks on the mutex_lock held by the other.

The bad job's callback should be removed by dma_fence_remove_callback,
but that call sits behind the mutex_lock, so it cannot run immediately.
In the meantime the callback drm_sched_process_job fires unexpectedly
and sets DMA_FENCE_FLAG_SIGNALED_BIT.

After the other caller's mutex_unlock, the already-signaled bad job goes
through job_run inside drm_sched_job_recovery. job_run then hits a
WARN_ON and prints a call trace when it calls kcl_dma_fence_set_error
on the signaled bad job.

Clear DMA_FENCE_FLAG_SIGNALED_BIT before setting the error in that path.

Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
Signed-off-by: Wentao Lou <Wentao.Lou@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 0a17fb1..fc1d3a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 
 	trace_amdgpu_sched_run_job(job);
 
-	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter))
+	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter)) {
+		/* flags might be signaled by unexpected callback, clear it */
+		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
 		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
+	}
 
 	if (finished->error < 0) {
 		DRM_INFO("Skip scheduling IBs!\n");
-- 
2.7.4
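
The sequence in the commit message can be replayed in a small user-space model. This is only a sketch and not kernel code: struct model_fence, MODEL_SIGNALED_MASK, and every model_* function are hypothetical stand-ins built on C11 atomics. It assumes, as the description says, that setting an error on an already-signaled fence trips a WARN_ON, which the model counts in a "warnings" field instead of printing a call trace.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for DMA_FENCE_FLAG_SIGNALED_BIT (model only). */
#define MODEL_SIGNALED_MASK 1UL

/* Hypothetical user-space stand-in for the kernel's struct dma_fence. */
struct model_fence {
	atomic_ulong flags;
	int error;
	int warnings;	/* counts would-be WARN_ON hits */
};

/* Models the stray drm_sched_process_job callback marking the fence signaled. */
static void model_fence_signal(struct model_fence *f)
{
	atomic_fetch_or(&f->flags, MODEL_SIGNALED_MASK);
}

/* Models test_and_clear_bit(): atomically clear the bit, return its old value. */
static bool model_test_and_clear_signaled(struct model_fence *f)
{
	unsigned long old = atomic_fetch_and(&f->flags, ~MODEL_SIGNALED_MASK);

	return (old & MODEL_SIGNALED_MASK) != 0;
}

/* Models dma_fence_set_error(): warn if the fence is already signaled. */
static void model_fence_set_error(struct model_fence *f, int error)
{
	if (atomic_load(&f->flags) & MODEL_SIGNALED_MASK)
		f->warnings++;
	f->error = error;
}

/*
 * Replays the recovery path for one bad job and returns how many
 * warnings were recorded. clear_first selects the patched behavior.
 */
static int replay_recovery(bool clear_first)
{
	struct model_fence f;

	atomic_init(&f.flags, 0);
	f.error = 0;
	f.warnings = 0;

	/* Callback fires while the second gpu_recover waits on the mutex. */
	model_fence_signal(&f);

	/* Patched path: drop the stale signaled bit before setting the error. */
	if (clear_first)
		model_test_and_clear_signaled(&f);

	model_fence_set_error(&f, -ECANCELED);
	assert(f.error == -ECANCELED);

	return f.warnings;
}
```

In this model, replay_recovery(false) records one warning and replay_recovery(true) records none, which mirrors the intent of the hunk above. The model does not cover the mutex contention itself or any waiters that already observed the signaled state.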