On 4/16/19 5:47 AM, Christian König wrote:
> On 15.04.19 at 23:17, Eric Anholt wrote:
>> Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> writes:
>>
>>> From: Christian König <christian.koenig@xxxxxxx>
>>>
>>> We now destroy finished jobs from the worker thread to make sure that
>>> we never destroy a job currently in timeout processing.
>>> This avoids holding the lock around the ring mirror list in
>>> drm_sched_stop, which should solve a deadlock reported by a user.
>>>
>>> v2: Remove unused variable.
>>>
>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692
>>>
>>> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  17 ++--
>>>   drivers/gpu/drm/etnaviv/etnaviv_dump.c     |   4 -
>>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   9 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c     | 138 +++++++++++++++++------------
>>>   drivers/gpu/drm/v3d/v3d_sched.c            |   9 +-
>>
>> Missing corresponding panfrost and lima updates.  You should probably
>> pull in drm-misc for hacking on the scheduler.
>>
>>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
>>> index ce7c737b..8efb091 100644
>>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>>> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>>>  	/* block scheduler */
>>>  	for (q = 0; q < V3D_MAX_QUEUES; q++)
>>> -		drm_sched_stop(&v3d->queue[q].sched);
>>> +		drm_sched_stop(&v3d->queue[q].sched, sched_job);
>>>  	if(sched_job)
>>>  		drm_sched_increase_karma(sched_job);
>>> +	/*
>>> +	 * Guilty job did complete and hence needs to be manually removed
>>> +	 * See drm_sched_stop doc.
>>> +	 */
>>> +	if (list_empty(&sched_job->node))
>>> +		sched_job->sched->ops->free_job(sched_job);
>>
>> If the if (sched_job) is necessary up above, then this should clearly be
>> under it.
>>
>> But can we please have a core scheduler helper we call here instead of
>> every driver replicating this?
>
> Yeah, that's also something I noted before.
>
> The essential problem is that we remove finished jobs from the mirror
> list and so need to destroy them, because we would otherwise leak them.
>
> An alternative approach here would be to keep the jobs on the ring
> mirror list, but not submit them again.
>
> Regards,
> Christian.

I would really prefer to avoid this; it means adding an extra flag to
sched_job that has to be checked on each iteration over the ring mirror
list.

What about changing the signature of drm_sched_backend_ops.timedout_job
to return struct drm_sched_job * instead of void? That way we can hand
the guilty job back from the driver-specific handler to the generic
drm_sched_job_timedout and release it there.

Andrey

>
>>
>>> +
>>>  	/* get the GPU back into the init state */
>>>  	v3d_reset(v3d);
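
To make Eric's ask concrete, here is a minimal sketch of what such a
core scheduler helper could look like. The name
drm_sched_cleanup_guilty_job() is invented for illustration; only
drm_sched_stop(), list_empty(&sched_job->node) and ops->free_job() come
from the patch under review.

    #include <linux/list.h>
    #include <drm/gpu_scheduler.h>

    void drm_sched_cleanup_guilty_job(struct drm_sched_job *sched_job)
    {
            if (!sched_job)
                    return;

            /*
             * If the guilty job already completed, drm_sched_stop() took
             * it off the ring mirror list, so the scheduler worker will
             * never free it; release it here instead.
             */
            if (list_empty(&sched_job->node))
                    sched_job->sched->ops->free_job(sched_job);
    }

v3d_gpu_reset_for_timeout() would then make a single call to this helper
after drm_sched_increase_karma(), and the NULL check also covers Eric's
point about the cleanup belonging under the if (sched_job).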
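
For contrast, a rough sketch of Christian's alternative of leaving
finished jobs on the ring mirror list and skipping them on resubmit.
The function name is invented; the dma_fence_is_signaled() test on the
job's finished fence stands in for the per-iteration check (or extra
job flag) that Andrey wants to avoid.

    #include <linux/dma-fence.h>
    #include <linux/list.h>
    #include <drm/gpu_scheduler.h>

    static void drm_sched_resubmit_skip_finished(struct drm_gpu_scheduler *sched)
    {
            struct drm_sched_job *s_job;
            struct dma_fence *fence;

            list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
                    /* Completed jobs stay on the list but must not run again. */
                    if (dma_fence_is_signaled(&s_job->s_fence->finished))
                            continue;

                    fence = sched->ops->run_job(s_job);
                    if (fence)
                            s_job->s_fence->parent = dma_fence_get(fence);
            }
    }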
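
Finally, a sketch of the signature change Andrey proposes, purely
illustrative since no such patch exists yet: timedout_job would return
the guilty job (or NULL), and the generic timeout worker would release
it in one place instead of every driver doing so.

    #include <linux/workqueue.h>
    #include <drm/gpu_scheduler.h>

    /* was: void (*timedout_job)(struct drm_sched_job *sched_job); */
    /* now: struct drm_sched_job *(*timedout_job)(struct drm_sched_job *sched_job); */

    static void drm_sched_job_timedout(struct work_struct *work)
    {
            struct drm_gpu_scheduler *sched =
                    container_of(work, struct drm_gpu_scheduler, work_tdr.work);
            struct drm_sched_job *job, *guilty;

            job = list_first_entry_or_null(&sched->ring_mirror_list,
                                           struct drm_sched_job, node);
            if (!job)
                    return;

            guilty = job->sched->ops->timedout_job(job);

            /*
             * A guilty job that already completed was taken off the ring
             * mirror list by drm_sched_stop(), so free it here instead of
             * in every driver's timeout handler.
             */
            if (guilty && list_empty(&guilty->node))
                    guilty->sched->ops->free_job(guilty);
    }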