On 14/09/2022 17:43, Arvind Yadav wrote:
> Using the parent fence instead of the finished fence
> to get the job status. This change is to avoid GPU
> scheduler timeout error which can cause GPU reset.

I'm able to reproduce crashes on Panfrost and I believe this commit is
the cause. Specifically it's possible for job->s_fence->parent to be
NULL.

The underlying issue seems to involve drm_sched_resubmit_jobs_ext() - if
the run_job() callback returns an error it will set s_fence->parent to
NULL after signalling s_fence->finished:

> fence = sched->ops->run_job(s_job);
> i++;
>
> if (IS_ERR_OR_NULL(fence)) {
> 	if (IS_ERR(fence))
> 		dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>
> 	s_job->s_fence->parent = NULL;

I don't understand the reasoning behind this change, but it doesn't seem
right to be using the parent fence when we have code which can be
setting that pointer to NULL.

Since I don't understand the reasoning, my only suggestion is to revert
this patch (and potentially the dependent patch "dma-buf: Check status
of enable-signaling bit on debug"?). Can anyone suggest a better fix?
Thanks,
Steve

> Signed-off-by: Arvind Yadav <Arvind.Yadav@xxxxxxx>
> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
> ---
>
> changes in v1,v2 - Enable signaling for finished fence in sche_main()
> is removed
>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index e0ab14e0fb6b..2ac28ad11432 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>  	job = list_first_entry_or_null(&sched->pending_list,
>  				       struct drm_sched_job, list);
>
> -	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
> +	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
>  		/* remove job from pending_list */
>  		list_del_init(&job->list);
>
> @@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>
>  	if (next) {
>  		next->s_fence->scheduled.timestamp =
> -			job->s_fence->finished.timestamp;
> +			job->s_fence->parent->timestamp;
>  		/* start TO timer for next job */
>  		drm_sched_start_timeout(sched);
>  	}