Hi Christian,

I am a bit confused by this patch. It would help if you could explain what these three functions are supposed to do. This is my current understanding:

1. drm_sched_job_timedout: called when a job has been executing for more than "timeout" seconds. It needs to identify the bad job and call the driver's timedout_job callback, which notifies the driver of that job so the GPU can be recovered to a working state.

2. drm_sched_hw_job_reset: stop the scheduler if it contains the bad job. At the moment this function first removes all the callbacks and only afterwards walks the ring_mirror_list to find the bad job. I think it should do this in the opposite order, or my understanding of the function is not fully correct.

3. drm_sched_job_recovery: recover jobs after a reset, i.e. resubmit the jobs on the hardware queue while skipping the guilty job.

Some other questions and suggestions:

1. We could avoid the race condition altogether by shifting the canceling and rescheduling of the timedout work item into drm_sched_process_job() (a rough sketch of what I mean is at the end of this mail).

2. How does removing and re-adding the callbacks help with the race condition? Moreover, the hardware might complete some jobs while we are in this function, leading to more races.

3. In what order are the above three functions supposed to be called during a recovery (my current guess is in the second sketch at the end of this mail)? I think the answer to this question would clear up a lot of my doubts.

Regards,
Nayan

On Mon, Oct 8, 2018 at 8:36 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> We need to make sure that we don't race between job completion and
> timeout.
>
> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index bd7d11c47202..ad3c57c9fd21 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -248,14 +248,40 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>  static void drm_sched_job_timedout(struct work_struct *work)
>  {
>          struct drm_gpu_scheduler *sched;
> +        struct drm_sched_fence *fence;
>          struct drm_sched_job *job;
> +        int r;
>
>          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> +
> +        spin_lock(&sched->job_list_lock);
> +        list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
> +                fence = job->s_fence;
> +                if (!dma_fence_remove_callback(fence->parent, &fence->cb))
> +                        goto already_signaled;
> +        }
> +
>          job = list_first_entry_or_null(&sched->ring_mirror_list,
>                                         struct drm_sched_job, node);
> +        spin_unlock(&sched->job_list_lock);
>
>          if (job)
> -                job->sched->ops->timedout_job(job);
> +                sched->ops->timedout_job(job);
> +
> +        spin_lock(&sched->job_list_lock);
> +        list_for_each_entry(job, &sched->ring_mirror_list, node) {
> +                fence = job->s_fence;
> +                if (!fence->parent || !list_empty(&fence->cb.node))
> +                        continue;
> +
> +                r = dma_fence_add_callback(fence->parent, &fence->cb,
> +                                           drm_sched_process_job);
> +                if (r)
> +already_signaled:
> +                        drm_sched_process_job(fence->parent, &fence->cb);
> +
> +        }
> +        spin_unlock(&sched->job_list_lock);
>  }
>
>  /**
> --
> 2.14.1
>
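P.S.: To make suggestion 1 above a bit more concrete, here is a rough, untested sketch of what I had in mind for drm_sched_process_job() in sched_main.c. The existing completion handling is abbreviated to two lines, locking against drm_sched_job_finish() is ignored, and the list_empty() check pretends the finished job has already left the ring_mirror_list (which really happens later in drm_sched_job_finish()); I am only reusing fields that already appear in your patch (work_tdr, timeout, ring_mirror_list):

static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb)
{
        struct drm_sched_fence *s_fence =
                container_of(cb, struct drm_sched_fence, cb);
        struct drm_gpu_scheduler *sched = s_fence->sched;

        /* ... existing completion handling (hw_rq_count accounting,
         * signaling the finished fence, waking the worker) stays as it
         * is today ...
         */
        drm_sched_fence_finished(s_fence);
        wake_up_interruptible(&sched->wake_up_worker);

        /* A job just completed, so the current timeout no longer applies
         * to it.  Cancel the pending TDR work and, if there are still
         * jobs in flight, restart the full timeout for the next one.
         */
        cancel_delayed_work(&sched->work_tdr);
        if (!list_empty(&sched->ring_mirror_list))
                schedule_delayed_work(&sched->work_tdr, sched->timeout);
}

The point is only that the completion path already knows exactly when a job finished, so it seems like the natural place to cancel or re-arm the TDR work, instead of detaching and re-attaching the fence callbacks from the timeout handler.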
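And to restate question 3 as code: is something like the following the intended ordering? This is a purely hypothetical driver-side recovery path, my_driver_gpu_recover() and my_driver_reset_hw() are made-up names, and details such as parking the scheduler thread are left out on purpose:

#include <drm/gpu_scheduler.h>

/* Made-up driver hook, just so step 2 has something to call. */
static void my_driver_reset_hw(void);

static void my_driver_gpu_recover(struct drm_gpu_scheduler *sched,
                                  struct drm_sched_job *bad_job)
{
        /* 1. Stop the scheduler from reacting to hardware fences (the
         *    callbacks are removed) and let it mark the guilty context,
         *    so completions cannot race with the reset that follows.
         */
        drm_sched_hw_job_reset(sched, bad_job);

        /* 2. Reset the hardware itself (driver specific). */
        my_driver_reset_hw();

        /* 3. Re-submit everything that was still on the ring, skipping
         *    the guilty job, and let the scheduler continue.
         */
        drm_sched_job_recovery(sched);
}

If that is roughly the intended flow, it would already answer most of my questions above.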