On Wed, Sep 13, 2023 at 10:56:10PM -0400, Luben Tuikov wrote:
> On 2023-09-11 22:16, Matthew Brost wrote:
> > If the TDR is set to a value, it can fire before a job is submitted in
> > drm_sched_main. The job should always be submitted before the TDR
> > fires, fix this ordering.
> >
> > v2:
> >   - Add to pending list before run_job, start TDR after (Luben, Boris)
> >
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index c627d3e6494a..9dbfab7be2c6 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -498,7 +498,6 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> >
> >  	spin_lock(&sched->job_list_lock);
> >  	list_add_tail(&s_job->list, &sched->pending_list);
> > -	drm_sched_start_timeout(sched);
> >  	spin_unlock(&sched->job_list_lock);
> >  }
> >
> > @@ -1234,6 +1233,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
> >  	fence = sched->ops->run_job(sched_job);
> >  	complete_all(&entity->entity_idle);
> >  	drm_sched_fence_scheduled(s_fence, fence);
> > +	drm_sched_start_timeout_unlocked(sched);
> >
> >  	if (!IS_ERR_OR_NULL(fence)) {
> >  		/* Drop for original kref_init of the fence */
>
> So, sched->ops->run_job() is a "job inflection point" from the point of view of
> the DRM scheduler. After that call, DRM has relinquished control of the job to the
> firmware/hardware.
>
> Putting the job in the pending list, before submitting it down to the firmware/hardware,
> goes along with starting a timeout timer for the job. The timeout always includes
> time for the firmware/hardware to get it prepped, as well as time for the actual
> execution of the job (task). Thus, we want to do this:
>   1. Put the job in pending list. "Pending list" means "pends in hardware".
>   2. Start a timeout timer for the job.
>   3. Start executing the job/task. This usually involves giving it to firmware/hardware,
>      i.e. ownership of the job/task changes to another domain. In our case this is accomplished
>      by calling sched->ops->run_job().
> Perhaps move drm_sched_start_timeout() closer to sched->ops->run_job() from above and/or increase
> the timeout value?

I disagree. If the timeout starts before run_job(), there is a clear race:
the TDR can fire before run_job() is called. The entire point of this patch
is to close that race by starting the TDR after run_job() is called.

Matt

> --
> Regards,
> Luben
>