On 25/11/2019 14:10, Andrey Grodzovsky wrote: > Problem: > Due to a race between drm_sched_cleanup_jobs in sched thread and > drm_sched_job_timedout in timeout work there is a possiblity that > bad job was already freed while still being accessed from the > timeout thread. > > Fix: > Instead of just peeking at the bad job in the mirror list > remove it from the list under lock and then put it back later when > we are garanteed no race with main sched thread is possible which > is after the thread is parked. > > v2: Lock around processing ring_mirror_list in drm_sched_cleanup_jobs. > > v3: Rebase on top of drm-misc-next. v2 is not needed anymore as > drm_sched_cleanup_jobs already has a lock there. > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> > Reviewed-by: Christian König <christian.koenig@xxxxxxx> > Tested-by: Emily Deng <Emily.Deng@xxxxxxx> > --- > drivers/gpu/drm/scheduler/sched_main.c | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 6774955..a604dfa 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -284,10 +284,24 @@ static void drm_sched_job_timedout(struct work_struct *work) > unsigned long flags; > > sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work); > + > + /* > + * Protects against concurrent deletion in drm_sched_cleanup_jobs that ^^^^^^^^^^^^^^^^^^^^^ drm_sched_cleanup_jobs() is gone - replaced with drm_sched_get_cleanup_job() and code in drm_sched_main(). See: 588b9828f074 ("drm: Don't free jobs in wait_event_interruptible()") > + * is already in progress. > + */ > + spin_lock_irqsave(&sched->job_list_lock, flags); > job = list_first_entry_or_null(&sched->ring_mirror_list, > struct drm_sched_job, node); > > if (job) { > + /* > + * Remove the bad job so it cannot be freed by already in progress > + * drm_sched_cleanup_jobs. It will be reinsrted back after sched->thread ^^^^^^^^^ Typo: s/reinsrted/reinserted/ > + * is parked at which point it's safe. > + */ > + list_del_init(&job->node); > + spin_unlock_irqrestore(&sched->job_list_lock, flags); > + > job->sched->ops->timedout_job(job); > > /* > @@ -298,6 +312,8 @@ static void drm_sched_job_timedout(struct work_struct *work) > job->sched->ops->free_job(job); > sched->free_guilty = false; > } > + } else { > + spin_unlock_irqrestore(&sched->job_list_lock, flags); > } > > spin_lock_irqsave(&sched->job_list_lock, flags); > @@ -370,6 +386,19 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad) > kthread_park(sched->thread); > > /* > + * Reinsert back the bad job here - now it's safe as drm_sched_cleanup_jobs Another reference to drm_sched_cleanup_jobs. > + * cannot race against us and release the bad job at this point - we parked > + * (waited for) any in progress (earlier) cleanups and any later ones will > + * bail out due to sched->thread being parked. As explained in the other patch - after the kthread_park() has returned there will be no more calls to drm_sched_get_cleanup_job() until the thread is unparked. Steve > + */ > + if (bad && bad->sched == sched) > + /* > + * Add at the head of the queue to reflect it was the earliest > + * job extracted. > + */ > + list_add(&bad->node, &sched->ring_mirror_list); > + > + /* > * Iterate the job list from later to earlier one and either deactive > * their HW callbacks or remove them from mirror list if they already > * signaled. > _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx