Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Thu, 6 Feb 2020 12:49:58 +0100

Am 06.02.20 um 12:10 schrieb Lucas Stach:
Hi all,

On Mi, 2020-02-05 at 19:24 +0100, Lucas Stach wrote:
Hi Andrey,

This commit breaks all drivers, which may bail out of the timeout
processing as they wish to extend the timeout (etnaviv, v3d).

Those drivers currently just return from the timeout handler before
calling drm_sched_stop(), which means with this commit applied we are
removing the first job from the ring_mirror_list, but never put it
back. This leads to jobs getting lost from the ring mirror, which then
causes quite a bit of fallout like unsignaled fences.

Not sure yet what to do about it, we can either add a function to add
the job back to the ring_mirror if the driver wants to extend the
timeout, or we could look for another way to stop
drm_sched_cleanup_jobs from freeing jobs that are currently in timeout
processing.
So after thinking about this a bit more my opinion is that we need to
revert this change for now and go back to the drawing board for the
scheduler timeout handling.

Right now this starts to feel like a big midlayer mistake with all the
very intricate intertwining between the drivers and the scheduler. The
rules on when it's safe to manipulate the ring mirror and when
completed jobs are signaled and freed are not really well specified.
The fact that we need to mutate state in order to get rid of races
instead of having a single big "timeout processing is owner of the
scheduler state for now" is a big fat warning sign IMHO.

Yes, that strongly feels like a hack to me as well. But I didn't had 
time and still haven't to take a closer look and suggest something better.

Christian.


It took me far longer than I'd like to admit to understand the failure
mode with fences not getting signaled after a GPU hang. The back and
forth between scheduler and driver code makes things really hard to
follow.

Regards,
Lucas

Regards,
Lucas

On Mo, 2019-11-25 at 15:51 -0500, Andrey Grodzovsky wrote:
Problem:
Due to a race between drm_sched_cleanup_jobs in sched thread and
drm_sched_job_timedout in timeout work there is a possiblity that
bad job was already freed while still being accessed from the
timeout thread.

Fix:
Instead of just peeking at the bad job in the mirror list
remove it from the list under lock and then put it back later when
we are garanteed no race with main sched thread is possible which
is after the thread is parked.

v2: Lock around processing ring_mirror_list in drm_sched_cleanup_jobs.

v3: Rebase on top of drm-misc-next. v2 is not needed anymore as
drm_sched_get_cleanup_job already has a lock there.

v4: Fix comments to relfect latest code in drm-misc.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Reviewed-by: Christian König <christian.koenig@xxxxxxx>
Tested-by: Emily Deng <Emily.Deng@xxxxxxx>
---
  drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++++
  1 file changed, 27 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 6774955..1bf9c40 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -284,10 +284,21 @@ static void drm_sched_job_timedout(struct work_struct *work)
  	unsigned long flags;
  
  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
+
+	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
+	spin_lock_irqsave(&sched->job_list_lock, flags);
  	job = list_first_entry_or_null(&sched->ring_mirror_list,
  				       struct drm_sched_job, node);
  
  	if (job) {
+		/*
+		 * Remove the bad job so it cannot be freed by concurrent
+		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
+		 * is parked at which point it's safe.
+		 */
+		list_del_init(&job->node);
+		spin_unlock_irqrestore(&sched->job_list_lock, flags);
+
  		job->sched->ops->timedout_job(job);
  
  		/*
@@ -298,6 +309,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
  			job->sched->ops->free_job(job);
  			sched->free_guilty = false;
  		}
+	} else {
+		spin_unlock_irqrestore(&sched->job_list_lock, flags);
  	}
  
  	spin_lock_irqsave(&sched->job_list_lock, flags);
@@ -370,6 +383,20 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
  	kthread_park(sched->thread);
  
  	/*
+	 * Reinsert back the bad job here - now it's safe as
+	 * drm_sched_get_cleanup_job cannot race against us and release the
+	 * bad job at this point - we parked (waited for) any in progress
+	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
+	 * now until the scheduler thread is unparked.
+	 */
+	if (bad && bad->sched == sched)
+		/*
+		 * Add at the head of the queue to reflect it was the earliest
+		 * job extracted.
+		 */
+		list_add(&bad->node, &sched->ring_mirror_list);
+
+	/*
  	 * Iterate the job list from later to  earlier one and either deactive
  	 * their HW callbacks or remove them from mirror list if they already
  	 * signaled.
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx