Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.

Yes - Christian just pushed it to drm-misc-next - I guess Alex/Christian didn't pull it into amd-staging-drm-next yet.

Andrey

On 12/2/19 2:24 PM, Deng, Emily wrote:
[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,
     Seems this patch is still not in amd-staging-drm-next?

Best wishes
Emily Deng



-----Original Message-----
From: Deng, Emily
Sent: Tuesday, November 26, 2019 4:41 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Koenig,
Christian <Christian.Koenig@xxxxxxx>; steven.price@xxxxxxx
Subject: RE: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.

[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Emily Deng <Emily.Deng@xxxxxxx>

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Sent: Tuesday, November 26, 2019 7:37 AM
Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
Koenig, Christian <Christian.Koenig@xxxxxxx>; Deng, Emily
<Emily.Deng@xxxxxxx>; steven.price@xxxxxxx
Subject: Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.

Ping

Andrey

On 11/25/19 3:51 PM, Andrey Grodzovsky wrote:
Problem:
Due to a race between drm_sched_cleanup_jobs in the sched thread and
drm_sched_job_timedout in the timeout work there is a possibility that
the bad job was already freed while still being accessed from the
timeout thread.

Fix:
Instead of just peeking at the bad job in the mirror list, remove it
from the list under the lock and then put it back later, once we are
guaranteed that no race with the main sched thread is possible, which
is after the thread is parked.

v2: Lock around processing ring_mirror_list in drm_sched_cleanup_jobs.

v3: Rebase on top of drm-misc-next. v2 is not needed anymore as
drm_sched_get_cleanup_job already has a lock there.

v4: Fix comments to reflect latest code in drm-misc.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Reviewed-by: Christian König <christian.koenig@xxxxxxx>
Tested-by: Emily Deng <Emily.Deng@xxxxxxx>
---
   drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++++
   1 file changed, 27 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 6774955..1bf9c40 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -284,10 +284,21 @@ static void drm_sched_job_timedout(struct work_struct *work)
   	unsigned long flags;

   	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
+
+	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
+	spin_lock_irqsave(&sched->job_list_lock, flags);
   	job = list_first_entry_or_null(&sched->ring_mirror_list,
   				       struct drm_sched_job, node);

   	if (job) {
+		/*
+		 * Remove the bad job so it cannot be freed by concurrent
+		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
+		 * is parked at which point it's safe.
+		 */
+		list_del_init(&job->node);
+		spin_unlock_irqrestore(&sched->job_list_lock, flags);
+
   		job->sched->ops->timedout_job(job);

   		/*
@@ -298,6 +309,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
   			job->sched->ops->free_job(job);
   			sched->free_guilty = false;
   		}
+	} else {
+		spin_unlock_irqrestore(&sched->job_list_lock, flags);
   	}

   	spin_lock_irqsave(&sched->job_list_lock, flags);
@@ -370,6 +383,20 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
   	kthread_park(sched->thread);

   	/*
+	 * Reinsert back the bad job here - now it's safe as
+	 * drm_sched_get_cleanup_job cannot race against us and release the
+	 * bad job at this point - we parked (waited for) any in progress
+	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
+	 * now until the scheduler thread is unparked.
+	 */
+	if (bad && bad->sched == sched)
+		/*
+		 * Add at the head of the queue to reflect it was the earliest
+		 * job extracted.
+		 */
+		list_add(&bad->node, &sched->ring_mirror_list);
+
+	/*
   	 * Iterate the job list from later to  earlier one and either deactive
   	 * their HW callbacks or remove them from mirror list if they already
   	 * signaled.
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel