Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Tue, 31 Aug 2021 10:20:40 -0400

On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
It says patch [2/2] but I can't find patch 1.

On 2021-08-31 6:35 a.m., Monk Liu wrote:
tested-by: jingwen chen <jingwen.chen@xxxxxxx>
Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
Signed-off-by: jingwen chen <jingwen.chen@xxxxxxx>
---
   drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
   1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ecf8140..894fdb24 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
   	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
   	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
+	if (!__kthread_should_park(sched->thread))
+		kthread_park(sched->thread);
+

As mentioned before, without serializing against TDR handlers from other
schedulers you just race against them here: e.g. you parked the thread now,
but another handler already in progress will unpark it as part of calling
drm_sched_start() for the other rings [1].
Unless I am missing something, since I haven't found patch [1/2].

[1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041
You need to have your own wq and run all your tdr work on the same wq if
your reset has any cross-engine impact.


IMHO the problem with serializing, as opposed to locking (with trylock and
bail out, like we do in [1]), is multiple TO events arising from the same
cause. For example, one job may just be waiting for another, and once the
first one hangs the second will also appear to be hung, triggering its own
TO event. In this case the multiple TO events will trigger multiple resets
if we serialize, but if we use a lock with trylock the second one will
quietly bail out.

[1] 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4903

Andrey



See

https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#c.drm_sched_backend_ops

for the ->timedout_job callback docs. I thought I brought this up already?
-Daniel


Yes, this discussion is a continuation of your comment about
serializing; I mentioned before that you proposed it.

Andrey



Andrey


   	spin_lock(&sched->job_list_lock);
   	job = list_first_entry_or_null(&sched->pending_list,
   				       struct drm_sched_job, list);
   	if (job) {
-		/*
-		 * Remove the bad job so it cannot be freed by concurrent
-		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-		 * is parked at which point it's safe.
-		 */
-		list_del_init(&job->list);
   		spin_unlock(&sched->job_list_lock);
+		/* vendor's timeout_job should call drm_sched_start() */
   		status = job->sched->ops->timedout_job(job);
   		/*
@@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
   	kthread_park(sched->thread);
   	/*
-	 * Reinsert back the bad job here - now it's safe as
-	 * drm_sched_get_cleanup_job cannot race against us and release the
-	 * bad job at this point - we parked (waited for) any in progress
-	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-	 * now until the scheduler thread is unparked.
-	 */
-	if (bad && bad->sched == sched)
-		/*
-		 * Add at the head of the queue to reflect it was the earliest
-		 * job extracted.
-		 */
-		list_add(&bad->list, &sched->pending_list);
-
-	/*
   	 * Iterate the job list from later to  earlier one and either deactive
   	 * their HW callbacks or remove them from pending list if they already
   	 * signaled.