Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

Attached is a quick patch for per-job TTL calculation to make the next timer expiration more precise. It's on top of the patch in this thread. Let me know if this makes sense.

Andrey

On 2021-08-26 10:03 a.m., Andrey Grodzovsky wrote:

On 2021-08-26 12:55 a.m., Monk Liu wrote:
issue:
in cleanup_job, cancel_delayed_work will cancel a TO timer
even if its corresponding job is still running.

fix:
do not cancel the timer in cleanup_job; instead, cancel it only when
the head job is signaled, and if there is a "next" job, start the
timeout again.
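
Schematically, the timer handling after this change looks like the
following (a sketch of the control flow only; the exact code is in the
diff below):

	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
		/* head job is done: remove it and cancel its TO timer */
		list_del_init(&job->list);
		cancel_delayed_work(&sched->work_tdr);
		/* re-arm the timer for the new head job, if any */
		if (!list_empty(&sched->pending_list))
			drm_sched_start_timeout(sched);
	}
	/* a head job that is still running keeps its TO timer armed */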

v2:
further clean up the logic, and cancel the TDR timer if the signaled
job is the last one in its scheduler.

v3:
change the issue description
remove the cancel_delayed_work at the beginning of cleanup_job
restore the implementation of drm_sched_job_begin.

TODO:
1) introduce scheduler pause/resume in job_timeout to serialize the
handling of the scheduler and job_timeout (a rough sketch follows below).
2) drop the bad job's del and re-insert in the scheduler, which the above
serialization makes unnecessary (no race issue anymore).
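
A rough shape TODO 1) could take, assuming kthread_park()/kthread_unpark()
remain the mechanism for quiescing the scheduler thread (an untested
sketch, not part of this patch):

	static void drm_sched_job_timedout(struct work_struct *work)
	{
		struct drm_gpu_scheduler *sched =
			container_of(work, struct drm_gpu_scheduler,
				     work_tdr.work);

		/* pause: returns only once the scheduler thread is parked */
		kthread_park(sched->thread);

		/*
		 * pending_list is now stable; the parked thread cannot pop
		 * jobs, which is what makes the bad job's del/re-insert
		 * (TODO 2) unnecessary.
		 */

		/* ... existing driver timeout handling ... */

		/* resume */
		kthread_unpark(sched->thread);
	}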

Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
---
 drivers/gpu/drm/scheduler/sched_main.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..ecf8140 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_job *job, *next;
 
-	/*
-	 * Don't destroy jobs while the timeout worker is running OR thread
-	 * is being parked and hence assumed to not touch pending_list
-	 */
-	if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
-	    !cancel_delayed_work(&sched->work_tdr)) ||
-	    kthread_should_park())
+	if (kthread_should_park())
 		return NULL;

I actually don't see why we need to keep the above; on the other side
(in drm_sched_stop()) we won't touch the pending list anyway until the
sched thread has come to a full stop (kthread_park()). If you do see a
reason why this is needed, then I think a comment should go here.
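
For reference, the handshake I have in mind (a simplified view, not
verbatim kernel code):

	/* scheduler thread side, in drm_sched_get_cleanup_job(): */
	if (kthread_should_park())
		return NULL;	/* bail out before touching pending_list */

	/* TDR side, in drm_sched_stop(): */
	kthread_park(sched->thread);	/* blocks until the thread parks */
	/* only after this does the stop path walk or modify pending_list */

kthread_park() does not return until the scheduler thread has actually
parked, so by the time drm_sched_stop() touches pending_list the thread
is guaranteed to be out of this function.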

Andrey


 	spin_lock(&sched->job_list_lock);
@@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
 		/* remove job from pending_list */
 		list_del_init(&job->list);
+
+		/* cancel this job's TO timer */
+		cancel_delayed_work(&sched->work_tdr);
 		/* make the scheduled timestamp more accurate */
 		next = list_first_entry_or_null(&sched->pending_list,
 						typeof(*next), list);
-		if (next)
+
+		if (next) {
 			next->s_fence->scheduled.timestamp =
 				job->s_fence->finished.timestamp;
-
+			/* start TO timer for next job */
+			drm_sched_start_timeout(sched);
+		}
 	} else {
 		job = NULL;
-		/* queue timeout for next job */
-		drm_sched_start_timeout(sched);
 	}
 
 	spin_unlock(&sched->job_list_lock);
@@ -791,11 +789,8 @@ static int drm_sched_main(void *param)
 					  (entity = drm_sched_select_entity(sched))) ||
 					 kthread_should_stop());
 
-		if (cleanup_job) {
+		if (cleanup_job)
 			sched->ops->free_job(cleanup_job);
-			/* queue timeout for next job */
-			drm_sched_start_timeout(sched);
-		}
 
 		if (!entity)
 			continue;
From d4671ce3c3b18c369b512cd692aec3769f37e11a Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Date: Thu, 26 Aug 2021 16:08:01 -0400
Subject: drm/sched: Add TTL per job for timeout handling.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
 drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++--
 include/drm/gpu_scheduler.h            |  2 ++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ecf8140f6968..c8e31515803c 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -306,6 +306,7 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
 
 	spin_lock(&sched->job_list_lock);
 	list_add_tail(&s_job->list, &sched->pending_list);
+	s_job->ts = get_jiffies_64();
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
 }
@@ -695,10 +696,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 						typeof(*next), list);
 
 		if (next) {
+			uint64_t ttl;
+
 			next->s_fence->scheduled.timestamp =
 				job->s_fence->finished.timestamp;
-			/* start TO timer for next job */
-			drm_sched_start_timeout(sched);
+
+			/*
+			 * Compute precisely how much time is left for the
+			 * next job before re-arming the timer; if its TTL
+			 * already expired, schedule the TO handler now.
+			 */
+			ttl = get_jiffies_64() - job->ts;
+			if (likely(ttl < sched->timeout))
+				mod_delayed_work(system_wq, &sched->work_tdr, ttl);
+			else
+				mod_delayed_work(system_wq, &sched->work_tdr, 0);
 		}
 	} else {
 		job = NULL;
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index d18af49fd009..80cc23e799cf 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -182,6 +182,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
  * @s_priority: the priority of the job.
  * @entity: the entity to which this job belongs.
  * @cb: the callback for the parent fence in s_fence.
+ * @ts: timestamp in jiffies used to measure how long the job has been running
  *
  * A job is created by the driver using drm_sched_job_init(), and
  * should call drm_sched_entity_push_job() once it wants the scheduler
@@ -198,6 +199,7 @@ struct drm_sched_job {
 	enum drm_sched_priority		s_priority;
 	struct drm_sched_entity         *entity;
 	struct dma_fence_cb		cb;
+	uint64_t			ts;
 };
 
 static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,
-- 
2.25.1
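
One thing worth double-checking in the TTL hunk above: ttl as computed is
the elapsed time of the just-signaled job (get_jiffies_64() - job->ts),
and that elapsed value is what gets passed to mod_delayed_work() as the
new delay. If the intent is for the next job's timer to fire sched->timeout
jiffies after that job began, the remaining time (sched->timeout - elapsed,
measured against next->ts) would be the delay instead. A small
self-contained model to make the arithmetic concrete (plain userspace C
with faked jiffies values; nothing here is kernel API):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t timeout = 1000;  /* sched->timeout, in jiffies */
		uint64_t job_ts  = 5000;  /* job->ts: jiffies when the job began */
		uint64_t now     = 5400;  /* get_jiffies_64() at cleanup time */

		/* what the patch computes and arms the timer with */
		uint64_t elapsed = now - job_ts;

		/* time left before this job would hit its timeout */
		uint64_t remaining = timeout > elapsed ? timeout - elapsed : 0;

		/* prints: elapsed=400 remaining=600 */
		printf("elapsed=%llu remaining=%llu\n",
		       (unsigned long long)elapsed,
		       (unsigned long long)remaining);
		return 0;
	}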

