Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

Danilo Krummrich <dakr@xxxxxxxxxx> · Wed, 8 Nov 2023 01:41:15 +0100

On 11/7/23 05:10, Luben Tuikov wrote:
Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that, schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of the Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
which is ready, sets up the run-queue and completion and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
in drm_sched_run_job_work().

v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
     Add fixes-tag. (Tvrtko)

Signed-off-by: Luben Tuikov <ltuikov89@xxxxxxxxx>
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
---
  drivers/gpu/drm/scheduler/sched_main.c | 16 +++-------------
  1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 27843e37d9b769..cd0dc3f81d05f0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
  }
  
  /**
- * __drm_sched_run_job_queue - enqueue run-job work
+ * drm_sched_run_job_queue - enqueue run-job work
   * @sched: scheduler instance
   */
-static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
  {
  	if (!READ_ONCE(sched->pause_submit))
  		queue_work(sched->submit_wq, &sched->work_run_job);
@@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
  void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
  {
  	if (drm_sched_can_queue(sched))
-		__drm_sched_run_job_queue(sched);
+		drm_sched_run_job_queue(sched);
  }
  
  /**
@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
  }
  EXPORT_SYMBOL(drm_sched_pick_best);
  
-/**
- * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
-{
-	if (drm_sched_select_entity(sched))

Hm, now that I rebase my patch to implement dynamic job-flow control I recognize that
we probably need the peek semantics here. If we do not select an entity here, we also
do not check whether the corresponding job fits on the ring.

Alternatively, we simply can't do this check in drm_sched_wakeup(). The consequence would
be that we don't detect that we need to wait for credits to free up before the run work is
already executing and the run work selects an entity.

- Danilo

-		__drm_sched_run_job_queue(sched);
-}
-
  /**
   * drm_sched_free_job_work - worker to call free_job
   *

base-commit: 27d9620e9a9a6bc27a646b464b85860d91e21af3