Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

On 11/9/23 20:24, Danilo Krummrich wrote:
On 11/9/23 07:52, Luben Tuikov wrote:
Hi,

On 2023-11-07 19:41, Danilo Krummrich wrote:
On 11/7/23 05:10, Luben Tuikov wrote:
Don't call drm_sched_select_entity() in drm_sched_run_job_queue(). In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that: schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that: it selects the _next_ entity
which is ready, sets up the run-queue and completion, and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), in
the case of RR scheduling this would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaving it only
in drm_sched_run_job_work().
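
(For illustration, here is a minimal, self-contained toy model of the bug.
The types and names below are invented for this sketch and are not the actual
scheduler code; what it models is that RR selection is stateful, so calling
it twice hands out two different entities.)

#include <stddef.h>

/* Toy model of stateful RR selection; not the real scheduler code. */
struct toy_entity {
	int ready;
};

struct toy_rq {
	struct toy_entity *entities;
	int count;
	int cursor;	/* models rq->current_entity */
};

/* Return the next ready entity after the cursor, advancing the cursor. */
static struct toy_entity *toy_select_entity_rr(struct toy_rq *rq)
{
	int i;

	for (i = 1; i <= rq->count; i++) {
		int idx = (rq->cursor + i) % rq->count;

		if (rq->entities[idx].ready) {
			rq->cursor = idx;	/* side effect of selecting */
			return &rq->entities[idx];
		}
	}
	return NULL;
}

/*
 * With entities A and B both ready, the first call (from the old
 * drm_sched_run_job_queue()) returns A and moves the cursor onto A;
 * the second call (from drm_sched_run_job_work()) then returns B,
 * and A's turn is silently skipped.
 */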

v2: Rebased on top of Tvrtko's rename series of patches. (Luben)
    Added a Fixes tag. (Tvrtko)

Signed-off-by: Luben Tuikov <ltuikov89@xxxxxxxxx>
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
---
   drivers/gpu/drm/scheduler/sched_main.c | 16 +++-------------
   1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 27843e37d9b769..cd0dc3f81d05f0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 }
 
 /**
- * __drm_sched_run_job_queue - enqueue run-job work
+ * drm_sched_run_job_queue - enqueue run-job work
  * @sched: scheduler instance
  */
-static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
 {
 	if (!READ_ONCE(sched->pause_submit))
 		queue_work(sched->submit_wq, &sched->work_run_job);
@@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
 {
 	if (drm_sched_can_queue(sched))
-		__drm_sched_run_job_queue(sched);
+		drm_sched_run_job_queue(sched);
 }
 
 /**
@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
-/**
- * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
-{
-	if (drm_sched_select_entity(sched))

Hm, now that I'm rebasing my patch to implement dynamic job-flow control, I realize that
we probably need peek semantics here. If we do not select an entity here, we also
do not check whether the corresponding job fits on the ring.

Alternatively, we simply can't do this check in drm_sched_wakeup(). The consequence would
be that we don't detect that we need to wait for credits to free up until the run work is
already executing and has selected an entity.

So I rebased v5 on top of the latest drm-misc-next, looked around, and found that
drm_sched_wakeup() is missing drm_sched_entity_is_ready(). It should look like the following:

Yeah, but that's just the consequence of re-basing it onto Tvrtko's patch.

My point is that by removing drm_sched_select_entity() from drm_sched_run_job_queue() we not
only lose the check whether the selected entity is ready, but also whether we have enough
credits to actually run a new job. This can lead to queuing up work that does nothing but call
drm_sched_select_entity() and return.
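
As a sketch of that failure mode (simplified from drm_sched_run_job_work(); the pause
check and the actual job run are omitted):

static void sketch_run_job_work(struct work_struct *w)
{
	struct drm_gpu_scheduler *sched =
		container_of(w, struct drm_gpu_scheduler, work_run_job);
	struct drm_sched_entity *entity;

	entity = drm_sched_select_entity(sched);
	if (!entity)
		return;	/* the work item was queued, scheduled and
			 * dispatched only to find there is nothing
			 * to do */

	/* ... pop and run the job ... */
}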

Ok, I see it now. We don't need to peek; we know the entity at drm_sched_wakeup().

However, the missing drm_sched_entity_is_ready() check should already have been added when
drm_sched_select_entity() was removed. Gonna send a fix for that as well.

- Danilo


By peeking at the entity we could know this *before* scheduling the work and hence avoid some
CPU scheduler overhead.

However, since this patch has already landed and we can fail the same way if the selected
entity isn't ready, I don't consider this a blocker for the credit patch; hence I will send
out a v6.


void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
              struct drm_sched_entity *entity)
{
    if (drm_sched_entity_is_ready(entity))
        if (drm_sched_can_queue(sched, entity))
            drm_sched_run_job_queue(sched);
}

See the attached patch. (Currently running with base-commit and the attached patch.)
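
For context, with the two-argument signature a caller passes the entity whose job it has
just queued. An abridged sketch of the push-job path (based on drm_sched_entity_push_job();
"first" is whether the entity's job queue went from empty to non-empty):

void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
{
	struct drm_sched_entity *entity = sched_job->entity;
	bool first;

	/* ... */
	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
	if (first) {
		/* ... */
		/* Only wake the scheduler when the entity became ready. */
		drm_sched_wakeup(entity->rq->sched, entity);
	}
}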



