Re: [PATCH 1/4] drm/scheduler: Add drm_sched_cancel_all_jobs helper

On 06/02/2025 13:46, Christian König wrote:
Am 06.02.25 um 14:35 schrieb Philipp Stanner:
On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote:
The helper copies code from the existing
amdgpu_job_stop_all_jobs_on_sched with the purpose of reducing the
amount of driver code which directly touches scheduler internals.

If or when amdgpu manages to change its approach for handling the
permanently wedged state, this helper can be removed.
Have you checked how many other drivers might need such a helper?

I have somewhat mixed feelings about this because, AFAICT, in the past
helpers have been added for just one driver, such as
drm_sched_wqueue_ready(), and then they have stayed for almost a
decade.

AFAIU this is just a code move, and it only really "decouples" amdgpu
in the sense of having an official scheduler function that does what
amdgpu used to do.

So my tendency here would be to continue "allowing" amdgpu to touch the
scheduler internals until amdgpu fixes this "permanently wedged state".
And if that's too difficult, couldn't the helper reside in an
amdgpu/sched_helpers.c or similar?

I think that's better than adding one helper for just one driver and
then supposedly removing it again in the future.

Yeah, agreed with that general approach.

What amdgpu does here is kind of nasty and looks unnecessary, but changing it means we need time from Hawkings and his people involved with RAS for amdgpu.

When we move the code to the scheduler, we make it an official scheduler interface for others to replicate, and that is exactly what we should try to avoid.

So my suggestion is to add a /* TODO: This is nasty and should be avoided */ comment to the amdgpu code instead.
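
I.e. roughly the following, placed above the existing helper in amdgpu_job.c (a sketch only; the exact wording of the comment is just an example):

/*
 * TODO: Signalling scheduler fences behind the scheduler's back like
 * this is nasty and should be avoided. Remove once amdgpu's RAS
 * "permanently wedged" handling is reworked.
 */
void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched);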

So I got a no-go on exporting a low-level queue pop helper, and a no-go on moving the whole dodgy code to common (reasonable). Is there a third way to break the status quo? What if I respin with just a change local to amdgpu which, instead of duplicating the to_drm_sched_job macro, duplicates __drm_sched_entity_queue_pop from 3/4 of this series, something like the sketch below?
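
Rough sketch of the amdgpu-local duplicate I have in mind (it assumes the 3/4 helper is essentially a NULL-safe spsc_queue_pop() plus container_of() on the job's queue_node, and the amdgpu_ prefixed name is only a placeholder):

/* amdgpu-local stand-in for __drm_sched_entity_queue_pop from 3/4. */
static struct drm_sched_job *
amdgpu_sched_entity_queue_pop(struct drm_sched_entity *entity)
{
	struct spsc_node *node = spsc_queue_pop(&entity->job_queue);

	if (!node)
		return NULL;

	return container_of(node, struct drm_sched_job, queue_node);
}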

Regards,

Tvrtko


Regards,
Christian.


P.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
Cc: Christian König <christian.koenig@xxxxxxx>
Cc: Danilo Krummrich <dakr@xxxxxxxxxx>
Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
Cc: Philipp Stanner <phasta@xxxxxxxxxx>
---
  drivers/gpu/drm/scheduler/sched_main.c | 44 ++++++++++++++++++++++++++
  include/drm/gpu_scheduler.h            |  1 +
  2 files changed, 45 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a48be16ab84f..0363655db22d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -703,6 +703,50 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
  }
  EXPORT_SYMBOL(drm_sched_start);
+/**
+ * drm_sched_cancel_all_jobs - Cancel all queued and scheduled jobs
+ *
+ * @sched: scheduler instance
+ * @errno: error value to set on signaled fences
+ *
+ * Signal all queued and scheduled jobs and set them to error state.
+ *
+ * Scheduler must be stopped before calling this.
+ */
+void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno)
+{
+    struct drm_sched_entity *entity;
+    struct drm_sched_fence *s_fence;
+    struct drm_sched_job *job;
+    enum drm_sched_priority p;
+
+    drm_WARN_ON_ONCE(sched, !sched->pause_submit);
+
+    /* Signal all jobs not yet scheduled */
+    for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
+        struct drm_sched_rq *rq = sched->sched_rq[p];
+
+        spin_lock(&rq->lock);
+        list_for_each_entry(entity, &rq->entities, list) {
+            while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
+                s_fence = job->s_fence;
+                dma_fence_signal(&s_fence->scheduled);
+                dma_fence_set_error(&s_fence->finished, errno);
+                dma_fence_signal(&s_fence->finished);
+            }
+        }
+        spin_unlock(&rq->lock);
+    }
+
+    /* Signal all jobs already scheduled to HW */
+    list_for_each_entry(job, &sched->pending_list, list) {
+        s_fence = job->s_fence;
+        dma_fence_set_error(&s_fence->finished, errno);
+        dma_fence_signal(&s_fence->finished);
+    }
+}
+EXPORT_SYMBOL(drm_sched_cancel_all_jobs);
+
  /**
   * drm_sched_resubmit_jobs - Deprecated, don't use in new code!
   *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index a0ff08123f07..298513f8c327 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
  void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
+void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno);
  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
  void drm_sched_increase_karma(struct drm_sched_job *bad);
  void drm_sched_reset_karma(struct drm_sched_job *bad);
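
For reference, a driver would be expected to use the new helper roughly as follows (a hypothetical teardown snippet, not part of the patch; mydrv_wedge() and the -ENODEV error choice are placeholders, and it assumes drm_sched_wqueue_stop() is what sets pause_submit):

/* Hypothetical wedged-device teardown using the new helper. */
static void mydrv_wedge(struct drm_gpu_scheduler *sched)
{
	/* Stop submission first; the helper WARNs if the scheduler still runs. */
	drm_sched_wqueue_stop(sched);

	/* Signal every queued and in-flight job with an error. */
	drm_sched_cancel_all_jobs(sched, -ENODEV);
}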



