Re: [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

Christian König <christian.koenig@xxxxxxx> · Wed, 8 Mar 2023 11:03:34 +0100

Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:

On 2023-03-07 15:25, Asahi Lina wrote:
drm_sched_fini() currently leaves any pending jobs dangling, which
causes segfaults and other badness when job completion fences are
signaled after the scheduler is torn down.

Explicitly detach all jobs from their completion callbacks and free
them. This makes it possible to write a sensible safe abstraction for
drm_sched, without having to externally duplicate the tracking of
in-flight jobs.

This shouldn't regress any existing drivers, since calling
drm_sched_fini() with any pending jobs is broken and this change should
be a no-op if there are no pending jobs.

Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
---
  drivers/gpu/drm/scheduler/sched_main.c | 27 
+++++++++++++++++++++++++--
  1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 5c0add2c7546..0aab1e0aebdd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
  void drm_sched_fini(struct drm_gpu_scheduler *sched)
  {
      struct drm_sched_entity *s_entity;
+    struct drm_sched_job *s_job, *tmp;
      int i;
  -    if (sched->thread)
-        kthread_stop(sched->thread);
+    if (!sched->thread)
+        return;
+
+    /*
+     * Stop the scheduler, detaching all jobs from their hardware 
callbacks
+     * and cleaning up complete jobs.
+     */
+    drm_sched_stop(sched, NULL);
+
+    /*
+     * Iterate through the pending job list and free all jobs.
+     * This assumes the driver has either guaranteed jobs are 
already stopped, or that
+     * otherwise it is responsible for keeping any necessary data 
structures for
+     * in-progress jobs alive even when the free_job() callback is 
called early (e.g. by
+     * putting them in its own queue or doing its own refcounting).
+     */
+    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+        spin_lock(&sched->job_list_lock);
+        list_del_init(&s_job->list);
+        spin_unlock(&sched->job_list_lock);
+        sched->ops->free_job(s_job);
+    }

I would stop the kthread first, then delete all jobs without spinlock 
since nothing else can race against sched_fini?

If you do need the spinlock, It would need to guard 
list_for_each_entry too.

Well this case here actually should not happen in the first place.

Jobs depend on their device, so as long as there are jobs there should 
also be a reference to the scheduler.

What could be is that you have allocated a scheduler instance 
dynamically, but even then you should first tear down all entities and 
then the scheduler.

Regards,
Christian.


+
+    kthread_stop(sched->thread);
        for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
DRM_SCHED_PRIORITY_MIN; i--) {
          struct drm_sched_rq *rq = &sched->sched_rq[i];