Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Thu, 25 Feb 2021 08:53:26 +0100

Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:
Ping

Sorry, I've been on vacation this week.


Andrey

On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:

On 2/20/21 3:38 AM, Christian König wrote:


Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:

On 2/18/21 10:15 AM, Christian König wrote:
Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:

On 2/18/21 3:07 AM, Christian König wrote:


Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
Problem: If scheduler is already stopped by the time sched_entity
is released and entity's job_queue not empty I encountred
a hang in drm_sched_entity_flush. This is because 
drm_sched_entity_is_idle
never becomes false.

Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy 
drm_sched_entity_is_idle.
Also wakeup all those processes stuck in sched_entity flushing
as the scheduler main thread which wakes them up is stopped by 
now.

v2:
Reverse order of drm_sched_rq_remove_entity and marking
s_entity as stopped to prevent reinserion back to rq due
to race.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
  drivers/gpu/drm/scheduler/sched_main.c | 31 
+++++++++++++++++++++++++++++++
  1 file changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 908b0b5..c6b7947 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
   */
  void drm_sched_fini(struct drm_gpu_scheduler *sched)
  {
+    int i;
+    struct drm_sched_entity *s_entity;

BTW: Please order that so that i is declared last.

      if (sched->thread)
          kthread_stop(sched->thread);
  +    /* Detach all sched_entites from this scheduler once 
it's stopped */
+    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
DRM_SCHED_PRIORITY_MIN; i--) {
+        struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+        if (!rq)
+            continue;
+
+        /* Loop this way because rq->lock is taken in 
drm_sched_rq_remove_entity */
+        spin_lock(&rq->lock);
+        while ((s_entity = 
list_first_entry_or_null(&rq->entities,
+                                struct drm_sched_entity,
+                                list))) {
+            spin_unlock(&rq->lock);
+
+            /* Prevent reinsertion and remove */
+            spin_lock(&s_entity->rq_lock);
+            s_entity->stopped = true;
+            drm_sched_rq_remove_entity(rq, s_entity);
+            spin_unlock(&s_entity->rq_lock);

Well this spin_unlock/lock dance here doesn't look correct at 
all now.

Christian.


In what way ? It's in the same same order as in other call sites 
(see drm_sched_entity_push_job and drm_sched_entity_flush).
If i just locked rq->lock and did list_for_each_entry_safe while 
manually deleting entity->list instead of calling
drm_sched_rq_remove_entity this still would not be possible as 
the order of lock acquisition between s_entity->rq_lock
and rq->lock would be reverse compared to the call sites 
mentioned above.

Ah, now I understand. You need this because 
drm_sched_rq_remove_entity() will grab the rq lock again!

Problem is now what prevents the entity from being destroyed while 
you remove it?

Christian.

Right, well, since (unfortunately) sched_entity is part of 
amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
there is a problem here that we don't increment amdgpu_ctx.refcount 
when assigning  sched_entity
to new rq (e.g. before drm_sched_rq_add_entity) and not decrement 
before removing. We do it for
amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and 
amdgpu_cs_parser_fini by
calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit 
tricky due to all the drm_sched_entity_select_rq
logic.

Another, kind of a band aid fix, would probably be just locking 
amdgpu_ctx_mgr.lock around drm_sched_fini
when finalizing the fence driver and around idr iteration in 
amdgpu_ctx_mgr_fini (which should be lock protected
anyway as I see from other idr usages in the code) ... This should 
prevent this use after free.

Puh, that's rather complicated as well. Ok let's look at it from the 
other side for a moment.

Why do we have to remove the entities from the rq in the first place?

Wouldn't it be sufficient to just set all of them to stopped?

Christian.


And adding it as another  condition in drm_sched_entity_is_idle ?

Yes, I think that should work.

Christian.



Andrey




Andrey




Andrey




+
+            spin_lock(&rq->lock);
+        }
+        spin_unlock(&rq->lock);
+
+    }
+
+    /* Wakeup everyone stuck in drm_sched_entity_flush for 
this scheduler */
+    wake_up_all(&sched->job_scheduled);
+
      /* Confirm no work left behind accessing device 
structures */
cancel_delayed_work_sync(&sched->work_tdr);




_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel