Re: [PATCH 4/4] drm/scheduler: do not keep a copy of sched list

Christian König <christian.koenig@xxxxxxx> · Tue, 10 Dec 2019 18:32:23 +0100

Am 10.12.19 um 16:08 schrieb Nirmoy:
I think amdgpu_ctx_init() should check for num_scheds and not call 
drm_sched_entity_init()

if its zero.

Ah, that's where that came from. No that is intentionally this way, but 
see below.


On 12/10/19 3:47 PM, Nirmoy wrote:

On 12/10/19 2:00 PM, Christian König wrote:
Am 10.12.19 um 13:53 schrieb Nirmoy Das:
entity should not keep copy and maintain sched list for
itself.

Signed-off-by: Nirmoy Das <nirmoy.das@xxxxxxx>
---
  drivers/gpu/drm/scheduler/sched_entity.c | 10 +---------
  1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index f9b6ce29c58f..a5f729f421f8 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -67,17 +67,10 @@ int drm_sched_entity_init(struct 
drm_sched_entity *entity,
      entity->guilty = guilty;
      entity->num_sched_list = num_sched_list;
      entity->priority = priority;
-    entity->sched_list =  kcalloc(num_sched_list,
-                      sizeof(struct drm_gpu_scheduler *), 
GFP_KERNEL);
+    entity->sched_list =  sched_list;

Maybe make this "num_sched_list > 1 ? sched_list : NULL" to avoid 
accidentally dereferencing a stale pointer to the stack.
Do you mean "num_sched_list >= 1 ? sched_list : NULL"

No, the entity->sched_list field should be NULL when num_sched_list==1.

When num_sched_list==1 the entity->sched_list shouldn't be used and we 
can use a dummy list on the stack as parameter. But we should set the 
pointer to NULL in this case just to make sure that nobody is 
dereferencing it.


  -    if(!entity->sched_list)
-        return -ENOMEM;
        init_completion(&entity->entity_idle);
-
-    for (i = 0; i < num_sched_list; i++)
-        entity->sched_list[i] = sched_list[i];
-
      if (num_sched_list)

That check can actually be dropped as well. We return -EINVAL when 
the num_sched_list is zero.

This  one took me some time to understand. So we don't really return 
-EINVAL if num_sched_list == 0

if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])))
                return -EINVAL;

This is coming from below patch. Are we suppose to tolerate IPs with 
uninitialized sched so that ctx creation dosn't return error ?

Yes, exactly. That's intentionally this way.

GPU reset sometimes resulted in schedulers being disabled because we 
couldn't re-init them. In this case we had entities with an empty 
scheduler list.

That can't happen any more after you make the scheduler arrays constant, 
but I would stick with that behavior.

Regards,
Christian.


commit 1decbf6bb0b4dc56c9da6c5e57b994ebfc2be3aa
Author: Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx>
Date:   Wed Jan 30 02:53:19 2019 +0100

    drm/sched: Fix entities with 0 rqs.

    Some blocks in amdgpu can have 0 rqs.

    Job creation already fails with -ENOENT when entity->rq is NULL,
    so jobs cannot be pushed. Without a rq there is no scheduler to
    pop jobs, and rq selection already does the right thing with a
    list of length 0.

    So the operations we need to fix are:
      - Creation, do not set rq to rq_list[0] if the list can have 
length 0.
      - Do not flush any jobs when there is no rq.
      - On entity destruction handle the rq = NULL case.
      - on set_priority, do not try to change the rq if it is NULL.


Regards,
Christian.

          entity->rq = 
&entity->sched_list[0]->sched_rq[entity->priority];
  @@ -312,7 +305,6 @@ void drm_sched_entity_fini(struct 
drm_sched_entity *entity)
        dma_fence_put(entity->last_scheduled);
      entity->last_scheduled = NULL;
-    kfree(entity->sched_list);
  }
  EXPORT_SYMBOL(drm_sched_entity_fini);


_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx