On 02.12.19 at 15:43, Nirmoy wrote:
On 11/29/19 7:42 PM, Christian König wrote:
On 29.11.19 at 15:29, Nirmoy wrote:
Hi Christian,
On 11/26/19 10:45 AM, Christian König wrote:
It looks like a start, but there are numerous things which need to be
fixed.
Question number one is: What's that good for? Entities are not the
problem here. The real issue is the fence ring and the rq_list.
The rq_list could actually be made constant since it should never
be changed by the entity. It is only changed for backward
compatibility in drm_sched_entity_set_priority().
So I would start there and clean up drm_sched_entity_set_priority()
to actually just set a new constant rq list instead.
I am missing some context here. Can you please explain a bit more? I
looked it over and over again, but I still don't understand what you
mean by a new constant rq list :/
Ok, that needs a bit wider explanation.
The GPU scheduler consists mainly of drm_gpu_scheduler instances.
Each of those instances contains multiple runqueues with different
priorities (5 IIRC).
Now for each entity we give a list of runqueues this entity can be
served on, i.e. where the jobs which are pushed to the entity are
executed.
The entity itself keeps a copy of that runqueue list because we have
drm_sched_entity_set_priority(), which modifies this runqueue list.
But essentially that is complete overkill; the runqueue lists are
constant for each amdgpu device, e.g. all contexts should use SDMA0
and SDMA1 in the same way.
In other words, building the list of runqueues should happen only once
and not for each context.
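To make that concrete, here is a simplified sketch of the structures
involved (just a sketch, not the literal definitions from
include/drm/gpu_scheduler.h):

/*
 * Simplified sketch, not the literal kernel definitions: each
 * drm_gpu_scheduler instance owns one runqueue per priority level,
 * while every entity currently carries its own copy of the runqueue
 * list it may be scheduled on, rebuilt for every context.
 */
struct drm_gpu_scheduler {
	struct drm_sched_rq	sched_rq[DRM_SCHED_PRIORITY_MAX];
	/* ... */
};

struct drm_sched_entity {
	struct drm_sched_rq	**rq_list;	/* per-entity copy */
	unsigned int		num_rq_list;
	/* ... */
};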
Okay, now I understand the real problem. Thanks for the detailed
explanation.
Multiple approaches to fix this would be possible. One rather elegant
solution would be to change the rq list into a list of scheduler
instances plus a priority.
Do you mean something like:
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 684692a8ed76..ac67f8f098fa 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -81,7 +81,7 @@ enum drm_sched_priority {
 struct drm_sched_entity {
 	struct list_head		list;
 	struct drm_sched_rq		*rq;
-	struct drm_sched_rq		**rq_list;
+	struct drm_gpu_scheduler	**sched;
 	unsigned int			num_rq_list;
 	spinlock_t			rq_lock;
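With that change, picking the runqueue would, as far as I understand
it, become a simple lookup. A rough sketch of my understanding only,
assuming the entity also grows a priority member (not part of the
diff above) and using a hypothetical helper name:

static struct drm_sched_rq *
drm_sched_entity_pick_rq(struct drm_sched_entity *entity, unsigned int idx)
{
	/*
	 * Hypothetical helper; entity->priority is an assumed new
	 * member. The scheduler list itself stays constant.
	 */
	struct drm_gpu_scheduler *sched = entity->sched[idx];

	return &sched->sched_rq[entity->priority];
}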
Yes, exactly. The problem is that I'm not 100% sure whether that really
works with all users of the rq_list.
Regards,
Christian.
This way we would also fix the age-old bug that changing the priority
of a context could actually mess up already scheduled jobs.
The alternative I noted before would be to drop
drm_sched_entity_set_priority() or change it into
drm_sched_entity_set_runqueues().
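Just to illustrate the idea, a rough sketch of what I have in mind
(names and locking illustrative only, not a finished implementation):

void drm_sched_entity_set_runqueues(struct drm_sched_entity *entity,
				    struct drm_sched_rq **rq_list,
				    unsigned int num_rq_list)
{
	/*
	 * Swap in a different, constant rq list under the entity's
	 * lock instead of rewriting the priorities of each runqueue.
	 */
	spin_lock(&entity->rq_lock);
	entity->rq_list = rq_list;
	entity->num_rq_list = num_rq_list;
	spin_unlock(&entity->rq_lock);
}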
I was working on it, but then I got stuck on a "BUG: kernel NULL
pointer dereference, address:" which I am still trying to figure out.
Regards,
Christian.
Then we could embed the fences in amdgpu_ctx_entity as a dynamic
array at the end of the structure.
And last, we can start to dynamically allocate and initialize the
amdgpu_ctx_entity structures.
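Roughly like this (a sketch only, field names guessed and the fence
ring size assumed to be amdgpu_sched_jobs, not the actual amdgpu code):

struct amdgpu_ctx_entity {
	uint64_t		sequence;
	struct drm_sched_entity	entity;
	struct dma_fence	*fences[];	/* fence ring embedded at the end */
};

/* allocate one entity with room for its fence ring in a single allocation */
struct amdgpu_ctx_entity *ctx_entity =
	kzalloc(struct_size(ctx_entity, fences, amdgpu_sched_jobs), GFP_KERNEL);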
Regards,
Christian.
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx