>> [SNIP]
>>
>> A good bunch of the problems we have here are caused by abusing the
>> job as state for executing the submission on the hardware. The
>> original design idea of the scheduler instead was to only have the
>> job as intermediate state between submission and picking what to
>> execute next. E.g. the scheduler in that view is just a pipeline
>> where you push jobs in on one end and jobs come out on the other end.
>
> So let's get a bit more precise about this:
>
> 1. Driver enqueues with drm_sched_job_arm()
> 2. Job ends up in pending_list
> 3. Sooner or later the scheduler calls run_job()
> 4. Job is pushed to hardware
> 5. Fence gets signaled
> 6. ???
>
> What would the "come out on the other end" part you describe look
> like? How would jobs get removed from pending_list and, accordingly,
> how would we avoid leaks?
Let me describe the full solution a bit further down.
>> In that approach the object which represents the hardware execution
>> is the dma_fence instead of the job. And the dma_fence has a well
>> defined lifetime, error handling, etc... So when we go back to this
>> original approach it would mean that we don't need to wait for any
>> cleanup because the scheduler wouldn't be involved in the execution
>> of the jobs on the hardware any more.
>
> It would be involved (to speak precisely) in the sense of the
> scheduler still being the one who pushes the job onto the hardware,
> agreed?
Yes, exactly.
See in the original design the "job" wasn't even a defined structure, but rather just a void*.
So basically what the driver did was telling the scheduler "here I have a bunch of different void*, please tell me which one to run first".
And apart from being this identifier this void* had absolutely no meaning for the scheduler.
>> The worst thing that could happen is that the driver messes things
>> up and still has a not yet executed job in an entity,
>
> I can't fully follow. So in your mind, the driver would personally
> set the dma_fence callback and thereby be informed about the job
> being completed, right?
No. The run_job callback would still return the hw fence so that the scheduler can install the callback and so gets informed when the next job could be run.
> But you wouldn't want to aim for getting rid of struct drm_sched_job
> completely, or would you?
No, the drm_sched_job structure was added to aid the single producer single consumer queue and so made it easier to put the jobs into a container.
In my mind the full solution for running jobs looks like this:
1. Driver enqueues with drm_sched_job_arm()
2. job ends up in pending_list
3. Sooner or later scheduler calls run_job()
4. In return scheduler gets a dma_fence representing the resulting hardware operation.
5. This fence is put into a container to keep track of what the HW is actually executing.
6. At some point the fence gets signaled informing the scheduler that the next job can be pushed if enough credits are now available.
There is no free_job callback any more because after run_job is called the scheduler is done with the job, and only the dma_fence which represents the actual HW operation is the object the scheduler now works with.
We would still need something like a kill_job callback which is used when an entity is released without flushing all jobs (see drm_sched_entity_kill()), but that is then only used for a specific corner case.
Blocking for the cleanup in drm_sched_fini() then becomes a trivial dma_fence_wait() on the remaining unsignaled HW fences and eventually on the latest killed fence.
The problem with this approach is that the idea of re-submitting jobs in a GPU reset by the scheduler is then basically dead. But to be honest that never worked correctly.
See the deprecated warning I already put on drm_sched_resubmit_jobs(). The job lifetime is not really well defined because of this; see the free_guilty hack as well.
Regards,
Christian.
> Regards,
> P.