On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:Am 15.11.24 um 14:55 schrieb Philipp Stanner:[SNIP]But you wouldn't want to aim for getting rid of struct drm_sched_job completely, or would you?No, the drm_sched_job structure was added to aid the single producer single consumer queue and so made it easier to put the jobs into a container. In my mind the full solution for running jobs looks like this: 1. Driver enqueues with drm_sched_job_arm() 2. job ends up in pending_list 3. Sooner or later scheduler calls run_job() 4. In return scheduler gets a dma_fence representing the resulting hardware operation. 5, This fence is put into a container to keep around what the hw is actually executing. 6. At some point the fence gets signaled informing the scheduler that the next job can be pushed if enough credits are now available. There is no free_job callback any more because after run_job is called the scheduler is done with the job and only the dma_fence which represents the actually HW operation is the object the scheduler now works with. We would still need something like a kill_job callback which is used when an entity is released without flushing all jobs (see drm_sched_entity_kill()), but that is then only used for a specific corner case.Can't we just limit the scheduler's responsibility to telling the driver that it has to flush, and if it doesn't it's a bug?
We still need to remove the pending jobs from the scheduler if flushing times out.
Without timing out flushing and/or aborting when the process was killed we run into the same problem again that we don't want ti block on _fini().
Blocking for the cleanup in drm_sched_fini() then becomes a trivial dma_fence_wait() on the remaining unsignaled HW fences and eventually on the latest killed fence.But that results in exactly the same situation as my waitque solution, doesn't it?
The key part is that dma_fence's have a documented requirement to never block infinitely.
Blocking infinitely in drm_sched_fini(): If the driver doesn't guarantee that all fences will get signaled, then wait_event() in the waitque solution will block forever. The same with dma_fence_wait(). Or are you aiming at an interruptible dma_fence_wait(), thereby not blocking SIGKILL?
That is basically what drm_sched_entity_flush() is already doing.
That then would still result in leaks. So basically the same situation as with an interruptible drm_sched_flush() function. (Although I agree that would probably be more elegant)
If the wait_event really would just waiting for dma_fences then yes.
The problem with the waitqueue approach is that we need to wait for the free_job callback and that callback is normally called from a work item without holding any additional locks.
When drm_sched_fini starts to block for that we create a rat-tail of new dependencies since that one is most likely called from a file descriptor destruction.
The problem with this approach is that the idea of re-submitting jobs in a GPU reset by the scheduler is then basically dead. But to be honest that never worked correctly. See the deprecated warning I already put on drm_sched_resubmit_jobs(). The job lifetime is not really well defined because of this, see the free_guilty hack as well.drm_sched_resubmit_jobs() is being used by 5 drivers currently. So if we go for this approach we have to port them, first. That doesn't sound trivial to me
Yeah, completely agree. I think the scheduler should provide something like drm_sched_for_each_pending_fence() which helps to iterate over all the pending HW fences.
The individual drivers could then device by themself what to do, e.g. upcast the HW fence into the job and push it again or similar.
But what we really need to get away from are those fence pre-requisite violations Sima pointed out. For example we can't allocate memory for run_job.
Regards,
Christian.
P.Regards, Christian.Grüße, P.