On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
> On 15.11.24 at 14:55, Philipp Stanner wrote:
> > > [SNIP]
> > > 
> > > A good bunch of the problems we have here are caused by abusing
> > > the job as state for executing the submission on the hardware.
> > > 
> > > The original design idea of the scheduler instead was to only
> > > have the job as intermediate state between submission and
> > > picking what to execute next. E.g. the scheduler in that view is
> > > just a pipeline where you push jobs in on one end and jobs come
> > > out on the other end.
> > 
> > So let's get a bit more precise about this:
> > 1. Driver enqueues with drm_sched_job_arm()
> > 2. Job ends up in pending_list
> > 3. Sooner or later scheduler calls run_job()
> > 4. Job is pushed to hardware
> > 5. Fence gets signaled
> > 6. ???
> > 
> > What would the "come out on the other end" part you describe look
> > like?
> > 
> > How would jobs get removed from pending_list and, accordingly, how
> > would we avoid leaks?
> 
> Let me describe the full solution a bit further down.
> 
> > > In that approach the object which represents the hardware
> > > execution is the dma_fence instead of the job. And the dma_fence
> > > has a well defined lifetime, error handling, etc...
> > > 
> > > So when we go back to this original approach it would mean that
> > > we don't need to wait for any cleanup because the scheduler
> > > wouldn't be involved in the execution of the jobs on the
> > > hardware any more.
> > 
> > It would be involved (to speak precisely) in the sense of the
> > scheduler still being the one who pushes the job onto the
> > hardware, agreed?
> 
> Yes, exactly.
> 
> See, in the original design the "job" wasn't even a defined
> structure, but rather just a void*.
> 
> So basically what the driver did was telling the scheduler: here I
> have a bunch of different void*, please tell me which one to run
> first.
> 
> And apart from being this identifier, this void* had absolutely no
> meaning for the scheduler.

Interesting.

> > > The worst thing that could happen is that the driver messes
> > > things up and still has a not yet executed job in an entity,
> > 
> > I can't fully follow.
> > 
> > So in your mind, the driver itself would set the dma_fence
> > callback and thereby be informed about the job being completed,
> > right?
> 
> No. The run_job callback would still return the hw fence so that the
> scheduler can install the callback and so gets informed when the
> next job could be run.
> 
> > But you wouldn't want to aim for getting rid of struct
> > drm_sched_job completely, or would you?
> 
> No, the drm_sched_job structure was added to aid the single-producer
> single-consumer queue and so made it easier to put the jobs into a
> container.
> 
> In my mind the full solution for running jobs looks like this:
> 
> 1. Driver enqueues with drm_sched_job_arm()
> 2. Job ends up in pending_list
> 3. Sooner or later scheduler calls run_job()
> 4. In return the scheduler gets a dma_fence representing the
>    resulting hardware operation.
> 5. This fence is put into a container to keep around what the HW is
>    actually executing.
> 6. At some point the fence gets signaled, informing the scheduler
>    that the next job can be pushed if enough credits are now
>    available.
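
If I'm reading this right, the core of that pipeline would look
roughly like the sketch below. Just to make sure we're talking about
the same thing: the container structs and everything prefixed with
"hypothetical_" are made up by me, not existing scheduler code, and
error handling is omitted.

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/slab.h>

/* Invented container, *not* part of struct drm_gpu_scheduler. */
struct hypothetical_hw_fence_list {
        struct drm_gpu_scheduler        *sched;
        struct list_head                fences;
};

struct hypothetical_hw_fence_entry {
        struct list_head                node;
        struct dma_fence                *fence; /* what run_job() returned */
        struct dma_fence_cb             cb;
        struct hypothetical_hw_fence_list *list;
        u32                             credits;
};

/* Stand-ins; no such helpers exist in the real scheduler. */
static void hypothetical_return_credits(struct drm_gpu_scheduler *sched,
                                        u32 credits) { }
static void hypothetical_wake_up_scheduler(struct drm_gpu_scheduler *sched) { }

static void hypothetical_hw_fence_signaled(struct dma_fence *f,
                                           struct dma_fence_cb *cb)
{
        struct hypothetical_hw_fence_entry *e =
                container_of(cb, struct hypothetical_hw_fence_entry, cb);

        /*
         * 6. HW is done with this operation: give the credits back and
         *    check whether the next job can be pushed. Removing and
         *    freeing the entry would happen from work context.
         */
        hypothetical_return_credits(e->list->sched, e->credits);
        hypothetical_wake_up_scheduler(e->list->sched);
}

/* 3.-5.: push one job, from then on only track the returned fence. */
static void hypothetical_run_one_job(struct hypothetical_hw_fence_list *l,
                                     struct drm_sched_job *job)
{
        struct hypothetical_hw_fence_entry *e = kzalloc(sizeof(*e), GFP_KERNEL);
        struct dma_fence *hw_fence;

        /* 3./4. Push to HW; the returned fence now represents the HW op. */
        hw_fence = l->sched->ops->run_job(job);

        /* 5. Keep only the fence around; there is no free_job() anymore. */
        e->fence   = dma_fence_get(hw_fence);
        e->list    = l;
        e->credits = job->credits;
        list_add_tail(&e->node, &l->fences);

        /* Returns -ENOENT if the fence already signaled; ignored here. */
        dma_fence_add_callback(hw_fence, &e->cb, hypothetical_hw_fence_signaled);
}
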
> There is no free_job callback any more because after run_job is
> called the scheduler is done with the job and only the dma_fence
> which represents the actual HW operation is the object the scheduler
> now works with.
> 
> We would still need something like a kill_job callback which is used
> when an entity is released without flushing all jobs (see
> drm_sched_entity_kill()), but that is then only used for a specific
> corner case.

Can't we just limit the scheduler's responsibility to telling the
driver that it has to flush, and if it doesn't, it's a bug?

> Blocking for the cleanup in drm_sched_fini() then becomes a trivial
> dma_fence_wait() on the remaining unsignaled HW fences and
> eventually on the latest killed fence.

But that results in exactly the same situation as my waitque solution,
doesn't it? Blocking infinitely in drm_sched_fini(): if the driver
doesn't guarantee that all fences will get signaled, then wait_event()
in the waitque solution will block forever. The same with
dma_fence_wait() (see the sketch in the P.S. at the end of this mail).

Or are you aiming at an interruptible dma_fence_wait(), thereby not
blocking SIGKILL? That would then still result in leaks. So basically
the same situation as with an interruptible drm_sched_flush()
function. (Although I agree that would probably be more elegant.)

> The problem with this approach is that the idea of re-submitting
> jobs in a GPU reset by the scheduler is then basically dead. But to
> be honest, that never worked correctly.
> 
> See the deprecated warning I already put on
> drm_sched_resubmit_jobs(). The job lifetime is not really well
> defined because of this, see the free_guilty hack as well.

drm_sched_resubmit_jobs() is being used by 5 drivers currently. So if
we go for this approach we have to port them first. That doesn't
sound trivial to me.

P.

> 
> Regards,
> Christian.
> 
> > Regards,
> > P.
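
P.S.: A rough sketch of the drm_sched_fini() blocking I'm worried
about above, reusing the invented fence container from the sketch
earlier in this mail (again, none of this is real scheduler code):

static void hypothetical_sched_fini_wait(struct hypothetical_hw_fence_list *l)
{
        struct hypothetical_hw_fence_entry *e;

        list_for_each_entry(e, &l->fences, node) {
                /*
                 * Non-interruptible: if the driver never signals this
                 * fence, we block here forever -- exactly like
                 * wait_event() in the waitque approach.
                 */
                dma_fence_wait(e->fence, false);

                /*
                 * Interruptible variant: dma_fence_wait(e->fence, true)
                 * returns -ERESTARTSYS on a signal, but then we'd tear
                 * down with unsignaled fences still around and leak
                 * their entries.
                 */
        }
}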