On Thu, Nov 21, 2024 at 10:06:54AM +0100, Philipp Stanner wrote: > +CC Thomas Hellström > > > On Mon, 2024-11-18 at 12:48 +0100, Christian König wrote: > > Am 15.11.24 um 17:07 schrieb Philipp Stanner: > > > > > > > > On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote: > > > > > > > > > > > Am 15.11.24 um 14:55 schrieb Philipp Stanner: > > > > > > > > > > > > > > > > > > [SNIP] > > > > > > > > > > > > > > > > > > > > > > > > > > But you wouldn't want to aim for getting rid of struct > > > > > drm_sched_job > > > > > completely, or would you? > > > > > > > > > > > > > > > > > > > > > > No, the drm_sched_job structure was added to aid the single > > > > producer > > > > single consumer queue and so made it easier to put the jobs into > > > > a > > > > container. > > > > > > > > > > > > In my mind the full solution for running jobs looks like this: > > > > > > > > 1. Driver enqueues with drm_sched_job_arm() > > > > 2. job ends up in pending_list > > > > 3. Sooner or later scheduler calls run_job() > > > > 4. In return scheduler gets a dma_fence representing the > > > > resulting > > > > hardware operation. > > > > 5, This fence is put into a container to keep around what the hw > > > > is > > > > actually executing. > > > > 6. At some point the fence gets signaled informing the scheduler > > > > that the next job can be pushed if enough credits are now > > > > available. > > > > > > > > There is no free_job callback any more because after run_job is > > > > called the scheduler is done with the job and only the dma_fence > > > > which represents the actually HW operation is the object the > > > > scheduler now works with. > > > > > > > > We would still need something like a kill_job callback which is > > > > used > > > > when an entity is released without flushing all jobs (see > > > > drm_sched_entity_kill()), but that is then only used for a > > > > specific > > > > corner case. > > > > > > > > > > Can't we just limit the scheduler's responsibility to telling the > > > driver that it has to flush, and if it doesn't it's a bug? > > > > > > > We still need to remove the pending jobs from the scheduler if > > flushing times out. > > > > Without timing out flushing and/or aborting when the process was > > killed we run into the same problem again that we don't want ti block > > on _fini(). > > > > > > > > > > > > > Blocking for the cleanup in drm_sched_fini() then becomes a > > > > trivial > > > > dma_fence_wait() on the remaining unsignaled HW fences and > > > > eventually > > > > on the latest killed fence. > > > > > > > > > > But that results in exactly the same situation as my waitque > > > solution, > > > doesn't it? > > > > > > > The key part is that dma_fence's have a documented requirement to > > never block infinitely. > > > > > > > > > > Blocking infinitely in drm_sched_fini(): > > > > > > If the driver doesn't guarantee that all fences will get signaled, > > > then > > > wait_event() in the waitque solution will block forever. The same > > > with > > > dma_fence_wait(). > > > > > > Or are you aiming at an interruptible dma_fence_wait(), thereby not > > > blocking SIGKILL? > > > > > > > That is basically what drm_sched_entity_flush() is already doing. > > > > > > > > > > That then would still result in leaks. So basically the same > > > situation > > > as with an interruptible drm_sched_flush() function. > > > > > > (Although I agree that would probably be more elegant) > > > > > > > If the wait_event really would just waiting for dma_fences then yes. > > > > The problem with the waitqueue approach is that we need to wait for > > the free_job callback and that callback is normally called from a > > work item without holding any additional locks. > > > > When drm_sched_fini starts to block for that we create a rat-tail of > > new dependencies since that one is most likely called from a file > > descriptor destruction. > > > > > > > > > > > > > > > > > > > > > > > > > The problem with this approach is that the idea of re-submitting > > > > jobs in a GPU reset by the scheduler is then basically dead. But > > > > to > > > > be honest that never worked correctly. > > > > > > > > See the deprecated warning I already put on > > > > drm_sched_resubmit_jobs(). The job lifetime is not really well > > > > defined because of this, see the free_guilty hack as well. > > > > > > > > > > drm_sched_resubmit_jobs() is being used by 5 drivers currently. So > > > if > > > we go for this approach we have to port them, first. That doesn't > > > sound > > > trivial to me > > > > > > > Yeah, completely agree. I think the scheduler should provide > > something like drm_sched_for_each_pending_fence() which helps to > > iterate over all the pending HW fences. > > > > The individual drivers could then device by themself what to do, > > e.g. upcast the HW fence into the job and push it again or similar. > > I have talked with Dave and we would be interested in exploring the > direction of getting rid of backend_ops.free_job() and doing the > modifications this implies. > Sorry again, catching up late. Got any details to dropping free_job? Gut reaction is this is fairly large subsystem rework, touching many drivers, and great way to regress stable software. > Matthew, Thomas, any hard NACKs on principle, or can we look into this > in a future patch series? > Definietly not NACK outright, but see my concerns above. If ends up not being all that invasive and good cleanup, not opposed. If it starts messing with existing reset / job cancel / queue teardown flows I would be concerned given how hard it is to get though correct. Matt > > P. > > > > > > > > > But what we really need to get away from are those fence pre- > > requisite violations Sima pointed out. For example we can't allocate > > memory for run_job. > > > > Regards, > > Christian. > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > Regards, > > > > Christian. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Grüße, > > > > > P. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >