+CC Thomas Hellström

On Mon, 2024-11-18 at 12:48 +0100, Christian König wrote:
> On 15.11.24 at 17:07, Philipp Stanner wrote:
> > On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
> > > On 15.11.24 at 14:55, Philipp Stanner wrote:
> > > > [SNIP]
> > > > 
> > > > But you wouldn't want to aim for getting rid of struct
> > > > drm_sched_job completely, or would you?
> > > 
> > > No, the drm_sched_job structure was added to aid the single
> > > producer single consumer queue and so made it easier to put the
> > > jobs into a container.
> > > 
> > > In my mind the full solution for running jobs looks like this:
> > > 
> > > 1. Driver enqueues with drm_sched_job_arm()
> > > 2. Job ends up in the pending_list
> > > 3. Sooner or later the scheduler calls run_job()
> > > 4. In return the scheduler gets a dma_fence representing the
> > > resulting hardware operation.
> > > 5. This fence is put into a container to keep around what the hw
> > > is actually executing.
> > > 6. At some point the fence gets signaled, informing the scheduler
> > > that the next job can be pushed if enough credits are now
> > > available.
> > > 
> > > There is no free_job callback any more because after run_job is
> > > called the scheduler is done with the job, and only the dma_fence
> > > which represents the actual HW operation is the object the
> > > scheduler now works with.
> > > 
> > > We would still need something like a kill_job callback which is
> > > used when an entity is released without flushing all jobs (see
> > > drm_sched_entity_kill()), but that is then only used for a
> > > specific corner case.
> > 
> > Can't we just limit the scheduler's responsibility to telling the
> > driver that it has to flush, and if it doesn't it's a bug?
> 
> We still need to remove the pending jobs from the scheduler if
> flushing times out.
> 
> Without timing out flushing and/or aborting when the process was
> killed we run into the same problem again that we don't want to block
> on _fini().
> 
> > > Blocking for the cleanup in drm_sched_fini() then becomes a
> > > trivial dma_fence_wait() on the remaining unsignaled HW fences
> > > and eventually on the latest killed fence.
> > 
> > But that results in exactly the same situation as my waitqueue
> > solution, doesn't it?
> 
> The key part is that dma_fences have a documented requirement to
> never block infinitely.
> 
> > Blocking infinitely in drm_sched_fini():
> > 
> > If the driver doesn't guarantee that all fences will get signaled,
> > then wait_event() in the waitqueue solution will block forever. The
> > same with dma_fence_wait().
> > 
> > Or are you aiming at an interruptible dma_fence_wait(), thereby not
> > blocking SIGKILL?
> 
> That is basically what drm_sched_entity_flush() is already doing.
> 
> > That then would still result in leaks. So basically the same
> > situation as with an interruptible drm_sched_flush() function.
> > 
> > (Although I agree that would probably be more elegant)
> 
> If the wait_event really would just be waiting for dma_fences then
> yes.
> 
> The problem with the waitqueue approach is that we need to wait for
> the free_job callback, and that callback is normally called from a
> work item without holding any additional locks.
> 
> When drm_sched_fini starts to block for that we create a whole chain
> of new dependencies, since that one is most likely called from a file
> descriptor destruction.
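Just to check that I'm picturing the same end state: with free_job()
gone, I read the above as drm_sched_fini() essentially walking whatever
container holds the hardware fences from step 5 and waiting on each of
them. A minimal sketch of that idea follows; the drm_sched_hw_fence
wrapper and the hw_fences list are invented purely for illustration,
they are not existing API:

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/slab.h>

/* Invented wrapper, standing in for the container from step 5 above. */
struct drm_sched_hw_fence {
        struct list_head node;
        struct dma_fence *fence;        /* what run_job() returned */
};

static void drm_sched_fini_wait_sketch(struct list_head *hw_fences)
{
        struct drm_sched_hw_fence *entry, *tmp;

        list_for_each_entry_safe(entry, tmp, hw_fences, node) {
                /*
                 * Bounded by the dma_fence contract that every fence
                 * signals in finite time, so this cannot block forever.
                 */
                dma_fence_wait(entry->fence, false);

                dma_fence_put(entry->fence);
                list_del(&entry->node);
                kfree(entry);
        }
}

If that is roughly the shape of it, then I agree that the
finite-signaling guarantee of dma_fence is what makes blocking in
_fini() acceptable, in a way that waiting for free_job() work items is
not.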
> > > The problem with this approach is that the idea of re-submitting
> > > jobs in a GPU reset by the scheduler is then basically dead. But
> > > to be honest that never worked correctly.
> > > 
> > > See the deprecated warning I already put on
> > > drm_sched_resubmit_jobs(). The job lifetime is not really well
> > > defined because of this, see the free_guilty hack as well.
> > 
> > drm_sched_resubmit_jobs() is being used by 5 drivers currently. So
> > if we go for this approach we have to port them first. That doesn't
> > sound trivial to me.
> 
> Yeah, completely agree. I think the scheduler should provide
> something like drm_sched_for_each_pending_fence() which helps to
> iterate over all the pending HW fences.
> 
> The individual drivers could then decide by themselves what to do,
> e.g. upcast the HW fence into the job and push it again or similar.

I have talked with Dave and we would be interested in exploring the
direction of getting rid of backend_ops.free_job() and doing the
modifications this implies.

Matthew, Thomas, any hard NACKs on principle, or can we look into this
in a future patch series?

P.

> But what we really need to get away from are those fence prerequisite
> violations Sima pointed out. For example we can't allocate memory for
> run_job.
> 
> Regards,
> Christian.
> 
> > P.
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > Regards,
> > > > P.
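P.S.: Purely as a strawman for the drm_sched_for_each_pending_fence()
idea above (the iterator name, its signature and all driver-side types
below are invented), I would picture per-driver resubmission after a
reset looking roughly like this:

#include <linux/container_of.h>
#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

struct my_job;                                          /* invented */
static void my_driver_run_job_again(struct my_job *job);/* invented */

/* Invented driver fence that embeds the dma_fence and remembers its job. */
struct my_hw_fence {
        struct dma_fence base;
        struct my_job *job;
};

static void my_driver_resubmit_sketch(struct drm_gpu_scheduler *sched)
{
        struct dma_fence *fence;

        /* Hypothetical iterator over the scheduler's pending HW fences. */
        drm_sched_for_each_pending_fence(fence, sched) {
                struct my_hw_fence *f =
                        container_of(fence, struct my_hw_fence, base);

                /* The driver decides what to do, e.g. push the job again. */
                my_driver_run_job_again(f->job);
        }
}

That would keep the "upcast and push again" policy entirely in the
driver, with the scheduler only exposing the pending fences.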