On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
> On 15.11.24 at 14:55, Philipp Stanner wrote:
> > > [SNIP]
> > > 
> > > A good bunch of the problems we have here are caused by abusing
> > > the job as state for executing the submission on the hardware.
> > > 
> > > The original design idea of the scheduler instead was to only
> > > have the job as intermediate state between submission and
> > > picking what to execute next. E.g. the scheduler in that view is
> > > just a pipeline where you push jobs in on one end and jobs come
> > > out on the other end.
> > 
> > So let's get a bit more precise about this:
> > 1. Driver enqueues with drm_sched_job_arm()
> > 2. Job ends up in pending_list
> > 3. Sooner or later scheduler calls run_job()
> > 4. Job is pushed to hardware
> > 5. Fence gets signaled
> > 6. ???
> > 
> > What would the "come out on the other end" part you describe look
> > like?
> > 
> > How would jobs get removed from pending_list and, accordingly, how
> > would we avoid leaks?
> 
> Let me describe the full solution a bit further down.
> 
> > > In that approach the object which represents the hardware
> > > execution is the dma_fence instead of the job. And the dma_fence
> > > has a well defined lifetime, error handling, etc...
> > > 
> > > So when we go back to this original approach it would mean that
> > > we don't need to wait for any cleanup because the scheduler
> > > wouldn't be involved in the execution of the jobs on the
> > > hardware any more.
> > 
> > It would be involved (to speak precisely) in the sense of the
> > scheduler still being the one who pushes the job onto the
> > hardware, agreed?
> 
> Yes, exactly.
> 
> See, in the original design the "job" wasn't even a defined
> structure, but rather just a void*.
> 
> So basically what the driver did was telling the scheduler: here I
> have a bunch of different void*, please tell me which one to run
> first.
> 
> And apart from being this identifier, this void* had absolutely no
> meaning for the scheduler.

Interesting.

> > > The worst thing that could happen is that the driver messes
> > > things up and still has a not yet executed job in an entity,
> > 
> > I can't fully follow.
> > 
> > So in your mind, the driver itself would set the dma_fence
> > callback and thereby be informed about the job being completed,
> > right?
> 
> No. The run_job callback would still return the hw fence so that the
> scheduler can install the callback and so gets informed when the
> next job could be run.
> 
> > But you wouldn't want to aim for getting rid of struct
> > drm_sched_job completely, or would you?
> 
> No, the drm_sched_job structure was added to aid the single-producer
> single-consumer queue and so made it easier to put the jobs into a
> container.
> 
> In my mind the full solution for running jobs looks like this:
> 
> 1. Driver enqueues with drm_sched_job_arm()
> 2. Job ends up in pending_list
> 3. Sooner or later scheduler calls run_job()
> 4. In return the scheduler gets a dma_fence representing the
>    resulting hardware operation.
> 5. This fence is put into a container to keep around what the HW is
>    actually executing.
> 6. At some point the fence gets signaled, informing the scheduler
>    that the next job can be pushed if enough credits are now
>    available.
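
If I'm reading this right, the core of that pipeline would look
roughly like the sketch below. Just to make sure we're talking about
the same thing: the container structs and everything prefixed with
"hypothetical_" are made up by me, not existing scheduler code, and
error handling is omitted.

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/slab.h>

/* Invented container, *not* part of struct drm_gpu_scheduler. */
struct hypothetical_hw_fence_list {
        struct drm_gpu_scheduler        *sched;
        struct list_head                fences;
};

struct hypothetical_hw_fence_entry {
        struct list_head                node;
        struct dma_fence                *fence; /* what run_job() returned */
        struct dma_fence_cb             cb;
        struct hypothetical_hw_fence_list *list;
        u32                             credits;
};

/* Stand-ins; no such helpers exist in the real scheduler. */
static void hypothetical_return_credits(struct drm_gpu_scheduler *sched,
                                        u32 credits) { }
static void hypothetical_wake_up_scheduler(struct drm_gpu_scheduler *sched) { }

static void hypothetical_hw_fence_signaled(struct dma_fence *f,
                                           struct dma_fence_cb *cb)
{
        struct hypothetical_hw_fence_entry *e =
                container_of(cb, struct hypothetical_hw_fence_entry, cb);

        /*
         * 6. HW is done with this operation: give the credits back and
         *    check whether the next job can be pushed. Removing and
         *    freeing the entry would happen from work context.
         */
        hypothetical_return_credits(e->list->sched, e->credits);
        hypothetical_wake_up_scheduler(e->list->sched);
}

/* 3.-5.: push one job, from then on only track the returned fence. */
static void hypothetical_run_one_job(struct hypothetical_hw_fence_list *l,
                                     struct drm_sched_job *job)
{
        struct hypothetical_hw_fence_entry *e = kzalloc(sizeof(*e), GFP_KERNEL);
        struct dma_fence *hw_fence;

        /* 3./4. Push to HW; the returned fence now represents the HW op. */
        hw_fence = l->sched->ops->run_job(job);

        /* 5. Keep only the fence around; there is no free_job() anymore. */
        e->fence   = dma_fence_get(hw_fence);
        e->list    = l;
        e->credits = job->credits;
        list_add_tail(&e->node, &l->fences);

        /* Returns -ENOENT if the fence already signaled; ignored here. */
        dma_fence_add_callback(hw_fence, &e->cb, hypothetical_hw_fence_signaled);
}
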
> There is no free_job callback any more because after run_job is
> called the scheduler is done with the job and only the dma_fence
> which represents the actual HW operation is the object the scheduler
> now works with.
> 
> We would still need something like a kill_job callback which is used
> when an entity is released without flushing all jobs (see
> drm_sched_entity_kill()), but that is then only used for a specific
> corner case.

Can't we just limit the scheduler's responsibility to telling the
driver that it has to flush, and if it doesn't, it's a bug?

> Blocking for the cleanup in drm_sched_fini() then becomes a trivial
> dma_fence_wait() on the remaining unsignaled HW fences and
> eventually on the latest killed fence.

But that results in exactly the same situation as my waitque solution,
doesn't it? Blocking infinitely in drm_sched_fini(): if the driver
doesn't guarantee that all fences will get signaled, then wait_event()
in the waitque solution will block forever. The same with
dma_fence_wait() (see the sketch in the P.S. at the end of this mail).

Or are you aiming at an interruptible dma_fence_wait(), thereby not
blocking SIGKILL? That would then still result in leaks. So basically
the same situation as with an interruptible drm_sched_flush()
function. (Although I agree that would probably be more elegant.)

> The problem with this approach is that the idea of re-submitting
> jobs in a GPU reset by the scheduler is then basically dead. But to
> be honest, that never worked correctly.
> 
> See the deprecated warning I already put on
> drm_sched_resubmit_jobs(). The job lifetime is not really well
> defined because of this, see the free_guilty hack as well.

drm_sched_resubmit_jobs() is being used by 5 drivers currently. So if
we go for this approach we have to port them first. That doesn't
sound trivial to me.

P.

> 
> Regards,
> Christian.
> 
> > Regards,
> > P.
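
P.S.: A rough sketch of the drm_sched_fini() blocking I'm worried
about above, reusing the invented fence container from the sketch
earlier in this mail (again, none of this is real scheduler code):

static void hypothetical_sched_fini_wait(struct hypothetical_hw_fence_list *l)
{
        struct hypothetical_hw_fence_entry *e;

        list_for_each_entry(e, &l->fences, node) {
                /*
                 * Non-interruptible: if the driver never signals this
                 * fence, we block here forever -- exactly like
                 * wait_event() in the waitque approach.
                 */
                dma_fence_wait(e->fence, false);

                /*
                 * Interruptible variant: dma_fence_wait(e->fence, true)
                 * returns -ERESTARTSYS on a signal, but then we'd tear
                 * down with unsignaled fences still around and leak
                 * their entries.
                 */
        }
}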