Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Christian König <christian.koenig@xxxxxxx> · Mon, 18 Nov 2024 12:48:10 +0100

    Am 15.11.24 um 17:07 schrieb Philipp Stanner:

      On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:

         Am 15.11.24 um 14:55 schrieb Philipp Stanner:

          [SNIP] 

          But you wouldn't want to aim for getting rid of struct
drm_sched_job
completely, or would you?

 No, the drm_sched_job structure was added to aid the single producer
single consumer queue and so made it easier to put the jobs into a
container.

 In my mind the full solution for running jobs looks like this:

 1. Driver enqueues with drm_sched_job_arm()
 2. job ends up in pending_list
 3. Sooner or later scheduler calls run_job()
 4. In return scheduler gets a dma_fence representing the resulting
hardware operation.
 5, This fence is put into a container to keep around what the hw is
actually executing.
 6. At some point the fence gets signaled informing the scheduler
that the next job can be pushed if enough credits are now available.

 There is no free_job callback any more because after run_job is
called the scheduler is done with the job and only the dma_fence
which represents the actually HW operation is the object the
scheduler now works with.

 We would still need something like a kill_job callback which is used
when an entity is released without flushing all jobs (see
drm_sched_entity_kill()), but that is then only used for a specific
corner case.

      Can't we just limit the scheduler's responsibility to telling the
driver that it has to flush, and if it doesn't it's a bug?

    We still need to remove the pending jobs from the scheduler if
    flushing times out.

    Without timing out flushing and/or aborting when the process was
    killed we run into the same problem again that we don't want ti
    block on _fini().

         Blocking for the cleanup in drm_sched_fini() then becomes a trivial
dma_fence_wait() on the remaining unsignaled HW fences and eventually
on the latest killed fence.

      But that results in exactly the same situation as my waitque solution,
doesn't it?

    The key part is that dma_fence's have a documented requirement to
    never block infinitely.

      Blocking infinitely in drm_sched_fini():

If the driver doesn't guarantee that all fences will get signaled, then
wait_event() in the waitque solution will block forever. The same with
dma_fence_wait().

Or are you aiming at an interruptible dma_fence_wait(), thereby not
blocking SIGKILL?

    That is basically what drm_sched_entity_flush() is already doing.

      That then would still result in leaks. So basically the same situation
as with an interruptible drm_sched_flush() function.

(Although I agree that would probably be more elegant)

    If the wait_event really would just waiting for dma_fences then yes.

    The problem with the waitqueue approach is that we need to wait for
    the free_job callback and that callback is normally called from a
    work item without holding any additional locks.

    When drm_sched_fini starts to block for that we create a rat-tail of
    new dependencies since that one is most likely called from a file
    descriptor destruction.

 The problem with this approach is that the idea of re-submitting
jobs in a GPU reset by the scheduler is then basically dead. But to
be honest that never worked correctly.

 See the deprecated warning I already put on
drm_sched_resubmit_jobs(). The job lifetime is not really well
defined because of this, see the free_guilty hack as well.

      drm_sched_resubmit_jobs() is being used by 5 drivers currently. So if
we go for this approach we have to port them, first. That doesn't sound
trivial to me

    Yeah, completely agree. I think the scheduler should provide
    something like drm_sched_for_each_pending_fence() which helps to
    iterate over all the pending HW fences.

    The individual drivers could then device by themself what to do,
    e.g. upcast the HW fence into the job and push it again or similar.

    But what we really need to get away from are those fence
    pre-requisite violations Sima pointed out. For example we can't
    allocate memory for run_job.

    Regards,

    Christian.

P.

 Regards,
 Christian.

Grüße,
P.