Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 15.11.24 um 14:55 schrieb Philipp Stanner:
[SNIP]
A good bunch of the problems we have here are caused by abusing the
job 
as state for executing the submission on the hardware.

The original design idea of the scheduler instead was to only have
the 
job as intermediate state between submission and picking what to
execute 
next. E.g. the scheduler in that view is just a pipeline were you
push 
jobs in on one end and jobs come out on the other end.
So let's get a bit more precise about this:
   1. Driver enqueues with drm_sched_job_arm()
   2. job ends up in pending_list
   3. Sooner or later scheduler calls run_job()
   4. Job is pushed to hardware
   5. Fence gets signaled
   6. ???

What would the "come out on the other end" part you describe look like?

How would jobs get removed from pending_list and, accordingly, how
would we avoid leaks?

Let me describe the full solution a bit further down.

In that approach the object which represents the hardware execution
is 
the dma_fence instead of the job. And the dma_fence has a well
defined 
lifetime, error handling, etc...

So when we go back to this original approach it would mean that we
don't 
need to wait for any cleanup because the scheduler wouldn't be
involved 
in the execution of the jobs on the hardware any more.
It would be involved (to speak precisely) in the sense of the scheduler
still being the one who pushes the job onto the hardware, agreed?

Yes, exactly.

See in the original design the "job" wasn't even a defined structure, but rather just a void*.

So basically what the driver did was telling the scheduler here I have a bunch of different void* please tell me which one to run first.

And apart from being this identifier this void* had absolutely no meaning for the scheduler.

The worst thing that could happen is that the driver messes things up
and still has not executed job in an entity,
I can't fully follow.

So in your mind, the driver would personally set the dma_fence callback
and hereby be informed about the job being completed, right?

No. The run_job callback would still return the hw fence so that the scheduler can install the callback and so gets informed when the next job could be run.

But you wouldn't want to aim for getting rid of struct drm_sched_job
completely, or would you?

No, the drm_sched_job structure was added to aid the single producer single consumer queue and so made it easier to put the jobs into a container.


In my mind the full solution for running jobs looks like this:

1. Driver enqueues with drm_sched_job_arm()
2. job ends up in pending_list
3. Sooner or later scheduler calls run_job()
4. In return scheduler gets a dma_fence representing the resulting hardware operation.
5, This fence is put into a container to keep around what the hw is actually executing.
6. At some point the fence gets signaled informing the scheduler that the next job can be pushed if enough credits are now available.

There is no free_job callback any more because after run_job is called the scheduler is done with the job and only the dma_fence which represents the actually HW operation is the object the scheduler now works with.

We would still need something like a kill_job callback which is used when an entity is released without flushing all jobs (see drm_sched_entity_kill()), but that is then only used for a specific corner case.

Blocking for the cleanup in drm_sched_fini() then becomes a trivial dma_fence_wait() on the remaining unsignaled HW fences and eventually on the latest killed fence.

The problem with this approach is that the idea of re-submitting jobs in a GPU reset by the scheduler is then basically dead. But to be honest that never worked correctly.

See the deprecated warning I already put on drm_sched_resubmit_jobs(). The job lifetime is not really well defined because of this, see the free_guilty hack as well.

Regards,
Christian.



Grüße,
P.


[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux