Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated

Hi Boris,

On 02.05.23 at 13:19, Boris Brezillon wrote:
Hello Christian, Alex,

As part of our transition to drm_sched for the powervr GPU driver, we
realized that drm_sched_resubmit_jobs(), which is used by all drivers
relying on drm_sched except amdgpu, has been deprecated.
Unfortunately, commit 5efbe6aa7a0e ("drm/scheduler: deprecate
drm_sched_resubmit_jobs") doesn't describe what drivers should do or use
as an alternative.

At the very least, for our implementation, we need to restore the
drm_sched_job::parent pointers that were set to NULL in
drm_sched_stop(), such that jobs submitted before the GPU recovery are
considered active when drm_sched_start() is called. That could be done
with a custom pending_list iteration restoring the drm_sched_job::parent
pointers, but it seems odd to let the scheduler backend manipulate this
list directly, and I suspect we need to do other checks, like the karma
vs hang-limit comparison, so we can flag the entity guilty and cancel
all jobs queued to it if the entity has caused too many hangs.
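
For reference, a minimal sketch of such an iteration, essentially
mirroring what the deprecated drm_sched_resubmit_jobs() body did
(locking against a concurrently running scheduler and the karma/guilty
handling are elided; note the parent pointer actually lives on the
job's s_fence):

	struct drm_sched_job *s_job, *tmp;
	struct dma_fence *fence;

	/* Assumes drm_sched_stop() has been called, so the scheduler
	 * no longer touches pending_list concurrently. */
	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
		struct drm_sched_fence *s_fence = s_job->s_fence;

		/* Re-submit the job to the hw to get a fresh hw fence... */
		fence = sched->ops->run_job(s_job);

		if (IS_ERR_OR_NULL(fence)) {
			if (IS_ERR(fence))
				dma_fence_set_error(&s_fence->finished,
						    PTR_ERR(fence));
			s_fence->parent = NULL;
		} else {
			/* ...and restore it as the parent so that
			 * drm_sched_start() treats the job as active. */
			s_fence->parent = dma_fence_get(fence);
			/* Drop the reference run_job() returned. */
			dma_fence_put(fence);
		}
	}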

Now that drm_sched_resubmit_jobs() has been deprecated, it would be
great if you could help us write a piece of documentation describing
what should be done between drm_sched_stop() and drm_sched_start(), so
that new drivers don't come up with their own slightly different/broken
version of the same thing.

Yeah, really good point! The solution is to not use drm_sched_stop() and drm_sched_start() either.

The general idea Daniel, the other Intel guys, and I seem to have agreed on is to convert the scheduler thread into a work item.

This work item for pushing jobs to the hw can then be queued to the same workqueue we use for the timeout work item.

If your driver now configures this workqueue as single-threaded (i.e. an ordered workqueue), you have a guarantee that at most one of the scheduler work item and the timeout work item is running at any given time. That in turn makes starting/stopping the scheduler for a reset completely superfluous.
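
For example (a minimal sketch; the names mydev and my_sched_ops and the
numbers are made up, and the conversion of the job-pushing thread into a
work item is the part that is still WIP), with the current
drm_sched_init() you can already hand such an ordered workqueue to the
scheduler as its timeout_wq:

	struct workqueue_struct *wq;
	int ret;

	/* An ordered workqueue executes at most one work item at a
	 * time, so everything queued on it is implicitly serialized. */
	wq = alloc_ordered_workqueue("my-gpu-sched", 0);
	if (!wq)
		return -ENOMEM;

	ret = drm_sched_init(&mydev->sched, &my_sched_ops,
			     64,			/* hw_submission */
			     3,				/* hang_limit */
			     msecs_to_jiffies(500),	/* timeout */
			     wq,			/* timeout_wq */
			     NULL,			/* score */
			     "my-gpu", mydev->dev);

Once the scheduler's job-pushing work item lands and is queued on the
same workqueue, it can by construction never run concurrently with the
timeout handler.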

Patches for this have already been floating around on the mailing list, but haven't been committed yet, since this is all still WIP.

In general it's not a good idea to change the scheduler and hw fences during GPU reset/recovery. The dma_fence implementation has a pretty strict state transition model which clearly says that a dma_fence must never go back from signaled to unsignaled, and when you start messing with the fences during a reset, that is exactly what might happen.

What you can do instead is save your hw state and restart from the same location after handling the timeout.
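
A timedout_job handler following that approach could look roughly like
this (a sketch only; the my_hw_*() helpers stand in for whatever state
save/restore mechanism your hardware provides):

	static enum drm_gpu_sched_stat
	my_timedout_job(struct drm_sched_job *sched_job)
	{
		struct my_device *mydev = to_my_device(sched_job->sched);

		my_hw_save_state(mydev);	/* snapshot ring/context state */
		my_hw_reset(mydev);		/* recover the hung engine */
		my_hw_resume(mydev);		/* replay from the saved
						 * position; the existing hw
						 * fences are left alone and
						 * signal once the replayed
						 * work completes */

		return DRM_GPU_SCHED_STAT_NOMINAL;
	}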

Regards,
Christian.


Thanks in advance for your help.

Regards,

Boris



