On Fri, Mar 07, 2025 at 10:37:04AM +0100, Philipp Stanner wrote:
> On Thu, 2025-03-06 at 12:57 -0800, Matthew Brost wrote:
> > On Wed, Mar 05, 2025 at 02:05:52PM +0100, Philipp Stanner wrote:
> > > drm_sched_backend_ops.timedout_job()'s documentation is outdated. It
> > > mentions the deprecated function drm_sched_resubmit_jobs(). Furthermore,
> > > it does not point out the important distinction between hardware and
> > > firmware schedulers.
> > >
> > > Since firmware schedulers typically only use one entity per scheduler,
> > > timeout handling is significantly simpler, because the entity the
> > > faulted job came from can just be killed without affecting innocent
> > > processes.
> > >
> > > Update the documentation with that distinction and other details.
> > >
> > > Reformat the docstring to a unified style with the other handles.
> > >
> >
> > Looks really good, one suggestion.
>
> Already merged. But I'm already working on the TODO and could address
> your feedback in that follow-up.
>
> Of course, it would also be great if you could provide a proposal in a
> patch? :)
>

I can post something. Let me try to get something out today.

Matt

> > > Signed-off-by: Philipp Stanner <phasta@xxxxxxxxxx>
> > > ---
> > >  include/drm/gpu_scheduler.h | 78 ++++++++++++++++++++++---------------
> > >  1 file changed, 47 insertions(+), 31 deletions(-)
> > >
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index 6381baae8024..1a7e377d4cbb 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -383,8 +383,15 @@ struct drm_sched_job {
> > >   struct xarray dependencies;
> > >  };
> > >
> > > +/**
> > > + * enum drm_gpu_sched_stat - the scheduler's status
> > > + *
> > > + * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> > > + * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> > > + * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> > > + */
> > >  enum drm_gpu_sched_stat {
> > > - DRM_GPU_SCHED_STAT_NONE, /* Reserve 0 */
> > > + DRM_GPU_SCHED_STAT_NONE,
> > >   DRM_GPU_SCHED_STAT_NOMINAL,
> > >   DRM_GPU_SCHED_STAT_ENODEV,
> > >  };
> > > @@ -447,43 +454,52 @@ struct drm_sched_backend_ops {
> > >   * @timedout_job: Called when a job has taken too long to execute,
> > >   * to trigger GPU recovery.
> > >   *
> > > - * This method is called in a workqueue context.
> > > + * @sched_job: The job that has timed out
> > >   *
> > > - * Drivers typically issue a reset to recover from GPU hangs, and this
> > > - * procedure usually follows the following workflow:
> > > + * Drivers typically issue a reset to recover from GPU hangs.
> > > + * This procedure looks very different depending on whether a firmware
> > > + * or a hardware scheduler is being used.
> > >   *
> > > - * 1. Stop the scheduler using drm_sched_stop(). This will park the
> > > - *    scheduler thread and cancel the timeout work, guaranteeing that
> > > - *    nothing is queued while we reset the hardware queue
> > > - * 2. Try to gracefully stop non-faulty jobs (optional)
> > > - * 3. Issue a GPU reset (driver-specific)
> > > - * 4. Re-submit jobs using drm_sched_resubmit_jobs()
> > > - * 5. Restart the scheduler using drm_sched_start(). At that point, new
> > > - *    jobs can be queued, and the scheduler thread is unblocked
> > > + * For a FIRMWARE SCHEDULER, each ring has one scheduler, and each
> > > + * scheduler has one entity. Hence, the steps taken typically look as
> > > + * follows:
> > > + *
> > > + * 1. Stop the scheduler using drm_sched_stop(). This will pause the
> > > + *    scheduler workqueues and cancel the timeout work, guaranteeing
> > > + *    that nothing is queued while the ring is being removed.
> > > + * 2. Remove the ring. The firmware will make sure that the
> > > + *    corresponding parts of the hardware are reset, and that other
> > > + *    rings are not impacted.
> > > + * 3. Kill the entity and the associated scheduler.
> >
> > Xe doesn't do step 3.
> >
> > It does:
> > - Ban the entity / scheduler so future submissions are a NOP. This would
> >   be submissions with unmet dependencies. Submissions at the IOCTL level
> >   are disallowed.
> > - Signal all jobs' fences on the pending list.
> > - Restart the scheduler so free_job() is naturally called.
> >
> > I'm unsure if this is how other firmware schedulers do this, but it seems
> > to work quite well in Xe.
>
> Alright, so if I interpret this correctly, you do that to avoid our
> infamous memory leaks. That makes sense.
>
> The memory leaks are documented in drm_sched_fini()'s documentation, but
> it could make sense to mention them here, too.
>
> … thinking about it, we probably actually have to rephrase this line.
> Just tearing down entity & sched makes those leaks very likely. Argh.
>
> Nouveau, also a firmware scheduler, effectively keeps a copy of the
> pending_list and also ensures that all fences get signalled. Only once
> that copy of the pending list is empty does it call into drm_sched_fini().
> Take a look at nouveau_sched.c if you want, the code is quite
> straightforward.
>
> P.
>
> > Matt
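
For the follow-up, what I have in mind for the firmware-scheduler case is
basically the Xe flow above. Very rough, untested sketch: the myfw_*
helpers are made-up placeholders for driver code, only the drm_sched_*()
calls are real API, and drm_sched_start()'s signature differs between
kernel versions:

static enum drm_gpu_sched_stat
myfw_timedout_job(struct drm_sched_job *sched_job)
{
	struct drm_gpu_scheduler *sched = sched_job->sched;
	struct myfw_queue *q = to_myfw_queue(sched);

	/* Park the scheduler's workqueues and cancel the timeout work. */
	drm_sched_stop(sched, sched_job);

	/*
	 * Ban the queue so future submissions become a NOP: the submission
	 * IOCTL rejects it, and queued jobs whose dependencies resolve
	 * later do nothing.
	 */
	myfw_queue_ban(q);

	/* Have the firmware tear down this ring; other rings are untouched. */
	myfw_queue_kill_ring(q);

	/*
	 * Force-signal the hardware fence of every job still on the pending
	 * list with an error, so that no waiter blocks forever.
	 */
	myfw_queue_signal_pending(q, -ECANCELED);

	/*
	 * Restart the scheduler: the signalled jobs now retire through
	 * free_job() naturally, which is what avoids the memory leaks
	 * mentioned above.
	 */
	drm_sched_start(sched, 0);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}

If nouveau effectively does the same via its pending_list copy, the doc's
step 3 could be reworded so that killing entity / scheduler only happens
once all pending fences have been signalled.
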
> > > + *
> > > + *
> > > + * For a HARDWARE SCHEDULER, a scheduler instance schedules jobs from
> > > + * one or more entities to one ring. This implies that all entities
> > > + * associated with the affected scheduler cannot be torn down, because
> > > + * this would effectively also affect innocent userspace processes which
> > > + * did not submit faulty jobs (for example).
> > > + *
> > > + * Consequently, the procedure to recover with a hardware scheduler
> > > + * should look like this:
> > > + *
> > > + * 1. Stop all schedulers impacted by the reset using drm_sched_stop().
> > > + * 2. Kill the entity the faulty job stems from.
> > > + * 3. Issue a GPU reset on all faulty rings (driver-specific).
> > > + * 4. Re-submit jobs on all schedulers impacted by the reset by
> > > + *    re-submitting them to the entities which are still alive.
> > > + * 5. Restart all schedulers that were stopped in step #1 using
> > > + *    drm_sched_start().
> > >   *
> > >   * Note that some GPUs have distinct hardware queues but need to reset
> > >   * the GPU globally, which requires extra synchronization between the
> > > - * timeout handler of the different &drm_gpu_scheduler. One way to
> > > - * achieve this synchronization is to create an ordered workqueue
> > > - * (using alloc_ordered_workqueue()) at the driver level, and pass this
> > > - * queue to drm_sched_init(), to guarantee that timeout handlers are
> > > - * executed sequentially. The above workflow needs to be slightly
> > > - * adjusted in that case:
> > > + * timeout handlers of different schedulers. One way to achieve this
> > > + * synchronization is to create an ordered workqueue (using
> > > + * alloc_ordered_workqueue()) at the driver level, and pass this queue
> > > + * as drm_sched_init()'s @timeout_wq parameter. This will guarantee
> > > + * that timeout handlers are executed sequentially.
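
As an aside, wiring that up is just a few lines at init time. Untested
sketch with error unwinding omitted; the my_* / MY_* names are
placeholders, and drm_sched_init()'s signature also differs between kernel
versions (recent kernels pass a struct drm_sched_init_args instead of this
long parameter list):

int my_scheds_init(struct my_device *mydev)
{
	struct workqueue_struct *tdr_wq;
	unsigned int i;
	int ret;

	/*
	 * One ordered workqueue shared by all schedulers of the device,
	 * so that their timeout handlers can never run concurrently.
	 */
	tdr_wq = alloc_ordered_workqueue("my-driver-tdr", 0);
	if (!tdr_wq)
		return -ENOMEM;

	for (i = 0; i < MY_NUM_RINGS; i++) {
		ret = drm_sched_init(&mydev->ring[i].sched, &my_sched_ops,
				     NULL, /* default submit workqueue */
				     DRM_SCHED_PRIORITY_COUNT,
				     MY_CREDIT_LIMIT, MY_HANG_LIMIT,
				     msecs_to_jiffies(MY_TIMEOUT_MS),
				     tdr_wq, /* shared @timeout_wq */
				     NULL, "my-ring", mydev->dev);
		if (ret)
			return ret;
	}

	return 0;
}
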
> > >   *
> > > - * 1. Stop all schedulers impacted by the reset using drm_sched_stop()
> > > - * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
> > > - *    the reset (optional)
> > > - * 3. Issue a GPU reset on all faulty queues (driver-specific)
> > > - * 4. Re-submit jobs on all schedulers impacted by the reset using
> > > - *    drm_sched_resubmit_jobs()
> > > - * 5. Restart all schedulers that were stopped in step #1 using
> > > - *    drm_sched_start()
> > > + * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> > >   *
> > > - * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
> > > - * and the underlying driver has started or completed recovery.
> > > - *
> > > - * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> > > - * available, i.e. has been unplugged.
> > >   */
> > >   enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
> > >
> > > --
> > > 2.48.1
> > >
> >
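
And regarding the drm_sched_fini() leaks discussed above, the
nouveau-style teardown Philipp describes is roughly the following
(simplified, untested sketch with made-up my_* names; nouveau_sched.c is
the real reference). The driver keeps its own list of unfinished jobs and
only tears the scheduler down once that list has drained:

/* Called once a job's fence has signalled, e.g. from free_job(). */
static void my_job_retire(struct my_job *job)
{
	struct my_sched *sched = job->sched;

	spin_lock(&sched->job_lock);
	list_del(&job->entry);
	spin_unlock(&sched->job_lock);

	wake_up(&sched->job_wq);
}

static void my_sched_fini(struct my_sched *sched)
{
	/*
	 * Every fence has either signalled normally or been force-signalled
	 * with an error by the timeout handler, so the list is guaranteed
	 * to drain eventually.
	 */
	wait_event(sched->job_wq, list_empty(&sched->job_list));

	/* Now entity and scheduler can be torn down without leaking jobs. */
	drm_sched_entity_fini(&sched->entity);
	drm_sched_fini(&sched->base);
}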