On Mon, 2025-01-13 at 12:46 +0100, Danilo Krummrich wrote:
> On Thu, Jan 09, 2025 at 02:37:12PM +0100, Philipp Stanner wrote:
> > drm_sched_backend_ops.timedout_job()'s documentation is outdated. It
> > mentions the deprecated function drm_sched_resubmit_job(). Furthermore,
> > it does not point out the important distinction between hardware and
> > firmware schedulers.
> > 
> > Since firmware schedulers typically only use one entity per scheduler,
> > timeout handling is significantly simpler because the entity the
> > faulted job came from can just be killed without affecting innocent
> > processes.
> > 
> > Update the documentation with that distinction and other details.
> > 
> > Reformat the docstring to conform to a unified style with the other
> > hooks.
> > 
> > Signed-off-by: Philipp Stanner <phasta@xxxxxxxxxx>
> > ---
> >  include/drm/gpu_scheduler.h | 83 +++++++++++++++++++++++--------------
> >  1 file changed, 52 insertions(+), 31 deletions(-)
> > 
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index c4e65f9f7f22..380b8840c591 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -445,43 +445,64 @@ struct drm_sched_backend_ops {
> >  	 * @timedout_job: Called when a job has taken too long to execute,
> >  	 * to trigger GPU recovery.
> >  	 *
> > -	 * This method is called in a workqueue context.
> > +	 * @sched_job: The job that has timed out
> >  	 *
> > -	 * Drivers typically issue a reset to recover from GPU hangs, and this
> > -	 * procedure usually follows the following workflow:
> > +	 * Returns:
> > +	 * - DRM_GPU_SCHED_STAT_NOMINAL, on success, i.e., if the underlying
> > +	 *   driver has started or completed recovery.
> > +	 * - DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> > +	 *   available, i.e., has been unplugged.
> 
> Maybe it'd be better to document this at the enum level and add a link.
> 
> >  	 *
> > -	 * 1. Stop the scheduler using drm_sched_stop(). This will park the
> > -	 *    scheduler thread and cancel the timeout work, guaranteeing that
> > -	 *    nothing is queued while we reset the hardware queue
> > -	 * 2. Try to gracefully stop non-faulty jobs (optional)
> > -	 * 3. Issue a GPU reset (driver-specific)
> > -	 * 4. Re-submit jobs using drm_sched_resubmit_jobs()
> > +	 * Drivers typically issue a reset to recover from GPU hangs.
> > +	 * This procedure looks very different depending on whether a firmware
> > +	 * or a hardware scheduler is being used.
> > +	 *
> > +	 * For a FIRMWARE SCHEDULER, each (pseudo-)ring has one scheduler, and
> > +	 * each scheduler (typically) has one entity. Hence, you typically
> 
> I think you can remove the first "typically" here. I don't think that for a
> firmware scheduler we ever have something else than a 1:1 relation, besides
> that having something else than a 1:1 relation (whatever that would be)
> would likely be a contradiction to the statement above.
> 
> > +	 * follow those steps:
> > +	 *
> > +	 * 1. Stop the scheduler using drm_sched_stop(). This will pause the
> > +	 *    scheduler workqueues and cancel the timeout work, guaranteeing
> > +	 *    that nothing is queued while we reset the hardware queue.
> 
> Rather reset / remove the software queue / ring.
> 
> (Besides: I think we should also define a distinct terminology, sometimes
> "queue" refers to the ring buffer, sometimes it refers to the entity, etc.
> At least we should be consistent within this commit, and then fix the
> rest.)

Very good idea!

How about we, from now on, always just call it "ring" or "hardware ring"? I
think queue is very generic and, as you point out, often used for the
entities and stuff like that.

> 
> > +	 * 2. Try to gracefully stop non-faulty jobs (optional).
> > +	 * TODO: RFC ^ Folks, should we remove this step? What does it even
> > +	 * mean precisely to "stop" those jobs? Is that even helpful to
> > +	 * userspace in any way?
> 
> I think this means to prevent unrelated queues / jobs from being impacted
> by the subsequent GPU reset.
> 
> So, this is likely not applicable here, see below.
> 
> > +	 * 3. Issue a GPU reset (driver-specific).
> 
> I'm not entirely sure it really applies to all GPUs that feature a FW
> scheduler, but I'd expect that the FW takes care of resetting the
> corresponding HW (including preventing impact on other FW rings) if the
> faulty FW ring is removed by the driver.

@Christian: Agree? AMD is still purely a HW scheduler afaik.

> 
> > +	 * 4. Kill the entity the faulted job stems from, and the associated
> > +	 *    scheduler.
> >  	 * 5. Restart the scheduler using drm_sched_start(). At that point, new
> > -	 *    jobs can be queued, and the scheduler thread is unblocked
> > +	 *    jobs can be queued, and the scheduler workqueues wake up again.
> 
> How can we start the scheduler again after we've killed it? I think the
> most likely scenario is that the FW ring (including the scheduler
> structures) is removed entirely and simply re-created. So, I think we can
> probably remove 5.

ACK
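
To make the FW scheduler case concrete, here is roughly how I picture such
a timeout handler for a made-up driver "foo". All foo_* names (and the
foo_ring / foo_device structures) are invented, and whether the scheduler
teardown can really happen from within its own timeout handler or has to
be deferred needs care:

        #include <drm/drm_drv.h>
        #include <drm/gpu_scheduler.h>

        /* One scheduler with a single entity per firmware ring. */
        static enum drm_gpu_sched_stat
        foo_timedout_job(struct drm_sched_job *sched_job)
        {
                struct foo_ring *ring = to_foo_ring(sched_job->sched);

                if (drm_dev_is_unplugged(ring->drm))
                        return DRM_GPU_SCHED_STAT_ENODEV;

                /* Step 1: park the scheduler; nothing new reaches the FW ring. */
                drm_sched_stop(sched_job->sched, sched_job);

                /*
                 * Steps 2 + 3: removing the faulty firmware ring makes the
                 * firmware perform the actual reset and shield the other
                 * rings from it, as you describe above.
                 */
                foo_fw_ring_destroy(ring);

                /*
                 * Step 4: kill the (single) entity and tear down the
                 * associated scheduler. Userspace re-creates the ring.
                 */
                foo_ring_fini(ring);

                return DRM_GPU_SCHED_STAT_NOMINAL;
        }

Note there is no drm_sched_start() anywhere, in line with dropping step 5.
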
> 
> > +	 *
> > +	 * For a HARDWARE SCHEDULER, each ring also has one scheduler, but each
> > +	 * scheduler typically has many attached entities. This implies that you
> 
> Maybe better "associated". For the second sentence, I'd rephrase it, e.g.
> "This implies that all entities associated with the affected scheduler
> can't be torn down, because [...]".
> 
> > +	 * cannot tear down all entities associated with the affected scheduler,
> > +	 * because this would effectively also kill innocent userspace processes
> > +	 * which did not submit faulty jobs (for example).
> > +	 *
> > +	 * Consequently, the procedure to recover with a hardware scheduler
> > +	 * should look like this:
> > +	 *
> > +	 * 1. Stop all schedulers impacted by the reset using drm_sched_stop().
> > +	 * 2. Figure out to which entity the faulted job belongs.
> 
> "belongs to"
> 
> > +	 * 3. Try to gracefully stop non-faulty jobs (optional).
> > +	 * TODO: RFC ^ Folks, should we remove this step? What does it even
> > +	 * mean precisely to "stop" those jobs? Is that even helpful to
> > +	 * userspace in any way?
> 
> See above.

Agree with all. Will fix those in v2.

Danke,
P.

> 
> > +	 * 4. Kill that entity.
> > +	 * 5. Issue a GPU reset on all faulty rings (driver-specific).
> > +	 * 6. Re-submit jobs on all impacted schedulers by re-submitting them
> > +	 *    to the entities which are still alive.
> > +	 * 7. Restart all schedulers that were stopped in step #1 using
> > +	 *    drm_sched_start().
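
One more sketch for the HW scheduler steps above, same disclaimer: all
foo_* helpers are invented, and the drm_sched_start() signature below is
the post-full_recovery one from memory (second argument: errno to set on
pending fences, 0 = none), so double-check it against your tree:

        static enum drm_gpu_sched_stat
        foo_hw_timedout_job(struct drm_sched_job *sched_job)
        {
                struct foo_device *fdev = to_foo_device(sched_job->sched);
                struct foo_ring *ring;

                if (drm_dev_is_unplugged(&fdev->drm))
                        return DRM_GPU_SCHED_STAT_ENODEV;

                /* Step 1: stop all schedulers affected by the upcoming reset. */
                foo_for_each_ring(fdev, ring)
                        drm_sched_stop(&ring->sched, sched_job);

                /*
                 * Steps 2 + 4: find the entity the faulted job belongs to
                 * and kill it. (Step 3, gracefully stopping non-faulty
                 * jobs, skipped here as per the RFC above.)
                 */
                foo_kill_entity(foo_job_to_entity(sched_job));

                /* Step 5: reset the faulty rings (driver-specific). */
                foo_gpu_reset(fdev);

                /* Step 6: push pending jobs back to the surviving entities. */
                foo_resubmit_jobs(fdev);

                /* Step 7: restart the stopped schedulers. */
                foo_for_each_ring(fdev, ring)
                        drm_sched_start(&ring->sched, 0);

                return DRM_GPU_SCHED_STAT_NOMINAL;
        }
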
> >  	 *
> >  	 * Note that some GPUs have distinct hardware queues but need to reset
> >  	 * the GPU globally, which requires extra synchronization between the
> > -	 * timeout handler of the different &drm_gpu_scheduler. One way to
> > -	 * achieve this synchronization is to create an ordered workqueue
> > -	 * (using alloc_ordered_workqueue()) at the driver level, and pass this
> > -	 * queue to drm_sched_init(), to guarantee that timeout handlers are
> > -	 * executed sequentially. The above workflow needs to be slightly
> > -	 * adjusted in that case:
> > -	 *
> > -	 * 1. Stop all schedulers impacted by the reset using drm_sched_stop()
> > -	 * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
> > -	 *    the reset (optional)
> > -	 * 3. Issue a GPU reset on all faulty queues (driver-specific)
> > -	 * 4. Re-submit jobs on all schedulers impacted by the reset using
> > -	 *    drm_sched_resubmit_jobs()
> > -	 * 5. Restart all schedulers that were stopped in step #1 using
> > -	 *    drm_sched_start()
> > -	 *
> > -	 * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
> > -	 * and the underlying driver has started or completed recovery.
> > -	 *
> > -	 * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> > -	 * available, i.e. has been unplugged.
> > +	 * timeout handlers of different schedulers. One way to achieve this
> > +	 * synchronization is to create an ordered workqueue (using
> > +	 * alloc_ordered_workqueue()) at the driver level, and pass this queue
> > +	 * as drm_sched_init()'s @timeout_wq parameter. This will guarantee
> > +	 * that timeout handlers are executed sequentially.
> >  	 */
> >  	enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
> > 
> > -- 
> > 2.47.1
> > 
> 
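
P.S.: For the ordered workqueue paragraph, maybe it helps to spell out the
intended usage somewhere. Again just a sketch with invented foo_* names,
and the drm_sched_init() parameter list is what I believe the current tree
has, so it will need adjusting whenever that signature changes:

        #include <linux/jiffies.h>
        #include <linux/workqueue.h>
        #include <drm/gpu_scheduler.h>

        int foo_init_scheds(struct foo_device *fdev)
        {
                int i, ret;

                /* One ordered wq shared by all schedulers serializes TDRs. */
                fdev->reset_wq = alloc_ordered_workqueue("foo-tdr", 0);
                if (!fdev->reset_wq)
                        return -ENOMEM;

                for (i = 0; i < fdev->num_rings; i++) {
                        ret = drm_sched_init(&fdev->rings[i].sched,
                                             &foo_sched_ops,
                                             NULL, /* submit_wq: use default */
                                             DRM_SCHED_PRIORITY_COUNT,
                                             64,   /* credit_limit */
                                             0,    /* hang_limit */
                                             msecs_to_jiffies(500), /* timeout */
                                             fdev->reset_wq, /* @timeout_wq */
                                             NULL, /* score */
                                             fdev->rings[i].name,
                                             fdev->dev);
                        if (ret)
                                return ret;
                }

                return 0;
        }

Since the workqueue is ordered, a timeout handler for one ring can reset
the GPU globally without racing against the other rings' handlers.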