Hi,

On 2023-11-16 09:15, Christian König wrote:
> Start to improve the scheduler document. Especially document the
> lifetime of each of the objects as well as the restrictions around
> DMA-fence handling and userspace compatibility.
>
> v2: Some improvements suggested by Danilo, add section about error
>     handling.
>
> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> ---
>  Documentation/gpu/drm-mm.rst           |  36 +++++
>  drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
>  2 files changed, 188 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index acc5901ac840..112463fa9f3a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,12 +552,48 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler and Run Queue Objects
> +
>  Flow Control
>  ------------
>
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Flow Control
>
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Error and Timeout handling
> +
>  Scheduler Function References
>  -----------------------------
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..026123497b0e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,122 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> - *
> - * The organisation of the scheduler is the following:
> - *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *    the hardware.
> - *
> - * The jobs in a entity are always scheduled in the order that they were pushed.
> - *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * The GPU scheduler implements some logic to decide which command submission
> + * to push next to the hardware.
> + * Another major use case of the GPU scheduler

You can't start a 2nd sentence with "Another major use ...", not unless
you'd been discussing the other major use for at least a few paragraphs,
but not after just one sentence.

Get rid of "some" in "some logic", and just say "implements logic to ..."

Then the 2nd sentence should say, "The GPU scheduler also enforces
correct ..."

> + * is to enforce correct driver behavior around those command submissions.
> + * Because of this it's also used by drivers which don't need the actual
> + * scheduling functionality.
> + *
> + * All callbacks the driver needs to implement are restricted by DMA-fence
> + * signaling rules to guarantee deadlock free forward progress. This especially

"deadlock-free"

A link to the "DMA-fence signaling rules" would be nice to have. You
can't mention them and not provide a link. Naturally someone reading
this would immediately ask themselves, "What are the ``DMA-fence
signaling rules''?", and if they don't need to ask themselves this, then
they probably mostly know all of this here too.

> + * means that for normal operation no memory can be allocated in a callback.

What callback? Perhaps say, "callback into the driver", or name
it/them, as they're in the code.

> + * All memory which is needed for pushing the job to the hardware must be

"pushing _a_ job"

> + * allocated before arming a job. It also means that no locks can be taken
> + * under which memory might be allocated as well.
> + *
> + * Memory which is optional to allocate, for example for device core dumping or
> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
> + * handling taking if that allocation fails. GFP_ATOMIC should only be used if
> + * absolutely necessary since dipping into the special atomic reserves is
> + * usually not justified for a GPU driver.
> + */
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object contains submission dependencies in the form of DMA-fence
> + * objects.
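Coming back to the GFP_NOWAIT paragraph in the Overview above: it might
help driver writers to show the expected pattern for such optional
allocations, since readers coming from other subsystems often reach for
GFP_ATOMIC here. Just a sketch of course, with made-up names
("my_coredump_state", "job->coredump", "dev"):

```c
/* Optional debug state: allocate with GFP_NOWAIT and degrade gracefully
 * if the allocation fails, instead of blocking for reclaim or dipping
 * into the atomic reserves with GFP_ATOMIC.
 */
job->coredump = kzalloc(sizeof(struct my_coredump_state), GFP_NOWAIT);
if (!job->coredump)
	dev_warn_once(dev, "no memory for core dump, skipping\n");
```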
> + * Drivers can also implement an optional prepare_job callback which

Drop "also".

> + * returns additional dependencies as DMA-fence objects. It's important to note
> + * that this callback can't allocate memory or grab locks under which memory is
> + * allocated.

Probably say "that this callback into the driver cannot allocate" for
clarity.

> + *
> + * Drivers should use this as base class for an object which contains the
> + * necessary state to push the command submission to the hardware.

"this"? The context is lost by this sentence and you should say,
"Drivers should use the job structure as a base for an object which
contains ..."

I'd stay away from using the word "class".

> + *
> + * The lifetime of the job object should at least be from pushing it into the
> + * scheduler until the scheduler notes through the free callback that a job

"the free callback"? Perhaps be more specific here with something like,
"struct drm_sched_backend_ops::free_job callback".

> + * isn't needed any more. Drivers can of course keep their job object alive
> + * longer than that, but that's outside of the scope of the scheduler
> + * component. Job initialization is split into two parts, drm_sched_job_init()

This "Job initialization is split into ..." needs to start on a new
paragraph.

> + * and drm_sched_job_arm(). It's important to note that after arming a job

These two functions need to be listed, numeric or alphabetic, something
like,

    Job initialization is split into two parts,
    a) drm_sched_job_init(), which <explain here what it does and why
       drivers should/would call it and what is expected from them>, and
    b) drm_sched_job_arm(), which <explain here what this does and why
       drivers should/would call this and what is expected from them>.

> + * drivers must follow the DMA-fence rules and can't easily allocate memory
> + * or takes locks under which memory is allocated.
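To make the init/arm split concrete, a short usage sketch could be
added here, roughly like the following (from memory, so double-check the
signatures; the surrounding driver code is made up and error handling
is trimmed):

```c
/* Hypothetical driver submission path. */
ret = drm_sched_job_init(&job->base, entity, owner);
if (ret)
	return ret;

/* Allocate everything the submission needs *here*, before arming. */

drm_sched_job_arm(&job->base);
/* From this point on the DMA-fence signaling rules apply: no memory
 * allocation, no locks under which memory is allocated.
 */
drm_sched_entity_push_job(&job->base);
```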
I wouldn't use contractions for things which are forbidden and would do
s/can't/cannot/g.

A link to those "DMA-fence rules" would be helpful.

> + */
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object which is a container for jobs which should execute

Drop "which" globally. "Container of" not "for". s/should/generally/
To become this,

    The entity object is a container of jobs which generally execute ...

Still, a better way to explain what something _is_, is to start with,

    The entity object represents a context/user process/etc., which
    generates jobs to be executed by the GPU. It holds incoming jobs in
    its "job_queue". Entities are generally allowed to run on one or
    more schedulers via their "sched_list" member.

Something like that.

> + * sequentially. Drivers should create an entity for each individual context
> + * they maintain for command submissions which can run in parallel.

Drop "should" to become,

    Drivers create an entity ...

> + *
> + * The lifetime of the entity should *not* exceed the lifetime of the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush

"their" here would be ambiguous to a first-time reader. Perhaps clarify
that it is the process' file ops.

> + * callback. So it's possible that an entity object is not alive any
> + * more while jobs from it are still running on the hardware.
> + *
> + * Background is that for compatibility reasons with existing

"Background" --> "The reason for this is for compatibility with
existing ..."

> + * userspace all results of a command submission should become visible
> + * externally even after after a process exits. This is normal POSIX behavior

Remove one of the "after".

> + * for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU.
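On the drm_sched_entity_flush() point above, a tiny example of the fops
wiring would remove the ambiguity of "their". Something along these
lines, with made-up driver names ("my_driver_flush", "my_file_priv")
and an arbitrary timeout:

```c
/* In the file_operations of the DRM file the process has open: */
static int my_driver_flush(struct file *file, fl_owner_t id)
{
	struct my_file_priv *fpriv = to_my_file_priv(file);

	/* Wait (with a timeout) for queued jobs of this entity; jobs of
	 * a killed process are dropped instead of executed.
	 */
	drm_sched_entity_flush(&fpriv->entity, msecs_to_jiffies(1000));
	return 0;
}
```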
> + * So when a process is terminated with a SIGKILL the entity object
> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + */
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a DMA-fence provided by the driver as result of
> + * running jobs. Drivers need to make sure that the normal DMA-fence semantics
> + * are followed for this object. It's important to note that the memory for
> + * this object can *not* be allocated in the run_job callback since that would
> + * violate the requirements for the DMA-fence implementation. The scheduler
> + * maintains a timeout handler which triggers if this fence doesn't signal in
> + * a configurable time frame.

"frame" --> "period" or "interval".

It would greatly help to suggest where/how the hw fence should be
allocated, and give suggestions to drivers.

> + *
> + * The lifetime of this object follows DMA-fence ref-counting rules, the
> + * scheduler takes ownership of the reference returned by the driver and drops
> + * it when it's not needed any more.

Clarify the callback: "of the reference returned by the driver in the
run_job callback and drops it ..."

"when it's not needed any more" should be clarified since the lifetime
is attempted to be clarified in this paragraph here.

> + */
> +
> +/**
> + * DOC: Scheduler Fence Object

As these all refer to real structs, it would help to name them in the
DOC, as in "struct drm_sched_fence".

> + *
> + * The scheduler fence object which encapsulates the whole time from pushing

See my comments in v1. Also, "The scheduler fence object, struct
drm_sched_fence, ..." Name it explicitly so that the reader knows
exactly which structure it is.

> + * the job into the scheduler until the hardware has finished processing it.
> + * This is internally managed by the scheduler, but drivers can grab additional
> + * reference to it after arming a job.
> + * The implementation provides DMA-fence
> + * interfaces for signaling both scheduling of a command submission as well as
> + * finishing of processing.

"for signalling both scheduling of a command" is confusing. I'd
probably list them in an alphabetized list, as in

    The drm_sched_fence structure contains two DMA fences,
    a) "scheduled", which is signalled by ..., when ..., and
    b) "finished", which is signalled by ..., when ...

> + *
> + * The lifetime of this object also follows normal DMA-fence ref-counting
> + * rules. The finished fence is the one normally exposed outside of the
> + * scheduler, but the driver can grab references to both the scheduled as well
> + * as the finished fence when needed for pipe-lining optimizations.

"as well as" --> "and"

> + */
> +
> +/**
> + * DOC: Scheduler and Run Queue Objects

Perhaps list their names, struct drm_gpu_scheduler, struct drm_sched_rq.

> + *
> + * The scheduler object itself does the actual work of selecting a job and
> + * pushing it to the hardware. Both FIFO and RR selection algorithm are
> + * supported, but FIFO is preferred for many use cases.
> + *
> + * The lifetime of the scheduler is managed by the driver using it. Before
> + * destroying the scheduler the driver must ensure that all hardware processing
> + * involving this scheduler object has finished by calling for example
> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
> + * since this doesn't guarantee that all callback processing has finished.

Perhaps give an alternative of what *is* sufficient, so as to give
guidelines to driver writers.

> + *
> + * The run queue object is a container of entities for a certain priority

"for a certain" --> "of a certain"

> + * level. This object is internally managed by the scheduler and drivers
> + * shouldn't touch them directly. The lifetime of run queues are bound to the

"touch it directly"

"is bound to"

> + * schedulers lifetime.
>   */
>
>  /**
> @@ -72,6 +166,42 @@
>   * limit.
>   */
>
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors schould be signaled by using dma_fence_set_error() on the hardware

Run this patch through spell-check.

> + * fence object before signaling it. Errors are then bubbled up from the
> + * hardware fence to the scheduler fence.
> + *
> + * The entity allows querying errors on the last run submission using the
> + * drm_sched_entity_error() function which can be used to cancel queued
> + * submissions in the run_job callback as well as preventing pushing further
> + * ones into the entity in the drivers submission function.
> + *
> + * When the hardware fence fails to signal in a configurable amount of time the
> + * timedout_job callback is issued. The driver should then follow the procedure
> + * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO:
> + * The timeout handler should probably switch to using the hardware fence as
> + * parameter instead of the job. Otherwise the handling will always race
> + * between timing out and signaling the fence).
> + *
> + * The scheduler also used to provided functionality for re-submitting jobs

"scheduler is also used to provide"

> + * with replacing the hardware fence during reset handling. This functionality
> + * is now marked as deprecated. This has proven to be fundamentally racy and
> + * not compatible with DMA-fence rules and shouldn't be used in any new code.
> + *
> + * Additional there is the drm_sched_increase_karma() function which tries to

"In addition," or "Additionally,"

> + * find the entity which submitted a job and increases it's 'karma'
> + * atomic variable to prevent re-submitting jobs from this entity. This has
> + * quite some overhead and re-submitting jobs is now marked as deprecated. So
> + * using this function is rather discouraged.

Perhaps add, "This function, as well as this 'karma' business, is
slated for removal."
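For the error path, a short example in the first paragraph would make
the required ordering explicit. Something like this (a sketch only; the
"job" struct with an embedded hw_fence and the -ETIMEDOUT choice are
made up for illustration):

```c
/* On a detected fault: record the error *before* signaling, so it
 * propagates from the hardware fence up to the scheduler fence.
 */
dma_fence_set_error(&job->hw_fence, -ETIMEDOUT);
dma_fence_signal(&job->hw_fence);
```

Subsequent submissions can then check drm_sched_entity_error(entity)
and bail out early.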
> + *
> + * Drivers can still re-create the GPU state should it be lost during timeout
> + * handling when they can guarantee that forward progress is made and this
> + * doesn't cause another timeout. But this is strongly hardware specific and
> + * out of the scope of the general GPU scheduler.
> + */
> +
>  #include <linux/wait.h>
>  #include <linux/sched.h>
>  #include <linux/completion.h>

--
Regards,
Luben