Re: [PATCH] drm/scheduler: improve GPU scheduler documentation

Luben Tuikov <ltuikov89@xxxxxxxxx> · Thu, 16 Nov 2023 22:18:18 -0500

On 2023-11-13 07:38, Christian König wrote:
> Start to improve the scheduler document. Especially document the
> lifetime of each of the objects as well as the restrictions around
> DMA-fence handling and userspace compatibility.

Thanks Christian for doing this--much needed.

> 
> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 126 ++++++++++++++++++++-----
>  1 file changed, 104 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 506371c42745..36a7c5dc852d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,110 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.

So, I don't mind this paragraph, as it provides an overview or the relationship between
a DRM GPU scheduler, entities, run-queues, and jobs.

> - *
> - * The organisation of the scheduler is the following:
> - *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *    the hardware.

This is also good, and shouldn't have been deleted.

> - *
> - * The jobs in a entity are always scheduled in the order that they were pushed.

I'd have said here "jobs within an entity". Again shouldn't have been deleted.
This is good overall/overview information.

> - *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * The GPU scheduler implements some logic to decide which command submission
> + * to push next to the hardware. Another major use case for the GPU scheduler
> + * is to enforce correct driver behavior around those command submission.

The GPU scheduler also enforces correct driver behaviour around those command submissions.

> + * Because of this it's also used by drivers which don't need the actual
> + * scheduling functionality.

... but need to push jobs into their firmware/hardware and maintain/keep correct
DRM dependencies in the form of "fences".

> + *
> + * To fulfill this task the GPU scheduler uses of the following objects:
> + *
> + * 1. The job object which contains a bunch of dependencies in the form of

Drop "which".

Instead of listing what it contains, how it is used, what it does, explain
what it is: work/task to be executed by the GPU. _Then_ you can start listing
what it contains, how it is used, what it does.

> + *    DMA-fence objects. Drivers can also implement an optional prepare_job
> + *    callback which returns additional dependencies as DMA-fence objects.

"can also"? This would usually follow if the other callbacks/etc., have been described
and they haven't, so I'd say "Drivers implement an optional prepare_job callback,..."

> + *    It's important to note that this callback must follow the DMA-fence rules,
> + *    so it can't easily allocate memory or grab locks under which memory is
> + *    allocated. Drivers should use this as base class for an object which
> + *    contains the necessary state to push the command submission to the
> + *    hardware.
> + *
> + *    The lifetime of the job object should at least be from pushing it into the
> + *    scheduler until the scheduler notes through the free callback that a job
> + *    isn't needed any more. Drivers can of course keep their job object alive
> + *    longer than that, but that's outside of the scope of the scheduler
> + *    component.

[New paragraph starts describing the job initialization.]

Add:  Job initialization is split into two parts,
> + *    drm_sched_job_init() and drm_sched_job_arm().

Perhaps we should mention briefly what each one does..?

Add:  It's important to note that
> + *    after arming a job drivers must follow the DMA-fence rules and can't
> + *    easily allocate memory or takes locks under which memory is allocated.
> + *
> + * 2. The entity object which is a container for jobs which should execute

Drop "which". "The entity object is a container of ..."

> + *    sequentially. Drivers should create an entity for each individual context
> + *    they maintain for command submissions which can run in parallel.

This isn't very clear, but can be made so by: "they maintain for command submissions."
"Entities' jobs can run in parallel as facilitated by the GPU."

> + *
> + *    The lifetime of the entity should *not* exceed the lifetime of the
> + *    userspace process it was created for and drivers should call the
> + *    drm_sched_entity_flush() function from their file_operations.flush
> + *    callback. 

Okay... they should... WHEN? When we use "should do something" we always clarify with "when xyz happens."

Add:  Background is that for compatibility reasons with existing
> + *    userspace all results of a command submission should become visible

"Background" --> "The reason for this is for compatibility with existing ..."

> + *    externally even after after a process exits. The only exception to that

Remove one of the two "after".

> + *    is when the process is actively killed by a SIGKILL. In this case the
> + *    entity object makes sure that jobs are freed without running them while
> + *    still maintaining correct sequential order for signaling fences. So it's
> + *    possible that an entity object is not alive any more while jobs from it

"to not be alive"
(This paragraph here, including SIGKILL, could be made clearer, by splitting it in two
parts. One describing normal behaviour, i.e. SIGTERM, and the other describing SIGKILL.)

> + *    are still running on the hardware.
> + *
> + * 3. The hardware fence object which is a DMA-fence provided by the driver as

Drop "which". "The hardware fence object is a DMA-fence which is provided by the driver
as _a_ result of running a job/jobs."

> + *    result of running jobs. Drivers need to make sure that the normal
> + *    DMA-fence semantics are followed for this object. It's important to note

"DMA-fence semantics" are mentioned several times so far, and a link to a description
of said semantics (or NULL if none is in the kernel)--would be nice to put.

> + *    that the memory for this object can *not* be allocated in the run_job
> + *    callback since that would violate the requirements for the DMA-fence
> + *    implementation.

Is it "no allocation" or just "allocation for this object" in run_job?
(Or maybe no kernel allocation..., we should probably clarify this.)

Add:  The scheduler maintains a timeout handler which triggers
> + *    if this fence doesn't signal in a configurable time frame.
> + *
> + *    The lifetime of this object follows DMA-fence ref-counting rules, the
> + *    scheduler takes ownership of the reference returned by the driver and
> + *    drops it when it's not needed any more. Errors should also be signaled
> + *    through the hardware fence and are bubbled up back to the scheduler fence

By whom?
"through the hardware fence by the driver, and are ..."

> + *    and entity.
> + *
> + * 4. The scheduler fence object which encapsulates the whole time from pushing
> + *    the job into the scheduler until the hardware has finished processing it.

Perhaps drop "time encapsulation."

It's not very clear what you want to say here. Perhaps, use "exists" and drop "which", as in:
4. The scheduler fence object exists/is ref-ed by DRM, etc., from the time when the job is
pushed into the scheduler until the hardware has finished with it.

If fence signaling is involved in those two steps, it should be noted here.

If this is about ownership, it should be simply stated.

> + *    This is internally managed by the scheduler, but drivers can grab
> + *    additional reference to it after arming a job. The implementation

Instead of "implementation" perhaps use "DRM scheduler code?"

> + *    provides DMA-fence interfaces for signaling both scheduling of a command
> + *    submission as well as finishing of processing.

I'd clarify with
	"provides DMA-fence interfaces for drivers, for the scheduling of a command
	 submission, akin to the start of a command, and finishing command processing."

Perhaps we can also mention when drivers are supposed to call these...?

> + *
> + *    The lifetime of this object also follows normal DMA-fence ref-counting
> + *    rules. The finished fence is the one normally exposed outside of the
> + *    scheduler, but the driver can grab references to both the scheduled as
> + *    well as the finished fence when needed for pipe-lining optimizations.
> + *
> + * 5. The run queue object which is a container of entities for a certain
> + *    priority level. The lifetime of those objects are bound to the scheduler
> + *    lifetime.

"which is" --> "is"
"entities for a certain" --> "entities of a certain"

> + *
> + *    This is internally managed by the scheduler and drivers shouldn't touch
> + *    them directly.
> + *
> + * 6. The scheduler object itself which does the actual work of selecting a job
> + *    and pushing it to the hardware. Both FIFO and RR selection algorithm are
> + *    supported, but FIFO is preferred for many use cases.

Let's drop "which" and just say "The scheduler object does the actual work of ..."

> + *
> + *    The lifetime of this object is managed by the driver using it. Before
> + *    destroying the scheduler the driver must ensure that all hardware
> + *    processing involving this scheduler object has finished by calling for
> + *    example disable_irq(). It is *not* sufficient to wait for the hardware
> + *    fence here since this doesn't guarantee that all callback processing has
> + *    finished.

Okay, and perhaps we could mention drm_sched_fini() here.

> + *
> + * All callbacks the driver needs to implement are restricted by DMA-fence
> + * signaling rules to guarantee deadlock free forward progress. This especially
> + * means that for normal operation no memory can be allocated. All memory which

Need a pointer to the "DMA-fence signaling rules" and also need to define "normal operation",
or clarify it in the sentence.

> + * is needed for pushing the job to the hardware must be allocated before
> + * arming a job. It also means that no locks can be taken under which memory
> + * might be allocated as well.

Yes, that makes sense. I think this is generally the case for driver and firmware
writers, and I'd think is common sense, but it is good to have it in writing.

> + *
> + * Memory which is optional to allocate for device core dumping or debugging
> + * *must* be allocated with GFP_NOWAIT and appropriate error handling taking if
> + * that allocation fails. GFP_ATOMIC should only be used if absolutely
> + * necessary since dipping into the special atomic reserves is usually not
> + * justified for a GPU driver.

Yes, that's well said. Good to have it here.

> + *
> + * The scheduler also used to provided functionality for re-submitting jobs

"The scheduler also used ..." --> "The scheduler is also used ..."

> + * with replacing the hardware fence during reset handling. This functionality
> + * is now marked as deprecated since this has proven to be fundamentally racy
> + * and not compatible with DMA-fence rules and shouldn't be used in any new
> + * code.

I wouldn't use a contraction here. In order to emphasize that job re-submission
is depreciated, I'd use:
	"and should not be used in any new code."

>   */
>  
>  #include <linux/kthread.h>

-- 
Regards,
Luben

Attachment:
OpenPGP_0x4C15479431A334AF.asc

Description: OpenPGP public key
Attachment:
OpenPGP_signature.asc

Description: OpenPGP digital signature