Start to improve the scheduler documentation. Especially document the
lifetime of each of the objects as well as the restrictions around
DMA-fence handling and userspace compatibility.

Signed-off-by: Christian König <christian.koenig@xxxxxxx>
---
 drivers/gpu/drm/scheduler/sched_main.c | 126 ++++++++++++++++++++-----
 1 file changed, 104 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 506371c42745..36a7c5dc852d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -24,28 +24,110 @@
 /**
  * DOC: Overview
  *
- * The GPU scheduler provides entities which allow userspace to push jobs
- * into software queues which are then scheduled on a hardware run queue.
- * The software queues have a priority among them. The scheduler selects the entities
- * from the run queue using a FIFO. The scheduler provides dependency handling
- * features among jobs. The driver is supposed to provide callback functions for
- * backend operations to the scheduler like submitting a job to hardware run queue,
- * returning the dependencies of a job etc.
- *
- * The organisation of the scheduler is the following:
- *
- * 1. Each hw run queue has one scheduler
- * 2. Each scheduler has multiple run queues with different priorities
- *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
- * 3. Each scheduler run queue has a queue of entities to schedule
- * 4. Entities themselves maintain a queue of jobs that will be scheduled on
- *    the hardware.
- *
- * The jobs in a entity are always scheduled in the order that they were pushed.
- *
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * The GPU scheduler implements some logic to decide which command submission
+ * to push next to the hardware. Another major use case for the GPU scheduler
+ * is to enforce correct driver behavior around those command submissions.
+ * Because of this it's also used by drivers which don't need the actual
+ * scheduling functionality.
+ *
+ * To fulfill this task the GPU scheduler makes use of the following objects:
+ *
+ * 1. The job object which contains a bunch of dependencies in the form of
+ *    DMA-fence objects. Drivers can also implement an optional prepare_job
+ *    callback which returns additional dependencies as DMA-fence objects.
+ *    It's important to note that this callback must follow the DMA-fence rules,
+ *    so it can't easily allocate memory or grab locks under which memory is
+ *    allocated. Drivers should use this as the base class for an object which
+ *    contains the necessary state to push the command submission to the
+ *    hardware.
+ *
+ *    The lifetime of the job object should at least be from pushing it into the
+ *    scheduler until the scheduler notes through the free callback that a job
+ *    isn't needed any more. Drivers can of course keep their job object alive
+ *    longer than that, but that's outside of the scope of the scheduler
+ *    component. Job initialization is split into two parts,
+ *    drm_sched_job_init() and drm_sched_job_arm(). It's important to note that
+ *    after arming a job drivers must follow the DMA-fence rules and can't
+ *    easily allocate memory or take locks under which memory is allocated.
+ *
+ * 2. The entity object which is a container for jobs which should execute
+ *    sequentially. Drivers should create an entity for each individual context
+ *    they maintain for command submissions which can run in parallel.
+ *
+ *    The lifetime of the entity should *not* exceed the lifetime of the
+ *    userspace process it was created for, and drivers should call the
+ *    drm_sched_entity_flush() function from their file_operations.flush
+ *    callback. The background is that for compatibility reasons with existing
+ *    userspace all results of a command submission should become visible
+ *    externally even after a process exits. The only exception to that
+ *    is when the process is actively killed by a SIGKILL. In this case the
+ *    entity object makes sure that jobs are freed without running them while
+ *    still maintaining correct sequential order for signaling fences. So it's
+ *    possible that an entity object is not alive any more while jobs from it
+ *    are still running on the hardware.
+ *
+ * 3. The hardware fence object which is a DMA-fence provided by the driver as
+ *    the result of running jobs. Drivers need to make sure that the normal
+ *    DMA-fence semantics are followed for this object. It's important to note
+ *    that the memory for this object can *not* be allocated in the run_job
+ *    callback since that would violate the requirements for the DMA-fence
+ *    implementation. The scheduler maintains a timeout handler which triggers
+ *    if this fence doesn't signal in a configurable time frame.
+ *
+ *    The lifetime of this object follows DMA-fence ref-counting rules; the
+ *    scheduler takes ownership of the reference returned by the driver and
+ *    drops it when it's not needed any more. Errors should also be signaled
+ *    through the hardware fence and are bubbled back up to the scheduler fence
+ *    and entity.
+ *
+ * 4. The scheduler fence object which encapsulates the whole time from pushing
+ *    the job into the scheduler until the hardware has finished processing it.
+ *    This is internally managed by the scheduler, but drivers can grab an
+ *    additional reference to it after arming a job. The implementation
+ *    provides DMA-fence interfaces for signaling both the scheduling of a
+ *    command submission as well as the finishing of processing.
+ *
+ *    The lifetime of this object also follows normal DMA-fence ref-counting
+ *    rules. The finished fence is the one normally exposed outside of the
+ *    scheduler, but the driver can grab references to both the scheduled as
+ *    well as the finished fence when needed for pipelining optimizations.
+ *
+ * 5. The run queue object which is a container of entities for a certain
+ *    priority level. The lifetime of those objects is bound to the scheduler
+ *    lifetime.
+ *
+ *    These are internally managed by the scheduler and drivers shouldn't touch
+ *    them directly.
+ *
+ * 6. The scheduler object itself which does the actual work of selecting a job
+ *    and pushing it to the hardware. Both FIFO and RR selection algorithms are
+ *    supported, but FIFO is preferred for many use cases.
+ *
+ *    The lifetime of this object is managed by the driver using it. Before
+ *    destroying the scheduler the driver must ensure that all hardware
+ *    processing involving this scheduler object has finished, for example by
+ *    calling disable_irq(). It is *not* sufficient to wait for the hardware
+ *    fence here since this doesn't guarantee that all callback processing has
+ *    finished.
+ *
+ * All callbacks the driver needs to implement are restricted by DMA-fence
+ * signaling rules to guarantee deadlock-free forward progress. This especially
+ * means that for normal operation no memory can be allocated. All memory which
+ * is needed for pushing the job to the hardware must be allocated before
+ * arming a job. It also means that no locks can be taken under which memory
+ * might be allocated.
+ *
+ * Memory which is optional to allocate, for example for device core dumping or
+ * debugging, *must* be allocated with GFP_NOWAIT and appropriate error handling
+ * taken if that allocation fails. GFP_ATOMIC should only be used if absolutely
+ * necessary since dipping into the special atomic reserves is usually not
+ * justified for a GPU driver.
+ *
+ * The scheduler also used to provide functionality for re-submitting jobs and
+ * replacing the hardware fence during reset handling. This functionality is
+ * now marked as deprecated since this has proven to be fundamentally racy
+ * and not compatible with DMA-fence rules and shouldn't be used in any new
+ * code.
  */
 
 #include <linux/kthread.h>
-- 
2.34.1
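
As a rough illustration of the drm_sched_job_init()/drm_sched_job_arm() split
described under point 1 of the overview, here is a minimal sketch of a
submission path. All foo_* names are hypothetical, the exact
drm_sched_job_init() parameters have changed between kernel versions, and the
hardware fence and ring setup are only hinted at in comments; this is a sketch
of the pattern, not a reference implementation.

    #include <drm/gpu_scheduler.h>
    #include <linux/dma-fence.h>
    #include <linux/slab.h>

    struct foo_device;          /* hypothetical driver device structure */

    /* Hypothetical driver job, using struct drm_sched_job as its base class. */
    struct foo_job {
            struct drm_sched_job base;
            struct dma_fence *hw_fence;     /* allocated before arming */
            void *debug_snapshot;           /* optional debug data, may stay NULL */
    };

    static int foo_submit(struct foo_device *fdev, struct drm_sched_entity *entity)
    {
            struct foo_job *job;
            int ret;

            /* Everything that allocates memory or can fail happens before arming. */
            job = kzalloc(sizeof(*job), GFP_KERNEL);
            if (!job)
                    return -ENOMEM;

            ret = drm_sched_job_init(&job->base, entity, fdev);
            if (ret)
                    goto err_free;

            /* ... allocate the HW fence and all other submission state here ... */

            /*
             * From drm_sched_job_arm() on the DMA-fence rules apply: no memory
             * allocation, no locks under which memory is allocated, no bail out.
             */
            drm_sched_job_arm(&job->base);
            drm_sched_entity_push_job(&job->base);

            return 0;

    err_free:
            kfree(job);
            return ret;
    }

Keeping every fallible step in front of drm_sched_job_arm() is the whole point
of the two-step initialization: once the job is armed its fences are visible to
the rest of the system and the submission can no longer be unwound.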
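The file_operations.flush hook mentioned under point 2 could be wired up
roughly as follows. struct foo_file_ctx and foo_flush() are hypothetical, and
the one second timeout is only an example value; drivers pick their own.

    #include <drm/gpu_scheduler.h>
    #include <linux/fs.h>
    #include <linux/jiffies.h>

    /* Hypothetical per-open-file state owning one scheduler entity. */
    struct foo_file_ctx {
            struct drm_sched_entity entity;
    };

    static int foo_flush(struct file *file, fl_owner_t id)
    {
            struct foo_file_ctx *ctx = file->private_data;

            /*
             * Wait for already queued jobs so their results stay visible after
             * the process exits; when the process was killed with SIGKILL the
             * entity instead drops the remaining jobs without running them.
             */
            drm_sched_entity_flush(&ctx->entity, msecs_to_jiffies(1000));

            return 0;
    }

    static const struct file_operations foo_fops = {
            .flush = foo_flush,
            /* open/release/ioctl etc. omitted */
    };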
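Finally, a sketch of how the callback restrictions and the GFP_NOWAIT rule
might look in a driver's drm_sched_backend_ops, reusing the hypothetical
struct foo_job from the first sketch. foo_ring_submit() stands in for whatever
writes the submission to the hardware and is assumed not to allocate memory.

    /* Hypothetical, must not allocate memory or take memory-allocating locks. */
    static void foo_ring_submit(struct foo_job *job);

    static struct dma_fence *foo_run_job(struct drm_sched_job *sched_job)
    {
            struct foo_job *job = container_of(sched_job, struct foo_job, base);

            /* Optional debug data: GFP_NOWAIT only, failure must be tolerated. */
            job->debug_snapshot = kzalloc(64, GFP_NOWAIT);

            /* Push to the ring; the HW fence was already allocated before arming. */
            foo_ring_submit(job);

            /* The scheduler takes over this reference and drops it when done. */
            return dma_fence_get(job->hw_fence);
    }

    static void foo_free_job(struct drm_sched_job *sched_job)
    {
            struct foo_job *job = container_of(sched_job, struct foo_job, base);

            drm_sched_job_cleanup(sched_job);
            dma_fence_put(job->hw_fence);
            kfree(job->debug_snapshot);
            kfree(job);
    }

    static const struct drm_sched_backend_ops foo_sched_ops = {
            .run_job        = foo_run_job,
            .free_job       = foo_free_job,
            /* .prepare_job and .timedout_job omitted for brevity */
    };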