Signed-off-by: Christian König <christian.koenig@xxxxxxx>
---
drivers/gpu/drm/scheduler/sched_main.c | 126 ++++++++++++++++++++-----
1 file changed, 104 insertions(+), 22 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 506371c42745..36a7c5dc852d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -24,28 +24,110 @@
/**
* DOC: Overview
*
- * The GPU scheduler provides entities which allow userspace to push jobs
- * into software queues which are then scheduled on a hardware run queue.
- * The software queues have a priority among them. The scheduler selects the entities
- * from the run queue using a FIFO. The scheduler provides dependency handling
- * features among jobs. The driver is supposed to provide callback functions for
- * backend operations to the scheduler like submitting a job to hardware run queue,
- * returning the dependencies of a job etc.
- *
- * The organisation of the scheduler is the following:
- *
- * 1. Each hw run queue has one scheduler
- * 2. Each scheduler has multiple run queues with different priorities
- *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
- * 3. Each scheduler run queue has a queue of entities to schedule
- * 4. Entities themselves maintain a queue of jobs that will be scheduled on
- *    the hardware.
- *
- * The jobs in a entity are always scheduled in the order that they were pushed.
- *
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * The GPU scheduler implements some logic to decide which command submission
+ * to push next to the hardware. Another major use case of the GPU scheduler
+ * is to enforce correct driver behavior around those command submissions.
+ * Because of this it's also used by drivers which don't need the actual
+ * scheduling functionality.
+ *
+ * To fulfill this task the GPU scheduler uses the following objects:
+ *
+ * 1. The job object which contains a bunch of dependencies in the form of
+ *    DMA-fence objects. Drivers can also implement an optional prepare_job
+ *    callback which returns additional dependencies as DMA-fence objects.
+ *    It's important to note that this callback must follow the DMA-fence
+ *    rules, so it can't easily allocate memory or grab locks under which
+ *    memory is allocated. Drivers should use this as the base class for an
+ *    object which contains the necessary state to push the command
+ *    submission to the hardware.
+ *
+ *    The lifetime of the job object should at least be from pushing it into
+ *    the scheduler until the scheduler notes through the free callback that
+ *    a job isn't needed any more. Drivers can of course keep their job
+ *    object alive longer than that, but that's outside of the scope of the
+ *    scheduler component. Job initialization is split into two parts,
+ *    drm_sched_job_init() and drm_sched_job_arm(). It's important to note
+ *    that after arming a job drivers must follow the DMA-fence rules and
+ *    can't easily allocate memory or take locks under which memory is
+ *    allocated.
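+ *
+ *    A minimal sketch of the split initialization, assuming a hypothetical
+ *    my_job wrapper and an in_fence dependency (note that the exact
+ *    drm_sched_job_init() prototype differs between kernel versions)::
+ *
+ *       struct my_job {
+ *               struct drm_sched_job base;
+ *               // driver private state needed by run_job()
+ *       };
+ *
+ *       // memory allocations are still allowed before arming
+ *       ret = drm_sched_job_init(&job->base, entity, owner);
+ *       if (ret)
+ *               goto err_free_job;
+ *
+ *       // takes ownership of the fence reference, even on error
+ *       ret = drm_sched_job_add_dependency(&job->base,
+ *                                          dma_fence_get(in_fence));
+ *       if (ret)
+ *               goto err_cleanup_job;
+ *
+ *       // from here on the DMA-fence rules apply
+ *       drm_sched_job_arm(&job->base);
+ *       drm_sched_entity_push_job(&job->base);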
+ *
+ * 2. The entity object which is a container for jobs which should execute
+ *    sequentially. Drivers should create an entity for each individual
+ *    context they maintain for command submissions which can run in
+ *    parallel.
+ *
+ *    The lifetime of the entity should *not* exceed the lifetime of the
+ *    userspace process it was created for and drivers should call the
+ *    drm_sched_entity_flush() function from their file_operations.flush
+ *    callback. The background is that for compatibility reasons with
+ *    existing userspace all results of a command submission should become
+ *    visible externally even after a process exits. The only exception to
+ *    that is when the process is actively killed by a SIGKILL. In this case
+ *    the entity object makes sure that jobs are freed without running them
+ *    while still maintaining correct sequential order for signaling fences.
+ *    So it's possible that an entity object is no longer alive while jobs
+ *    from it are still running on the hardware.
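+ *
+ *    A sketch of the flush wiring, assuming a hypothetical my_file_priv
+ *    structure embedding the entity; the timeout is an arbitrary choice::
+ *
+ *       static int my_driver_flush(struct file *f, fl_owner_t id)
+ *       {
+ *               struct drm_file *file_priv = f->private_data;
+ *               struct my_file_priv *priv = file_priv->driver_priv;
+ *
+ *               // waits for the entity queue to drain, but gives up
+ *               // early when the process was killed with SIGKILL
+ *               drm_sched_entity_flush(&priv->entity,
+ *                                      msecs_to_jiffies(1000));
+ *               return 0;
+ *       }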
+ *
+ * 3. The hardware fence object which is a DMA-fence provided by the driver
+ *    as the result of running jobs. Drivers need to make sure that the
+ *    normal DMA-fence semantics are followed for this object. It's important
+ *    to note that the memory for this object can *not* be allocated in the
+ *    run_job callback since that would violate the requirements for the
+ *    DMA-fence implementation. The scheduler maintains a timeout handler
+ *    which triggers if this fence doesn't signal in a configurable time
+ *    frame.
+ *
+ *    The lifetime of this object follows DMA-fence ref-counting rules; the
+ *    scheduler takes ownership of the reference returned by the driver and
+ *    drops it when it's not needed any more. Errors should also be signaled
+ *    through the hardware fence and are bubbled back up to the scheduler
+ *    fence and entity.
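+ *
+ *    A rough sketch of a run_job implementation, assuming the driver
+ *    embedded a pre-allocated fence in its hypothetical my_job structure
+ *    at submission time::
+ *
+ *       static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
+ *       {
+ *               struct my_job *job = to_my_job(sched_job);
+ *
+ *               // the fence memory was allocated before arming the job,
+ *               // only initialize it here
+ *               dma_fence_init(&job->hw_fence, &my_fence_ops,
+ *                              &job->ring->fence_lock,
+ *                              job->ring->fence_context,
+ *                              ++job->ring->seqno);
+ *               my_ring_emit(job->ring, job); // hypothetical helper
+ *
+ *               // the scheduler takes over this reference
+ *               return dma_fence_get(&job->hw_fence);
+ *       }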
+ *
+ * 4. The scheduler fence object which encapsulates the whole time from
+ *    pushing the job into the scheduler until the hardware has finished
+ *    processing it. This is internally managed by the scheduler, but drivers
+ *    can grab an additional reference to it after arming a job. The
+ *    implementation provides DMA-fence interfaces for signaling both the
+ *    scheduling of a command submission and the finishing of its processing.
+ *
+ *    The lifetime of this object also follows normal DMA-fence ref-counting
+ *    rules. The finished fence is the one normally exposed outside of the
+ *    scheduler, but the driver can grab references to both the scheduled and
+ *    the finished fence when needed for pipelining optimizations.
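+ *
+ *    For example, after arming a job the driver can grab extra references
+ *    like this (sketch, error handling omitted)::
+ *
+ *       drm_sched_job_arm(&job->base);
+ *
+ *       // signals when the job was picked up from the entity queue
+ *       scheduled = dma_fence_get(&job->base.s_fence->scheduled);
+ *       // signals when the hardware finished processing the job
+ *       finished = dma_fence_get(&job->base.s_fence->finished);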
+ *
+ * 5. The run queue object which is a container of entities for a certain
+ *    priority level. The lifetime of those objects is bound to the scheduler
+ *    lifetime.
+ *
+ *    Run queues are internally managed by the scheduler and drivers
+ *    shouldn't touch them directly.
+ *
+ * 6. The scheduler object itself which does the actual work of selecting a
+ *    job and pushing it to the hardware. Both the FIFO and RR selection
+ *    algorithms are supported, but FIFO is preferred for many use cases.
+ *
+ *    The lifetime of this object is managed by the driver using it. Before
+ *    destroying the scheduler the driver must ensure that all hardware
+ *    processing involving this scheduler object has finished, for example by
+ *    calling disable_irq(). It is *not* sufficient to wait for the hardware
+ *    fence here since this doesn't guarantee that all callback processing
+ *    has finished.
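+ *
+ *    A sketch of the resulting teardown order, with my_disable_completions()
+ *    standing in for whatever driver-specific mechanism stops interrupt and
+ *    completion processing::
+ *
+ *       // no fence or timeout callbacks can run concurrently after this
+ *       my_disable_completions(ring);
+ *       drm_sched_fini(&ring->sched);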
+ *
+ * All callbacks the driver needs to implement are restricted by DMA-fence
+ * signaling rules to guarantee deadlock-free forward progress. This
+ * especially means that for normal operation no memory can be allocated. All
+ * memory which is needed for pushing the job to the hardware must be
+ * allocated before arming a job. It also means that no locks can be taken
+ * under which memory might be allocated.
+ *
+ * Memory whose allocation is optional, e.g. for device core dumping or
+ * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
+ * handling taken if that allocation fails. GFP_ATOMIC should only be used if
+ * absolutely necessary since dipping into the special atomic reserves is
+ * usually not justified for a GPU driver.
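+ *
+ * For example, when capturing optional debug state in a signaling path
+ * (my_dump_info is a hypothetical driver structure)::
+ *
+ *    struct my_dump_info *info;
+ *
+ *    info = kzalloc(sizeof(*info), GFP_NOWAIT);
+ *    if (!info)
+ *            return; // skip the dump, the submission still completes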
+ *
+ * The scheduler also used to provide functionality for re-submitting jobs
+ * and replacing the hardware fence during reset handling. This functionality
+ * is now deprecated since it has proven to be fundamentally racy and not
+ * compatible with the DMA-fence rules, and it shouldn't be used in any new
+ * code.
*/
#include <linux/kthread.h>