Hi Christian,
Thanks for sending an update of this patch!
On Thu, Nov 16, 2023 at 03:15:46PM +0100, Christian König wrote:
Start to improve the scheduler document. Especially document the
lifetime of each of the objects as well as the restrictions around
DMA-fence handling and userspace compatibility.
v2: Some improvements suggested by Danilo, add section about error
handling.
Signed-off-by: Christian König <christian.koenig@xxxxxxx>
---
Documentation/gpu/drm-mm.rst | 36 +++++
drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
2 files changed, 188 insertions(+), 22 deletions(-)
diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index acc5901ac840..112463fa9f3a 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -552,12 +552,48 @@ Overview
.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Overview
+Job Object
+----------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Job Object
+
+Entity Object
+-------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Entity Object
+
+Hardware Fence Object
+---------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Hardware Fence Object
+
+Scheduler Fence Object
+----------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Scheduler Fence Object
+
+Scheduler and Run Queue Objects
+-------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Scheduler and Run Queue Objects
+
Flow Control
------------
.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Flow Control
+Error and Timeout handling
+--------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Error and Timeout handling
+
Scheduler Function References
-----------------------------
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 044a8c4875ba..026123497b0e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -24,28 +24,122 @@
/**
* DOC: Overview
*
- * The GPU scheduler provides entities which allow userspace to push jobs
- * into software queues which are then scheduled on a hardware run queue.
- * The software queues have a priority among them. The scheduler selects the entities
- * from the run queue using a FIFO. The scheduler provides dependency handling
- * features among jobs. The driver is supposed to provide callback functions for
- * backend operations to the scheduler like submitting a job to hardware run queue,
- * returning the dependencies of a job etc.
- *
- * The organisation of the scheduler is the following:
- *
- * 1. Each hw run queue has one scheduler
- * 2. Each scheduler has multiple run queues with different priorities
- *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
- * 3. Each scheduler run queue has a queue of entities to schedule
- * 4. Entities themselves maintain a queue of jobs that will be scheduled on
- *    the hardware.
- *
- * The jobs in a entity are always scheduled in the order that they were pushed.
- *
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * The GPU scheduler implements some logic to decide which command submission
+ * to push next to the hardware. Another major use case of the GPU scheduler
+ * is to enforce correct driver behavior around those command submissions.
+ * Because of this it's also used by drivers which don't need the actual
+ * scheduling functionality.
This reads a bit like we're already right in the middle of the
documentation. I'd propose to start with something like "The DRM GPU
scheduler serves as a common component intended to help drivers to manage
command submissions." And then mention the different solutions the
scheduler provides, e.g. ordering of command submissions, dma-fence
handling in the context of command submissions, etc.

Also, I think it would be good to give a rough overview of the topology
of the scheduler's components. And since you already mention that the
component is "also used by drivers which don't need the actual scheduling
functionality", I'd also mention the special case of a single entity and
a single run-queue per scheduler.
+ *
+ * All callbacks the driver needs to implement are restricted by DMA-fence
+ * signaling rules to guarantee deadlock free forward progress. This especially
+ * means that for normal operation no memory can be allocated in a callback.
+ * All memory which is needed for pushing the job to the hardware must be
+ * allocated before arming a job. It also means that no locks can be taken
+ * under which memory might be allocated as well.
I think that's good. That said, with the recently merged workqueue
patches, drivers can actually create a setup where the free_job callback
isn't part of the fence signalling critical path anymore. But I agree
with Sima that this is probably too error prone to give drivers ideas.
So, this paragraph is probably good as it is. :-)
+ *
+ * Memory which is optional to allocate, for example for device core dumping or
+ * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
+ * handling taking if that allocation fails. GFP_ATOMIC should only be used if
+ * absolutely necessary since dipping into the special atomic reserves is
+ * usually not justified for a GPU driver.
+ */
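Maybe it would also be worth adding a tiny example snippet for this
somewhere, something along these lines perhaps (completely made up,
foo_* names, the snapshot use case and the struct layout are just
placeholders):

static enum drm_gpu_sched_stat foo_timedout_job(struct drm_sched_job *sched_job)
{
        struct foo_job *job = container_of(sched_job, struct foo_job, base);
        struct foo_snapshot *snap;

        /*
         * Optional allocation for debugging only: never wait for reclaim
         * here, just continue without the snapshot if it fails.
         */
        snap = kzalloc(sizeof(*snap), GFP_NOWAIT);
        if (snap)
                foo_capture_snapshot(job, snap);

        /* ... reset the hardware and signal the fence with an error ... */

        return DRM_GPU_SCHED_STAT_NOMINAL;
}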
+
+/**
+ * DOC: Job Object
+ *
+ * The base job object contains submission dependencies in the form of DMA-fence
+ * objects. Drivers can also implement an optional prepare_job callback which
+ * returns additional dependencies as DMA-fence objects. It's important to note
+ * that this callback can't allocate memory or grab locks under which memory is
+ * allocated.
+ *
+ * Drivers should use this as base class for an object which contains the
+ * necessary state to push the command submission to the hardware.
+ *
+ * The lifetime of the job object should at least be from pushing it into the
"should at least last from"
+ * scheduler until the scheduler notes through the free callback that a job
What about "until the free_job callback has been called and hence the
scheduler
does not require the job object anymore."?
+ * isn't needed any more. Drivers can of course keep their job object alive
+ * longer than that, but that's outside of the scope of the scheduler
+ * component. Job initialization is split into two parts, drm_sched_job_init()
+ * and drm_sched_job_arm(). It's important to note that after arming a job
I suggest adding a brief comment on why job initialization is split up.
+ * drivers must follow the DMA-fence rules and can't easily allocate memory
+ * or takes locks under which memory is allocated.
+ */
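And maybe a tiny example of the intended ordering wouldn't hurt either,
roughly like this (hypothetical foo_* driver, most error handling
trimmed, the exact drm_sched_job_init() arguments depend on the kernel
version):

        /* In the driver's submission path, e.g. the submit IOCTL: */
        ret = drm_sched_job_init(&job->base, entity, owner);
        if (ret)
                return ret;

        /* Allocate everything the job will need while that is still allowed. */
        ret = foo_job_prepare_resources(job);
        if (ret) {
                drm_sched_job_cleanup(&job->base);
                return ret;
        }

        drm_sched_job_arm(&job->base);

        /* From here on the DMA-fence rules apply, no more memory allocation. */
        drm_sched_entity_push_job(&job->base);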
+
+/**
+ * DOC: Entity Object
+ *
+ * The entity object which is a container for jobs which should execute
+ * sequentially. Drivers should create an entity for each individual context
+ * they maintain for command submissions which can run in parallel.
+ *
+ * The lifetime of the entity should *not* exceed the lifetime of the
+ * userspace process it was created for and drivers should call the
+ * drm_sched_entity_flush() function from their file_operations.flush
+ * callback. So it's possible that an entity object is not alive any
"Note that it is possible..."
+ * more while jobs from it are still running on the hardware.
"while jobs previously fetched from this entity are still..."
+ *
+ * Background is that for compatibility reasons with existing
+ * userspace all results of a command submission should become visible
+ * externally even after after a process exits. This is normal POSIX behavior
+ * for I/O operations.
+ *
+ * The problem with this approach is that GPU submissions contain executable
+ * shaders enabling processes to evade their termination by offloading work to
+ * the GPU. So when a process is terminated with a SIGKILL the entity object
+ * makes sure that jobs are freed without running them while still maintaining
+ * correct sequential order for signaling fences.
+ */
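Perhaps a small example of the expected wiring would help here as well,
e.g. (foo_* names made up, the timeout value is arbitrary):

static int foo_drm_flush(struct file *file, fl_owner_t id)
{
        struct foo_file_priv *fpriv = foo_file_priv(file);

        /*
         * Give queued jobs a chance to reach the hardware; jobs of a
         * killed process are freed without being run, but fences still
         * signal in the correct order.
         */
        drm_sched_entity_flush(&fpriv->entity, msecs_to_jiffies(500));

        return 0;
}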
+
+/**
+ * DOC: Hardware Fence Object
+ *
+ * The hardware fence object is a DMA-fence provided by the driver as result of
+ * running jobs. Drivers need to make sure that the normal DMA-fence semantics
+ * are followed for this object. It's important to note that the memory for
+ * this object can *not* be allocated in the run_job callback since that would
+ * violate the requirements for the DMA-fence implementation. The scheduler
+ * maintains a timeout handler which triggers if this fence doesn't signal in
+ * a configurable time frame.
+ *
+ * The lifetime of this object follows DMA-fence ref-counting rules, the
+ * scheduler takes ownership of the reference returned by the driver and drops
+ * it when it's not needed any more.
+ */
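Maybe a short run_job example would make the ownership rule more
obvious, e.g. (foo_* names made up, the hardware fence is assumed to
have been allocated before drm_sched_job_arm()):

static struct dma_fence *foo_run_job(struct drm_sched_job *sched_job)
{
        struct foo_job *job = container_of(sched_job, struct foo_job, base);

        foo_ring_emit(job);

        /*
         * The fence was allocated up front; the reference returned here
         * is the one the scheduler takes ownership of.
         */
        return dma_fence_get(job->hw_fence);
}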
+
+/**
+ * DOC: Scheduler Fence Object
+ *
+ * The scheduler fence object which encapsulates the whole time from pushing
+ * the job into the scheduler until the hardware has finished processing it.
+ * This is internally managed by the scheduler, but drivers can grab additional
+ * reference to it after arming a job. The implementation provides DMA-fence
+ * interfaces for signaling both scheduling of a command submission as well as
+ * finishing of processing.
+ *
+ * The lifetime of this object also follows normal DMA-fence ref-counting
+ * rules. The finished fence is the one normally exposed outside of the
+ * scheduler, but the driver can grab references to both the scheduled as well
+ * as the finished fence when needed for pipe-lining optimizations.
+ */
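Maybe illustrate the pipelining case with a line or two, e.g. (sketch
only, &job->base being the embedded struct drm_sched_job):

        drm_sched_job_arm(&job->base);

        /*
         * Take an extra reference to the finished fence, e.g. to install
         * it on a buffer object or to return it to userspace as out-fence.
         */
        fence = dma_fence_get(&job->base.s_fence->finished);

        drm_sched_entity_push_job(&job->base);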
+
+/**
+ * DOC: Scheduler and Run Queue Objects
+ *
+ * The scheduler object itself does the actual work of selecting a job and
+ * pushing it to the hardware. Both FIFO and RR selection algorithm are
+ * supported, but FIFO is preferred for many use cases.
I suggest naming the use cases FIFO scheduling is preferred for and why.
If, instead, it's just a general recommendation, I also suggest
explaining why.
+ *
+ * The lifetime of the scheduler is managed by the driver using it. Before
+ * destroying the scheduler the driver must ensure that all hardware processing
+ * involving this scheduler object has finished by calling for example
+ * disable_irq(). It is *not* sufficient to wait for the hardware fence here
+ * since this doesn't guarantee that all callback processing has finished.
This is the part I'm most concerned about, since I feel like we leave
drivers "up in the air" entirely. Hence, I think here we need to be more
verbose and detailed about the options drivers have to ensure that.

For instance, let's assume we have the single-entity-per-scheduler
topology because the driver only uses the GPU scheduler to feed a
firmware scheduler with dynamically allocated ring buffers. In this case
the entity, scheduler and ring buffer are bound to the lifetime of a
userspace process.

What do we expect the driver to do if the userspace process is killed?
As you mentioned, only waiting for the ring to be idle (which implies
all HW fences are signalled) is not enough. This doesn't guarantee that
all the free_job() callbacks have been called yet, and hence stopping
the scheduler before the pending_list is actually empty would leak the
memory of the jobs still on it waiting to be freed.

I already brought this up when we were discussing Matt's Xe-inspired
scheduler patch series, and it seems there was no interest in providing
drivers with some common mechanism that guarantees that the pending_list
is empty. Hence, I really think we should at least give recommendations
on how drivers should deal with that.
+ *
+ * The run queue object is a container of entities for a certain priority
+ * level. This object is internally managed by the scheduler and drivers
+ * shouldn't touch them directly. The lifetime of run queues are bound to the
+ * schedulers lifetime.
I think we should also mention that we support a variable number of
run-queues, up to DRM_SCHED_PRIORITY_COUNT. Also, there is this weird
restriction on which priorities a driver can use when choosing fewer
than DRM_SCHED_PRIORITY_COUNT run-queues.

For instance, initializing the scheduler with a single run-queue
requires the corresponding entities to pick DRM_SCHED_PRIORITY_MIN,
otherwise we'll just fault since the priority is also used as an array
index into sched->sched_rq[].
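Maybe something like this would make for a good example for the single
run-queue case (hypothetical foo_* names, signature from memory):

        struct drm_gpu_scheduler *sched_list[] = { &foo->sched };

        /*
         * With a scheduler initialized with a single run-queue the entity
         * has to use DRM_SCHED_PRIORITY_MIN, since the priority doubles as
         * the index into sched->sched_rq[].
         */
        ret = drm_sched_entity_init(&foo->entity, DRM_SCHED_PRIORITY_MIN,
                                    sched_list, ARRAY_SIZE(sched_list), NULL);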
*/
/**
@@ -72,6 +166,42 @@
* limit.
*/
+/**
+ * DOC: Error and Timeout handling
+ *
+ * Errors schould be signaled by using dma_fence_set_error() on the hardware
+ * fence object before signaling it. Errors are then bubbled up from the
+ * hardware fence to the scheduler fence.
+ *
+ * The entity allows querying errors on the last run submission using the
+ * drm_sched_entity_error() function which can be used to cancel queued
+ * submissions in the run_job callback as well as preventing pushing further
+ * ones into the entity in the drivers submission function.
+ *
+ * When the hardware fence fails to signal in a configurable amount of time the
+ * timedout_job callback is issued. The driver should then follow the procedure
+ * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO:
+ * The timeout handler should probably switch to using the hardware fence as
+ * parameter instead of the job. Otherwise the handling will always race
+ * between timing out and signaling the fence).
+ *
+ * The scheduler also used to provided functionality for re-submitting jobs
+ * with replacing the hardware fence during reset handling. This functionality
+ * is now marked as deprecated. This has proven to be fundamentally racy and
+ * not compatible with DMA-fence rules and shouldn't be used in any new code.
+ *
+ * Additional there is the drm_sched_increase_karma() function which tries to
"Additionally"
+ * find the entity which submitted a job and increases it's 'karma'
+ * atomic variable to prevent re-submitting jobs from this entity. This has
+ * quite some overhead and re-submitting jobs is now marked as deprecated. So
+ * using this function is rather discouraged.
+ *
+ * Drivers can still re-create the GPU state should it be lost during timeout
+ * handling when they can guarantee that forward progress is made and this
+ * doesn't cause another timeout. But this is strongly hardware specific and
+ * out of the scope of the general GPU scheduler.
+ */
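A short example of the expected flow might help readers here too, e.g.
(sketch only, foo-style driver assumed, the error code is just an
example):

        /* In the reset/timeout path: mark the hardware fence as failed first. */
        dma_fence_set_error(job->hw_fence, -ETIME);
        dma_fence_signal(job->hw_fence);

        /* In the driver's submission function: reject new jobs on a broken entity. */
        ret = drm_sched_entity_error(entity);
        if (ret)
                return ret;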
+
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/completion.h>
--
2.34.1