Hi Christian,
Thanks for sending an update of this patch!
On Thu, Nov 16, 2023 at 03:15:46PM +0100, Christian König wrote:
Start to improve the scheduler document. Especially document the
lifetime of each of the objects as well as the restrictions around
DMA-fence handling and userspace compatibility.
v2: Some improvements suggested by Danilo, add section about error
handling.
Signed-off-by: Christian König <christian.koenig@xxxxxxx>
---
Documentation/gpu/drm-mm.rst | 36 +++++
drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
2 files changed, 188 insertions(+), 22 deletions(-)
diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index acc5901ac840..112463fa9f3a 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -552,12 +552,48 @@ Overview
.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Overview
+Job Object
+----------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Job Object
+
+Entity Object
+-------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Entity Object
+
+Hardware Fence Object
+---------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Hardware Fence Object
+
+Scheduler Fence Object
+----------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Scheduler Fence Object
+
+Scheduler and Run Queue Objects
+-------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Scheduler and Run Queue Objects
+
Flow Control
------------
.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Flow Control
+Error and Timeout handling
+--------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Error and Timeout handling
+
Scheduler Function References
-----------------------------
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 044a8c4875ba..026123497b0e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -24,28 +24,122 @@
/**
* DOC: Overview
*
- * The GPU scheduler provides entities which allow userspace to push jobs
- * into software queues which are then scheduled on a hardware run queue.
- * The software queues have a priority among them. The scheduler selects the entities
- * from the run queue using a FIFO. The scheduler provides dependency handling
- * features among jobs. The driver is supposed to provide callback functions for
- * backend operations to the scheduler like submitting a job to hardware run queue,
- * returning the dependencies of a job etc.
- *
- * The organisation of the scheduler is the following:
- *
- * 1. Each hw run queue has one scheduler
- * 2. Each scheduler has multiple run queues with different priorities
- *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
- * 3. Each scheduler run queue has a queue of entities to schedule
- * 4. Entities themselves maintain a queue of jobs that will be scheduled on
- *    the hardware.
- *
- * The jobs in a entity are always scheduled in the order that they were pushed.
- *
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * The GPU scheduler implements some logic to decide which command submission
+ * to push next to the hardware. Another major use case of the GPU scheduler
+ * is to enforce correct driver behavior around those command submissions.
+ * Because of this it's also used by drivers which don't need the actual
+ * scheduling functionality.
This reads a bit like we're already right in the middle of the
documentation. I'd propose to start with something like "The DRM GPU
scheduler serves as a common component intended to help drivers to manage
command submissions." And then mention the different solutions the
scheduler provides, e.g. ordering of command submissions, dma-fence
handling in the context of command submissions, etc.

Also, I think it would be good to give a rough overview of the topology
of the scheduler's components. And since you already mention that the
component is "also used by drivers which don't need the actual scheduling
functionality", I'd also mention the special case of a single entity and
a single run-queue per scheduler.
+ *
+ * All callbacks the driver needs to implement are restricted by DMA-fence
+ * signaling rules to guarantee deadlock free forward progress. This especially
+ * means that for normal operation no memory can be allocated in a callback.
+ * All memory which is needed for pushing the job to the hardware must be
+ * allocated before arming a job. It also means that no locks can be taken
+ * under which memory might be allocated as well.
I think that's good. That said, with the recently merged workqueue
patches, drivers can actually create a setup where the free_job callback
isn't part of the fence signalling critical path anymore. But I agree
with Sima that this is probably too error prone to give drivers ideas.
So, this paragraph is probably good as it is. :-)
+ *
+ * Memory which is optional to allocate, for example for device core dumping or
+ * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
+ * handling taking if that allocation fails. GFP_ATOMIC should only be used if
+ * absolutely necessary since dipping into the special atomic reserves is
+ * usually not justified for a GPU driver.
+ */
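Maybe it would also be worth adding a tiny example snippet for this
somewhere, something along these lines perhaps (completely made up,
foo_* names, the snapshot use case and the struct layout are just
placeholders):

static enum drm_gpu_sched_stat foo_timedout_job(struct drm_sched_job *sched_job)
{
        struct foo_job *job = container_of(sched_job, struct foo_job, base);
        struct foo_snapshot *snap;

        /*
         * Optional allocation for debugging only: never wait for reclaim
         * here, just continue without the snapshot if it fails.
         */
        snap = kzalloc(sizeof(*snap), GFP_NOWAIT);
        if (snap)
                foo_capture_snapshot(job, snap);

        /* ... reset the hardware and signal the fence with an error ... */

        return DRM_GPU_SCHED_STAT_NOMINAL;
}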
+
+/**
+ * DOC: Job Object
+ *
+ * The base job object contains submission dependencies in the form of DMA-fence
+ * objects. Drivers can also implement an optional prepare_job callback which
+ * returns additional dependencies as DMA-fence objects. It's important to note
+ * that this callback can't allocate memory or grab locks under which memory is
+ * allocated.
+ *
+ * Drivers should use this as base class for an object which contains the
+ * necessary state to push the command submission to the hardware.
+ *
+ * The lifetime of the job object should at least be from pushing it into the
"should at least last from"
+ * scheduler until the scheduler notes through the free callback that a job
What about "until the free_job callback has been called and hence the
scheduler
does not require the job object anymore."?
+ * isn't needed any more. Drivers can of course keep their job object alive
+ * longer than that, but that's outside of the scope of the scheduler
+ * component. Job initialization is split into two parts, drm_sched_job_init()
+ * and drm_sched_job_arm(). It's important to note that after arming a job
I suggest adding a brief comment on why job initialization is split up.
+ * drivers must follow the DMA-fence rules and can't easily allocate memory
+ * or takes locks under which memory is allocated.
+ */
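And maybe a tiny example of the intended ordering wouldn't hurt either,
roughly like this (hypothetical foo_* driver, most error handling
trimmed, the exact drm_sched_job_init() arguments depend on the kernel
version):

        /* In the driver's submission path, e.g. the submit IOCTL: */
        ret = drm_sched_job_init(&job->base, entity, owner);
        if (ret)
                return ret;

        /* Allocate everything the job will need while that is still allowed. */
        ret = foo_job_prepare_resources(job);
        if (ret) {
                drm_sched_job_cleanup(&job->base);
                return ret;
        }

        drm_sched_job_arm(&job->base);

        /* From here on the DMA-fence rules apply, no more memory allocation. */
        drm_sched_entity_push_job(&job->base);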
+
+/**
+ * DOC: Entity Object
+ *
+ * The entity object which is a container for jobs which should execute
+ * sequentially. Drivers should create an entity for each individual context
+ * they maintain for command submissions which can run in parallel.
+ *
+ * The lifetime of the entity should *not* exceed the lifetime of the
+ * userspace process it was created for and drivers should call the
+ * drm_sched_entity_flush() function from their file_operations.flush
+ * callback. So it's possible that an entity object is not alive any
"Note that it is possible..."
+ * more while jobs from it are still running on the hardware.
"while jobs previously fetched from this entity are still..."
+ *
+ * Background is that for compatibility reasons with existing
+ * userspace all results of a command submission should become visible
+ * externally even after after a process exits. This is normal POSIX behavior
+ * for I/O operations.
+ *
+ * The problem with this approach is that GPU submissions contain executable
+ * shaders enabling processes to evade their termination by offloading work to
+ * the GPU. So when a process is terminated with a SIGKILL the entity object
+ * makes sure that jobs are freed without running them while still maintaining
+ * correct sequential order for signaling fences.
+ */
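Perhaps a small example of the expected wiring would help here as well,
e.g. (foo_* names made up, the timeout value is arbitrary):

static int foo_drm_flush(struct file *file, fl_owner_t id)
{
        struct foo_file_priv *fpriv = foo_file_priv(file);

        /*
         * Give queued jobs a chance to reach the hardware; jobs of a
         * killed process are freed without being run, but fences still
         * signal in the correct order.
         */
        drm_sched_entity_flush(&fpriv->entity, msecs_to_jiffies(500));

        return 0;
}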
+
+/**
+ * DOC: Hardware Fence Object
+ *
+ * The hardware fence object is a DMA-fence provided by the driver as result of
+ * running jobs. Drivers need to make sure that the normal DMA-fence semantics
+ * are followed for this object. It's important to note that the memory for
+ * this object can *not* be allocated in the run_job callback since that would
+ * violate the requirements for the DMA-fence implementation. The scheduler
+ * maintains a timeout handler which triggers if this fence doesn't signal in
+ * a configurable time frame.
+ *
+ * The lifetime of this object follows DMA-fence ref-counting rules, the
+ * scheduler takes ownership of the reference returned by the driver and drops
+ * it when it's not needed any more.
+ */
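Maybe a short run_job example would make the ownership rule more
obvious, e.g. (foo_* names made up, the hardware fence is assumed to
have been allocated before drm_sched_job_arm()):

static struct dma_fence *foo_run_job(struct drm_sched_job *sched_job)
{
        struct foo_job *job = container_of(sched_job, struct foo_job, base);

        foo_ring_emit(job);

        /*
         * The fence was allocated up front; the reference returned here
         * is the one the scheduler takes ownership of.
         */
        return dma_fence_get(job->hw_fence);
}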
+
+/**
+ * DOC: Scheduler Fence Object
+ *
+ * The scheduler fence object which encapsulates the whole time from pushing
+ * the job into the scheduler until the hardware has finished processing it.
+ * This is internally managed by the scheduler, but drivers can grab additional
+ * reference to it after arming a job. The implementation provides DMA-fence
+ * interfaces for signaling both scheduling of a command submission as well as
+ * finishing of processing.
+ *
+ * The lifetime of this object also follows normal DMA-fence ref-counting
+ * rules. The finished fence is the one normally exposed outside of the
+ * scheduler, but the driver can grab references to both the scheduled as well
+ * as the finished fence when needed for pipe-lining optimizations.
+ */
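Maybe illustrate the pipelining case with a line or two, e.g. (sketch
only, &job->base being the embedded struct drm_sched_job):

        drm_sched_job_arm(&job->base);

        /*
         * Take an extra reference to the finished fence, e.g. to install
         * it on a buffer object or to return it to userspace as out-fence.
         */
        fence = dma_fence_get(&job->base.s_fence->finished);

        drm_sched_entity_push_job(&job->base);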
+
+/**
+ * DOC: Scheduler and Run Queue Objects
+ *
+ * The scheduler object itself does the actual work of selecting a job and
+ * pushing it to the hardware. Both FIFO and RR selection algorithm are
+ * supported, but FIFO is preferred for many use cases.
I suggest naming the use cases FIFO scheduling is preferred for and why.
If, instead, it's just a general recommendation, I also suggest
explaining why.
+ *
+ * The lifetime of the scheduler is managed by the driver using it. Before
+ * destroying the scheduler the driver must ensure that all hardware processing
+ * involving this scheduler object has finished by calling for example
+ * disable_irq(). It is *not* sufficient to wait for the hardware fence here
+ * since this doesn't guarantee that all callback processing has finished.
This is the part I'm most concerned about, since I feel like we leave
drivers "up in the air" entirely. Hence, I think here we need to be more
verbose and detailed about the options drivers have to ensure that.

For instance, let's assume we have the single-entity-per-scheduler
topology because the driver only uses the GPU scheduler to feed a
firmware scheduler with dynamically allocated ring buffers. In this case
the entity, scheduler and ring buffer are bound to the lifetime of a
userspace process.

What do we expect the driver to do if the userspace process is killed?
As you mentioned, only waiting for the ring to be idle (which implies
all HW fences are signalled) is not enough. This doesn't guarantee that
all the free_job() callbacks have been called yet, and hence stopping
the scheduler before the pending_list is actually empty would leak the
memory of the jobs still on it waiting to be freed.

I already brought this up when we were discussing Matt's Xe-inspired
scheduler patch series, and it seems there was no interest in providing
drivers with some common mechanism that guarantees that the pending_list
is empty. Hence, I really think we should at least give recommendations
on how drivers should deal with that.
+ *
+ * The run queue object is a container of entities for a certain priority
+ * level. This object is internally managed by the scheduler and drivers
+ * shouldn't touch them directly. The lifetime of run queues are bound to the
+ * schedulers lifetime.
I think we should also mention that we support a variable number of
run-queues, up to DRM_SCHED_PRIORITY_COUNT. Also, there is this weird
restriction on which priorities a driver can use when choosing fewer
than DRM_SCHED_PRIORITY_COUNT run-queues.

For instance, initializing the scheduler with a single run-queue
requires the corresponding entities to pick DRM_SCHED_PRIORITY_MIN,
otherwise we'll just fault since the priority is also used as an array
index into sched->sched_rq[].
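Maybe something like this would make for a good example for the single
run-queue case (hypothetical foo_* names, signature from memory):

        struct drm_gpu_scheduler *sched_list[] = { &foo->sched };

        /*
         * With a scheduler initialized with a single run-queue the entity
         * has to use DRM_SCHED_PRIORITY_MIN, since the priority doubles as
         * the index into sched->sched_rq[].
         */
        ret = drm_sched_entity_init(&foo->entity, DRM_SCHED_PRIORITY_MIN,
                                    sched_list, ARRAY_SIZE(sched_list), NULL);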
*/
/**
@@ -72,6 +166,42 @@
* limit.
*/
+/**
+ * DOC: Error and Timeout handling
+ *
+ * Errors schould be signaled by using dma_fence_set_error() on the hardware
+ * fence object before signaling it. Errors are then bubbled up from the
+ * hardware fence to the scheduler fence.
+ *
+ * The entity allows querying errors on the last run submission using the
+ * drm_sched_entity_error() function which can be used to cancel queued
+ * submissions in the run_job callback as well as preventing pushing further
+ * ones into the entity in the drivers submission function.
+ *
+ * When the hardware fence fails to signal in a configurable amount of time the
+ * timedout_job callback is issued. The driver should then follow the procedure
+ * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO:
+ * The timeout handler should probably switch to using the hardware fence as
+ * parameter instead of the job. Otherwise the handling will always race
+ * between timing out and signaling the fence).
+ *
+ * The scheduler also used to provided functionality for re-submitting jobs
+ * with replacing the hardware fence during reset handling. This functionality
+ * is now marked as deprecated. This has proven to be fundamentally racy and
+ * not compatible with DMA-fence rules and shouldn't be used in any new code.
+ *
+ * Additional there is the drm_sched_increase_karma() function which tries to
"Additionally"
+ * find the entity which submitted a job and increases it's 'karma'
+ * atomic variable to prevent re-submitting jobs from this entity. This has
+ * quite some overhead and re-submitting jobs is now marked as deprecated. So
+ * using this function is rather discouraged.
+ *
+ * Drivers can still re-create the GPU state should it be lost during timeout
+ * handling when they can guarantee that forward progress is made and this
+ * doesn't cause another timeout. But this is strongly hardware specific and
+ * out of the scope of the general GPU scheduler.
+ */
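A short example of the expected flow might help readers here too, e.g.
(sketch only, foo-style driver assumed, the error code is just an
example):

        /* In the reset/timeout path: mark the hardware fence as failed first. */
        dma_fence_set_error(job->hw_fence, -ETIME);
        dma_fence_signal(job->hw_fence);

        /* In the driver's submission function: reject new jobs on a broken entity. */
        ret = drm_sched_entity_error(entity);
        if (ret)
                return ret;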
+
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/completion.h>
--
2.34.1