Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

Christian König <christian.koenig@xxxxxxx> · Thu, 6 Apr 2023 12:04:48 +0200

Am 05.04.23 um 20:53 schrieb Matthew Brost:
On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
Hi,

Using dma-fence for completion/dependency tracking for long-run workload(more precisely on-demand paging/page fault enabled workload) can cause deadlock. This seems the significant issue here. Other issues such as the drm scheduler completion order implication etc are minors which can be solve inside the framework of drm scheduler. We need to evaluate below paths:

	1) still use drm scheduler for job submission, and use dma-fence for job completion waiting/dependency tracking. This is solution proposed in this series. Annotate dma-fence for long-run workload: user can still wait dma-fence for job completion but can't wait dma-fence while holding any memory management locks.  We still use dma-fence for dependency tracking. But it is just very easily run into deadlock when on-demand paging is in the picture. The annotation helps us to detect deadlock but not solve deadlock problems. Seems *not* a complete solution: It is almost impossible to completely avoid dependency deadlock in complex runtime environment

No one can wait on LR fence, so it is impossible to deadlock. The
annotations enforce this. Literally this is only for flow controling the
ring / hold pending jobs in in the DRM schedule list.

You can still have someone depend on the LR fence and cause a deadlock 
without waiting on it.

See my attempted solution to this problem. It tries to inherit the LR 
flag of a fence when something depended on it.

For example if you create a fence container and one of the fences inside 
the container is a LR fence then the container itself would be an LR 
fence as well.

	2) Still use drm scheduler but not use dma-fence for completion signaling and dependency tracking. This way we still get some free functions (reset, err handling ring flow control as Matt said)from drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user space fence. User space doesn't have chance to wait fence while holding a kernel memory management lock, thus the dma-fence deadlock issue is solved.

We use user space fence for syncs.

	3) Completely discard drm scheduler and dma-fence for long-run workload. Use user queue/doorbell for super fast submission, directly interact with fw scheduler. Use user fence for completion/dependency tracking.

This is a hard no from me, I want 1 submission path in Xe. Either we use
the DRM scheduler or we don't.

Well I don't think that this will be acceptable. Especially if you not 
only have long running submission, but things like page faults/HMM in 
those jobs.

Regards,
Christian.

Matt

Thanks,
Oak

-----Original Message-----
From: Christian König <christian.koenig@xxxxxxx>
Sent: April 5, 2023 3:30 AM
To: Brost, Matthew <matthew.brost@xxxxxxxxx>; Zeng, Oak
<oak.zeng@xxxxxxxxx>
Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx;
robdclark@xxxxxxxxxxxx; thomas.hellstrom@xxxxxxxxxxxxxxx; airlied@xxxxxxxx;
lina@xxxxxxxxxxxxx; boris.brezillon@xxxxxxxxxxxxx; faith.ekstrand@xxxxxxxxxxxxx
Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
plans

Am 04.04.23 um 20:08 schrieb Matthew Brost:
On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
Hi Matt, Thomas,

Some very bold out of box thinking in this area:

1. so you want to use drm scheduler and dma-fence for long running workload.
Why you want to do this in the first place? What is the benefit? Drm scheduler is
pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
level, as you said below for intel this is Guc. Can xe driver just directly submit job
to Guc, bypassing drm scheduler?
If we did that now we have 2 paths for dependency track, flow controling
the ring, resets / error handling / backend submission implementations.
We don't want this.
Well exactly that's the point: Why?

As far as I can see that are two completely distinct use cases, so you
absolutely do want two completely distinct implementations for this.

2. using dma-fence for long run workload: I am well aware that page fault (and
the consequent memory allocation/lock acquiring to fix the fault) can cause
deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be
used purely because the nature of the workload that it runs very long (indefinite).
I did a math: the dma_fence_wait_timeout function's third param is the timeout
which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long
enough, can we just change the timeout parameter to signed 64 bits so it is much
longer than our life time...
So I mainly argue we can't use dma-fence for long-run workload is not
because the workload runs very long, rather because of the fact that we use
page fault for long-run workload. If we enable page fault for short-run workload,
we can't use dma-fence either. Page fault is the key thing here.
Now since we use page fault which is *fundamentally* controversial with
dma-fence design, why now just introduce a independent concept such as user-
fence instead of extending existing dma-fence?
I like unified design. If drm scheduler, dma-fence can be extended to work for
everything, it is beautiful. But seems we have some fundamental problem here.
Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
the signal / CB infrastructure) and enforce we don't use use these
dma-fences from the scheduler in memory reclaim paths or export these to
user space or other drivers. Think of this mode as SW only fence.
Yeah and I truly think this is an really bad idea.

The signal/CB infrastructure in the dma_fence turned out to be the
absolutely nightmare I initially predicted. Sorry to say that, but in
this case the "I've told you so" is appropriate in my opinion.

If we need infrastructure for long running dependency tracking we should
encapsulate that in a new framework and not try to mangle the existing
code for something it was never intended for.

Christian.

Matt

Thanks,
Oak

-----Original Message-----
From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
Matthew Brost
Sent: April 3, 2023 8:22 PM
To: dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx
Cc: robdclark@xxxxxxxxxxxx; thomas.hellstrom@xxxxxxxxxxxxxxx;
airlied@xxxxxxxx;
lina@xxxxxxxxxxxxx; boris.brezillon@xxxxxxxxxxxxx; Brost, Matthew
<matthew.brost@xxxxxxxxx>; christian.koenig@xxxxxxx;
faith.ekstrand@xxxxxxxxxxxxx
Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
plans
Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can we thought of as 4 parts detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC) which is a new paradigm WRT to the DRM scheduler and presents
severals problems as the DRM was originally designed to schedule jobs on
hardware queues. The main problem being that DRM scheduler expects the
submission order of jobs to be the completion order of jobs even across
multiple entities. This assumption falls apart with a firmware scheduler
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution for was originally thought of by Faith during
the initial prototype of Xe, create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers usage with a 1 to 1
relationship and can be thought of as using scheduler as a dependency /
infligt job tracker rather than a true scheduler.

- Generic messaging interface for DRM scheduler

Idea is to be able to communicate to the submission backend with in band
(relative to main execution function) messages. Messages are backend
defined and flexable enough for any use case. In Xe we use these
messages to clean up entites, set properties for entites, and suspend /
resume execution of an entity [5]. I suspect other driver can leverage
this messaging concept too as it a convenient way to avoid races in the
backend.

- Support for using TDR for all error paths of a scheduler / entity

Fix a few races / bugs, add function to dynamically set the TDR timeout.

- Annotate dma-fences for long running workloads.

The idea here is to use dma-fences only as sync points within the
scheduler and never export them for long running workloads. By
annotating these fences as long running we ensure that these dma-fences
are never used in a way that breaks the dma-fence rules. A benefit of
thus approach is the scheduler can still safely flow control the
execution ring buffer via the job limit without breaking the dma-fence
rules.

Again this a first draft and looking forward to feedback.

Enjoy - Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel
[2] https://patchwork.freedesktop.org/series/112188/
[3] https://patchwork.freedesktop.org/series/114772/
[4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
[5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031

Matthew Brost (8):
    drm/sched: Convert drm scheduler to use a work queue rather than
      kthread
    drm/sched: Move schedule policy to scheduler / entity
    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
    drm/sched: Add generic scheduler message interface
    drm/sched: Start run wq before TDR in drm_sched_start
    drm/sched: Submit job before starting TDR
    drm/sched: Add helper to set TDR timeout
    drm/syncobj: Warn on long running dma-fences

Thomas Hellström (2):
    dma-buf/dma-fence: Introduce long-running completion fences
    drm/sched: Support long-running sched entities

   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
   drivers/dma-buf/dma-resv.c                  |   5 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
   drivers/gpu/drm/drm_syncobj.c               |   5 +-
   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
   include/drm/gpu_scheduler.h                 | 130 +++++++--
   include/linux/dma-fence.h                   |  60 ++++-
   16 files changed, 649 insertions(+), 184 deletions(-)

--
2.34.1