Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 05.04.23 um 14:45 schrieb Daniel Vetter:
On Wed, Apr 05, 2023 at 02:39:35PM +0200, Christian König wrote:
Am 05.04.23 um 14:35 schrieb Thomas Hellström:
Hi,

On 4/4/23 21:25, Daniel Vetter wrote:
On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström
(Intel) wrote:
On 4/4/23 15:10, Christian König wrote:
Am 04.04.23 um 14:54 schrieb Thomas Hellström:
Hi, Christian,

On 4/4/23 11:09, Christian König wrote:
Am 04.04.23 um 02:22 schrieb Matthew Brost:
From: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>

For long-running workloads, drivers either need to open-code
completion
waits, invent their own synchronization
primitives or internally use
dma-fences that do not obey the cross-driver
dma-fence protocol, but
without any lockdep annotation all these
approaches are error prone.

So since for example the drm scheduler uses dma-fences it is
desirable for
a driver to be able to use it for throttling and error
handling also with
internal dma-fences tha do not obey the cros-driver
dma-fence protocol.

Introduce long-running completion fences in form of
dma-fences, and add
lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow to attach them to a dma-resv object.
* Introduce a new interface for adding callbacks making the
helper adding
     a callback sign off on that it is aware
that the dma-fence may not
     complete anytime soon. Typically this will be the
scheduler chaining
     a new long-running fence on another one.
Well that's pretty much what I tried before:
https://lwn.net/Articles/893704/

I don't think this quite the same, this explictly enforces that
we don't
break the dma-fence rules (in path of memory allocations, exported in
any way), essentially this just SW sync point reusing dma-fence the
infrastructure for signaling / callbacks. I believe your series
tried to
export these fences to user space (admittedly I haven't fully read your
series).

In this use case we essentially just want to flow control the ring via
the dma-scheduler + maintain a list of pending jobs so the TDR can be
used for cleanup if LR entity encounters an error. To me this seems
perfectly reasonable but I know dma-femce rules are akin to a holy war.

If we return NULL in run_job, now we have to be able to sink all jobs
in the backend regardless on ring space, maintain a list of jobs
pending
for cleanup after errors, and write a different cleanup path as now the
TDR doesn't work. Seems very, very silly to duplicate all of this code
when the DRM scheduler provides all of this for us. Also if we go this
route, now all drivers are going to invent ways to handle LR
jobs /w the
DRM scheduler.

This solution is pretty clear, mark the scheduler as LR, and don't
export any fences from the scheduler. If you try to export these fences
a blow up happens.
The problem is if you mix things up. Like for resets you need all the
schedulers on an engine/set-of-engines to quiescent or things get
potentially hilarious. If you now have a scheduler in forever limbo, the
dma_fence guarantees are right out the window.

But the issue you're having is fairly specific if it's just about
ringspace. I think the dumbest fix is to just block in submit if you run
out of per-ctx ringspace, and call it a day. This notion that
somehow the
kernel is supposed to provide a bottomless queue of anything userspace
submits simply doesn't hold up in reality (as much as userspace
standards
committees would like it to), and as long as it doesn't have a
real-world
perf impact it doesn't really matter why we end up blocking in the
submit
ioctl. It might also be a simple memory allocation that hits a snag in
page reclaim.
So it seems the discussion around the long-running synchronization
diverged a bit between threads and this thread was hijacked for
preempt-fences and userptr.

Do I understand it correctly that the recommendation from both Daniel
and Christian is to *not* use the drm scheduler for long-running compute
jobs, but track any internal dma-fence dependencies (pipelined clearing
or whatever) in a separate mechanism and handle unresolved dependencies
on other long-running jobs using -EWOULDBLOCK?
Yeah, I think that's a good summary.

If needed we could extract some scheduler functionality into separate
components, but the fundamental problem is that to the GPU scheduler
provides a dma_fence interface to the outside to signal job completion and
Daniel and I seem to agree that you really don't want that.
I think I'm on something slightly different:

- For anything which semantically is not a dma_fence I agree it probably
   should be handled with EWOULDBLOCK and passed to userspace. Either with
   a submit thread or userspace memory fences. Note that in practice you
   will have a bunch of blocking left in the ioctl, stuff like mutexes or
   memory allocations when things get really tight and you end up in
   synchronous reclaim. Not any different from userspace ending up in
   synchronous reclaim due to a page fault really. Trying to shoehorn
   userspace memory fences or anything else long-running into drm/sched
   dependency handling is just way too much a can of worms.

- For the memory management dependencies, which are all dma_fence when
   pipeline, I do think pushing them through the drm/sched makes sense. It
   has all the stuff to handle that already, plus it's imo also the ideal
   place to handle the preempt-ctx dma_fence scaffolding/semantics. Which
   would give you a really neatly unified command submission interface
   since in both cases (end-of-batch and long-running) you fish the
   dma_fence you need to stuff in all the right dma_resv object (for memory
   management purpose) out of the same place: The drm_sched_job struct.

So I'm _not_ on the "do not use drm/sched for long-running jobs at all".
That doesn't make much sense to me because you'll just reinventing the
exact same dma_fence dependency handling and memory management shuffling
we already have. That seems silly.

How about we stuff the functionality we still want to have into a drm_job object?

I mean that really isn't that much, basically just looking at drm_syncobj, dma_resv etc.. and extracting all the dependencies.

Christian.

-Daniel

Regards,
Christian.

Thanks,
Thomas





And the reasons why it was rejected haven't changed.

Regards,
Christian.

Yes, TBH this was mostly to get discussion going how we'd best
tackle this problem while being able to reuse the scheduler for
long-running workloads.

I couldn't see any clear decision on your series, though, but one
main difference I see is that this is intended for driver-internal
use only. (I'm counting using the drm_scheduler as a helper for
driver-private use). This is by no means a way to try tackle the
indefinite fence problem.
Well this was just my latest try to tackle this, but essentially the
problems are the same as with your approach: When we express such
operations as dma_fence there is always the change that we leak that
somewhere.

My approach of adding a flag noting that this operation
is dangerous and
can't be synced with something memory management depends on tried to
contain this as much as possible, but Daniel still pretty clearly
rejected it (for good reasons I think).

We could ofc invent a completely different data-type that abstracts
the synchronization the scheduler needs in the long-running case, or
each driver could hack something up, like sleeping in the
prepare_job() or run_job() callback for throttling, but those waits
should still be annotated in one way or annotated one way or another
(and probably in a similar way across drivers) to make sure we don't
do anything bad.

   So any suggestions as to what would be the better solution here
would be appreciated.
Mhm, do we really the the GPU scheduler for that?

I think we need to solve this within the DRM scheduler one way or
another.
Yeah so if we conclude that the queue really must be bottomless then I
agree drm-sched should help out sort out the mess. Because I'm guessing
that every driver will have this issue. But that's a big if.

I guess if we teach the drm scheduler that some jobs are fairly endless
then maybe it wouldn't be too far-fetched to also teach it to wait for a
previous one to finish (but not with the dma_fence that preempts,
which we
put into the dma_resv for memory management, but some other struct
completion). The scheduler already has a concept of not stuffing too
much
stuff into the same queue after all, so this should fit?
-Daniel


I mean in the 1 to 1 case  you basically just need a component which
collects the dependencies as dma_fence and if all of
them are fulfilled
schedules a work item.

As long as the work item itself doesn't produce a
dma_fence it can then
still just wait for other none dma_fence dependencies.

Then the work function could submit the work and wait for the result.

The work item would then pretty much represent what you want, you can
wait for it to finish and pass it along as long running dependency.

Maybe give it a funky name and wrap it up in a structure, but that's
basically it.

This very much sounds like a i915_sw_fence for the
dependency tracking and
dma_fence_work for the actual work although it's completion fence is a
dma_fence.

Agree this does sound to i915ish as stated below one of mandates in Xe
was to use the DRM scheduler. Beyond that as someone who a submission
backend in the i915 and Xe, I love how the DRM scheduler works (single
entry point), it makes everything so much easier.

Matt

Although that goes against the whole idea of a condition for
merging the xe
driver would be that we implement some sort of minimal scaffolding for
long-running workloads in the drm scheduler, and the
thinking behind that is
to avoid implementing intel-specific solutions like those...

Thanks,

Thomas



Regards,
Christian.

Thanks,

Thomas









[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux