On Tue, Apr 04, 2023 at 10:31:58PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 08:19:37PM +0000, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote:
> > > On Tue, 4 Apr 2023 at 22:04, Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
> > > > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> > > > > On Tue, 4 Apr 2023 at 15:10, Christian König <christian.koenig@xxxxxxx> wrote:
> > > > > > On 04.04.23 at 14:54, Thomas Hellström wrote:
> > > > > > > Hi, Christian,
> > > > > > >
> > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > >> On 04.04.23 at 02:22, Matthew Brost wrote:
> > > > > > >>> From: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > > > >>>
> > > > > > >>> For long-running workloads, drivers either need to open-code completion waits, invent their own synchronization primitives or internally use dma-fences that do not obey the cross-driver dma-fence protocol, but without any lockdep annotation all these approaches are error prone.
> > > > > > >>>
> > > > > > >>> So since, for example, the drm scheduler uses dma-fences, it is desirable for a driver to be able to use it for throttling and error handling also with internal dma-fences that do not obey the cross-driver dma-fence protocol.
> > > > > > >>>
> > > > > > >>> Introduce long-running completion fences in the form of dma-fences, and add lockdep annotation for them. In particular:
> > > > > > >>>
> > > > > > >>> * Do not allow waiting under any memory management locks.
> > > > > > >>> * Do not allow attaching them to a dma-resv object.
> > > > > > >>> * Introduce a new interface for adding callbacks, making the helper that adds a callback sign off on being aware that the dma-fence may not complete anytime soon. Typically this will be the scheduler chaining a new long-running fence onto another one.
> > > > > > >>
> > > > > > >> Well that's pretty much what I tried before:
> > > > > > >> https://lwn.net/Articles/893704/
> > > > > > >>
> > > > > > >> And the reasons why it was rejected haven't changed.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Christian.
> > > > > > >
> > > > > > > Yes, TBH this was mostly to get discussion going on how we'd best tackle this problem while being able to reuse the scheduler for long-running workloads.
> > > > > > >
> > > > > > > I couldn't see any clear decision on your series, though, but one main difference I see is that this is intended for driver-internal use only. (I'm counting using the drm_scheduler as a helper for driver-private use.) This is by no means a way to try to tackle the indefinite fence problem.
> > > > > >
> > > > > > Well this was just my latest try to tackle this, but essentially the problems are the same as with your approach: when we express such operations as dma_fence there is always the chance that we leak that somewhere.
> > > > > >
> > > > > > My approach of adding a flag noting that this operation is dangerous and can't be synced with something memory management depends on tried to contain this as much as possible, but Daniel still pretty clearly rejected it (for good reasons I think).
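(Interjecting with a sketch, since the third bullet in the quoted commit message above is a bit abstract: the "sign off on adding a callback" interface could be roughly a wrapper like the below. All names here are made up for illustration, this is not necessarily what the patch actually adds.)

#include <linux/dma-fence.h>

/* Hypothetical flag marking a driver-internal long-running fence. */
#define MY_FENCE_FLAG_LONG_RUNNING	DMA_FENCE_FLAG_USER_BITS

/*
 * Same as dma_fence_add_callback(), but the caller explicitly signs off
 * on the fence possibly never completing, which is why it must never be
 * attached to a dma_resv or waited on under memory management locks.
 */
static inline int
dma_fence_lr_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
			  dma_fence_func_t func)
{
	/* Refuse fences that were not created as long-running. */
	if (WARN_ON(!test_bit(MY_FENCE_FLAG_LONG_RUNNING, &fence->flags)))
		return -EINVAL;

	return dma_fence_add_callback(fence, cb, func);
}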
> > > > > Yeah I still don't like dma_fences that somehow have totally different semantics in that critical piece of "will it complete or will it deadlock?" :-)
> > > >
> > > > Not going to touch LR dma-fences in this reply, I think we can continue the LR fence discussion in the fork of this thread I just responded to. Have a response to the preempt fence discussion below.
> > > >
> > > > > > > We could ofc invent a completely different data-type that abstracts the synchronization the scheduler needs in the long-running case, or each driver could hack something up, like sleeping in the prepare_job() or run_job() callback for throttling, but those waits should still be annotated one way or another (and probably in a similar way across drivers) to make sure we don't do anything bad.
> > > > > > >
> > > > > > > So any suggestions as to what would be the better solution here would be appreciated.
> > > > > >
> > > > > > Mhm, do we really need the GPU scheduler for that?
> > > > > >
> > > > > > I mean in the 1 to 1 case you basically just need a component which collects the dependencies as dma_fence and if all of them are fulfilled schedules a work item.
> > > > > >
> > > > > > As long as the work item itself doesn't produce a dma_fence it can then still just wait for other non-dma_fence dependencies.
> > > > >
> > > > > Yeah that's the important thing, for long-running jobs dependencies as dma_fence should be totally fine. You're just not allowed to have any outgoing dma_fences at all (except the magic preemption fence).
> > > > >
> > > > > > Then the work function could submit the work and wait for the result.
> > > > > >
> > > > > > The work item would then pretty much represent what you want, you can wait for it to finish and pass it along as a long-running dependency.
> > > > > >
> > > > > > Maybe give it a funky name and wrap it up in a structure, but that's basically it.
> > > > >
> > > > > Like do we need this? If the kernel ever waits for a long-running compute job to finish I'd call that a bug. Any functional dependencies between engines or whatever are userspace's problem only, which it needs to sort out using userspace memory fences.
> > > > >
> > > > > The only things the kernel needs are some way to track dependencies as dma_fence (because memory management may move the memory away and we need to move it back in, ideally pipelined). And it needs the special preempt fence (if we don't have pagefaults) so that you have a fence to attach to all the dma_resv for memory management purposes. Now the scheduler already has almost all the pieces (at least if we assume there's some magic fw which time-slices these contexts on its own), and we just need a few minimal changes:
> > > > > - allowing the scheduler to ignore the completion fence and just immediately push the next "job" in if its dependencies are ready
> > > > > - maybe minimal amounts of scaffolding to handle the preemption dma_fence because that's not entirely trivial. I think ideally we'd put that into drm_sched_entity since you can only ever have one active preempt dma_fence per gpu ctx/entity.
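(For reference, the "collect the dependencies as dma_fence and kick a work item" component Christian describes above could look roughly like this. Everything below is a made-up sketch, not an existing helper; locking and cleanup are hand-waved.)

#include <linux/dma-fence.h>
#include <linux/workqueue.h>
#include <linux/atomic.h>
#include <linux/slab.h>

struct lr_job {
	struct work_struct work;	/* submits the long-running job */
	atomic_t deps_pending;		/* outstanding deps + 1 for arm() */
};

struct lr_dep {
	struct dma_fence_cb cb;
	struct lr_job *job;
};

static void lr_dep_signalled(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct lr_dep *dep = container_of(cb, struct lr_dep, cb);

	/* Last dependency gone: hand the job off to the work item. */
	if (atomic_dec_and_test(&dep->job->deps_pending))
		queue_work(system_unbound_wq, &dep->job->work);
	kfree(dep);
}

/* Call with deps_pending initialized to 1 and work set up via INIT_WORK(). */
static int lr_job_add_dep(struct lr_job *job, struct dma_fence *fence)
{
	struct lr_dep *dep = kzalloc(sizeof(*dep), GFP_KERNEL);
	int ret;

	if (!dep)
		return -ENOMEM;

	dep->job = job;
	atomic_inc(&job->deps_pending);

	ret = dma_fence_add_callback(fence, &dep->cb, lr_dep_signalled);
	if (ret) {	/* -ENOENT: fence already signalled, else real error */
		atomic_dec(&job->deps_pending);
		kfree(dep);
	}
	return ret == -ENOENT ? 0 : ret;
}

/* Drop the initial count; queues the work if all deps already signalled. */
static void lr_job_arm(struct lr_job *job)
{
	if (atomic_dec_and_test(&job->deps_pending))
		queue_work(system_unbound_wq, &job->work);
}

The work function is then free to block on userspace memory fences or any other non-dma_fence dependency, since nothing downstream ever sees it as a dma_fence.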
> > > > Yep, preempt fence is per entity in Xe (xe_engine). We install these into the VM and all external BOs mapped in the VM dma-resv slots. Wondering if we can make all of this very generic between the DRM scheduler + GPUVA...
> > >
> > > I think if the drm/sched just takes care of the preempt ctx dma_fence (and still stores it in the same slot in the drm_sched_job struct like an end-of-batch dma_fence job would), and then the gpuva shared code just has functions to smash these into the right dma_resv structures, then you have all the pieces. Maybe for a bit more flexibility the gpuva code takes dma_fence and not directly the drm_sched_job, but maybe even that level of integration makes sense (but imo optional, a bit of driver glue code is fine).
> > >
> > > Yeah that's roughly what I think we should at least aim for since there's quite a few drivers in-flight that all need these pieces (more or less at least).
> >
> > That is very close to what I'm thinking too, we want to tackle userptr + GPUVA too, think that will be next but can add this to the list of things to do.
>
> I discussed userptr+gpuva a bit with Rodrigo (and maybe Thomas H, not sure anymore) and it sounded a bit like that's maybe a bridge too far. At least until we have some other drivers that also need that combo. But can't hurt to at least think how it would ideally integrate from xe's pov.
> -Daniel

I spoke with dakr about this today on IRC; Nouveau is going to implement userptr soon. I think the idea would be for Xe and Nouveau to collaborate on what we stick in GPUVA for userptr + whether we have common DRM helper functions. We may land on something really small (e.g. we store the userptr address with a NULL gem in the gpuva structure) or we might land on common locking, page population, and MMU notifier handling? Interested to see where we land. Rough sketches of the preempt-fence installation and of the minimal userptr-in-gpuva idea are appended at the bottom of this mail.

Matt

> > Matt
> >
> > > -Daniel
> > >
> > > > Matt
> > > >
> > > > > None of this needs a dma_fence_is_lr anywhere at all.
> > > > >
> > > > > Of course there's the somewhat related issue of "how do we transport these userspace memory fences around from app to compositor", and that's a lot more gnarly. I still don't think dma_fence_is_lr is anywhere near what the solution should look like for that.
> > > > > -Daniel
> > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > > > Thanks,
> > > > > > > Thomas
> > > > >
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
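Sketch 1, preempt fence installation. Roughly what the common gpuva-side helper Daniel mentions ("smash these into the right dma_resv structures") could do. The vm/vma structures and the extobj list are made up for illustration; only dma_resv_add_fence() and DMA_RESV_USAGE_BOOKKEEP are existing kernel API, and using the BOOKKEEP slot for preempt fences is an assumption on my side.

#include <linux/dma-resv.h>
#include <linux/dma-fence.h>
#include <linux/list.h>
#include <drm/drm_gem.h>

/* Made-up driver structures purely for illustration. */
struct example_vma {
	struct drm_gem_object *obj;
	struct list_head extobj_link;	/* on vm->extobj_list if obj is shared */
};

struct example_vm {
	struct dma_resv *resv;		/* shared by all VM-private BOs */
	struct list_head extobj_list;	/* externally shared BOs mapped in this VM */
};

/*
 * Install the per-entity preempt fence into every reservation object the
 * VM's memory management cares about. Caller holds all the resv locks and
 * has already reserved fence slots via dma_resv_reserve_fences().
 */
static void example_vm_install_preempt_fence(struct example_vm *vm,
					     struct dma_fence *preempt_fence)
{
	struct example_vma *vma;

	/* VM-private BOs all share the VM's reservation object. */
	dma_resv_add_fence(vm->resv, preempt_fence, DMA_RESV_USAGE_BOOKKEEP);

	/* Externally shared BOs have their own resv and need the fence too. */
	list_for_each_entry(vma, &vm->extobj_list, extobj_link)
		dma_resv_add_fence(vma->obj->resv, preempt_fence,
				   DMA_RESV_USAGE_BOOKKEEP);
}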
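Sketch 2, the "really small" userptr variant. A userptr mapping would just be a gpuva with no GEM object behind it, plus whatever the driver needs for page population and MMU-notifier-driven invalidation. The struct below is made up to illustrate the shape, it is not an existing Xe or Nouveau structure, and the drm_gpuva embedding/field names are whatever dakr's in-flight GPUVA series ends up with.

#include <linux/mmu_notifier.h>
#include <linux/scatterlist.h>
/* drm_gpuva comes from the in-flight GPUVA manager series; header name TBD. */

struct example_userptr_vma {
	struct drm_gpuva gpuva;			/* NULL gem object marks this as userptr */
	unsigned long userptr;			/* CPU VA backing this GPU mapping */
	struct mmu_interval_notifier notifier;	/* invalidation -> unmap/rebind */
	struct sg_table *sgt;			/* pages gathered at bind/fault time */
	bool needs_repin;			/* set by the notifier, checked at exec */
};

The common-helper version would then grow shared locking, hmm_range_fault()-based page population and notifier handling around that, which is exactly where it stops being small.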