On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote: > On Tue, 4 Apr 2023 at 15:10, Christian König <christian.koenig@xxxxxxx> wrote: > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström: > > > Hi, Christian, > > > > > > On 4/4/23 11:09, Christian König wrote: > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost: > > >>> From: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> > > >>> > > >>> For long-running workloads, drivers either need to open-code completion > > >>> waits, invent their own synchronization primitives or internally use > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but > > >>> without any lockdep annotation all these approaches are error prone. > > >>> > > >>> So since for example the drm scheduler uses dma-fences it is > > >>> desirable for > > >>> a driver to be able to use it for throttling and error handling also > > >>> with > > >>> internal dma-fences tha do not obey the cros-driver dma-fence protocol. > > >>> > > >>> Introduce long-running completion fences in form of dma-fences, and add > > >>> lockdep annotation for them. In particular: > > >>> > > >>> * Do not allow waiting under any memory management locks. > > >>> * Do not allow to attach them to a dma-resv object. > > >>> * Introduce a new interface for adding callbacks making the helper > > >>> adding > > >>> a callback sign off on that it is aware that the dma-fence may not > > >>> complete anytime soon. Typically this will be the scheduler chaining > > >>> a new long-running fence on another one. > > >> > > >> Well that's pretty much what I tried before: > > >> https://lwn.net/Articles/893704/ > > >> > > >> And the reasons why it was rejected haven't changed. > > >> > > >> Regards, > > >> Christian. > > >> > > > Yes, TBH this was mostly to get discussion going how we'd best tackle > > > this problem while being able to reuse the scheduler for long-running > > > workloads. > > > > > > I couldn't see any clear decision on your series, though, but one main > > > difference I see is that this is intended for driver-internal use > > > only. (I'm counting using the drm_scheduler as a helper for > > > driver-private use). This is by no means a way to try tackle the > > > indefinite fence problem. > > > > Well this was just my latest try to tackle this, but essentially the > > problems are the same as with your approach: When we express such > > operations as dma_fence there is always the change that we leak that > > somewhere. > > > > My approach of adding a flag noting that this operation is dangerous and > > can't be synced with something memory management depends on tried to > > contain this as much as possible, but Daniel still pretty clearly > > rejected it (for good reasons I think). > > Yeah I still don't like dma_fence that somehow have totally different > semantics in that critical piece of "will it complete or will it > deadlock?" :-) Not going to touch LR dma-fences in this reply, I think we can continue the LR fence discussion of the fork of this thread I just responded to. Have a response the preempt fence discussion below. > > > > > > > > We could ofc invent a completely different data-type that abstracts > > > the synchronization the scheduler needs in the long-running case, or > > > each driver could hack something up, like sleeping in the > > > prepare_job() or run_job() callback for throttling, but those waits > > > should still be annotated in one way or annotated one way or another > > > (and probably in a similar way across drivers) to make sure we don't > > > do anything bad. > > > > > > So any suggestions as to what would be the better solution here would > > > be appreciated. > > > > Mhm, do we really the the GPU scheduler for that? > > > > I mean in the 1 to 1 case you basically just need a component which > > collects the dependencies as dma_fence and if all of them are fulfilled > > schedules a work item. > > > > As long as the work item itself doesn't produce a dma_fence it can then > > still just wait for other none dma_fence dependencies. > > Yeah that's the important thing, for long-running jobs dependencies as > dma_fence should be totally fine. You're just not allowed to have any > outgoing dma_fences at all (except the magic preemption fence). > > > Then the work function could submit the work and wait for the result. > > > > The work item would then pretty much represent what you want, you can > > wait for it to finish and pass it along as long running dependency. > > > > Maybe give it a funky name and wrap it up in a structure, but that's > > basically it. > > Like do we need this? If the kernel ever waits for a long-running > compute job to finnish I'd call that a bug. Any functional > dependencies between engines or whatever are userspace's problem only, > which it needs to sort out using userspace memory fences. > > The only things the kernel needs are some way to track dependencies as > dma_fence (because memory management move the memory away and we need > to move it back in, ideally pipelined). And it needs the special > preempt fence (if we don't have pagefaults) so that you have a fence > to attach to all the dma_resv for memory management purposes. Now the > scheduler already has almost all the pieces (at least if we assume > there's some magic fw which time-slices these contexts on its own), > and we just need a few minimal changes: > - allowing the scheduler to ignore the completion fence and just > immediately push the next "job" in if its dependencies are ready > - maybe minimal amounts of scaffolding to handle the preemption > dma_fence because that's not entirely trivial. I think ideally we'd > put that into drm_sched_entity since you can only ever have one active > preempt dma_fence per gpu ctx/entity. > Yep, preempt fence is per entity in Xe (xe_engine). We install these into the VM and all external BOs mapped in the VM dma-resv slots. Wondering if we can make all of this very generic between the DRM scheduler + GPUVA... Matt > None of this needs a dma_fence_is_lr anywhere at all. > > Of course there's the somewhat related issue of "how do we transport > these userspace memory fences around from app to compositor", and > that's a lot more gnarly. I still don't think dma_fence_is_lr is > anywhere near what the solution should look like for that. > -Daniel > > > > Regards, > > Christian. > > > > > > > > Thanks, > > > > > > Thomas > > > > > > > > > > > > > > > > > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch