On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > >
> > > On 4/4/23 15:10, Christian König wrote:
> > > > On 04.04.23 14:54, Thomas Hellström wrote:
> > > > > Hi, Christian,
> > > > >
> > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > On 04.04.23 02:22, Matthew Brost wrote:
> > > > > > > From: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > > > >
> > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > completion waits, invent their own synchronization primitives or
> > > > > > > internally use dma-fences that do not obey the cross-driver
> > > > > > > dma-fence protocol, but without any lockdep annotation all these
> > > > > > > approaches are error prone.
> > > > > > >
> > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > desirable for a driver to be able to use it for throttling and
> > > > > > > error handling also with internal dma-fences that do not obey the
> > > > > > > cross-driver dma-fence protocol.
> > > > > > >
> > > > > > > Introduce long-running completion fences in the form of
> > > > > > > dma-fences, and add lockdep annotation for them. In particular:
> > > > > > >
> > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > * Introduce a new interface for adding callbacks, making the
> > > > > > >   helper adding a callback sign off on being aware that the
> > > > > > >   dma-fence may not complete anytime soon. Typically this will be
> > > > > > >   the scheduler chaining a new long-running fence on another one.
> > > > > >
> > > > > > Well that's pretty much what I tried before:
> > > > > > https://lwn.net/Articles/893704/
> >
> > I don't think this is quite the same: this explicitly enforces that we
> > don't break the dma-fence rules (in the path of memory allocations,
> > exported in any way); essentially this is just a SW sync point reusing
> > the dma-fence infrastructure for signaling / callbacks. I believe your
> > series tried to export these fences to user space (admittedly I haven't
> > fully read your series).
> >
> > In this use case we essentially just want to flow control the ring via
> > the DRM scheduler + maintain a list of pending jobs so the TDR can be
> > used for cleanup if an LR entity encounters an error. To me this seems
> > perfectly reasonable, but I know the dma-fence rules are akin to a holy
> > war.
> >
> > If we return NULL in run_job, now we have to be able to sink all jobs
> > in the backend regardless of ring space, maintain a list of jobs
> > pending cleanup after errors, and write a different cleanup path as now
> > the TDR doesn't work. It seems very, very silly to duplicate all of
> > this code when the DRM scheduler already provides it for us. Also, if
> > we go this route, all drivers are going to invent their own ways to
> > handle LR jobs with the DRM scheduler.
> >
> > This solution is pretty clear: mark the scheduler as LR and don't
> > export any fences from the scheduler. If you try to export these
> > fences, a blow up happens.
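
To be concrete about what "blow up" means, the shape of the annotation is
roughly the sketch below. The flag and helper names are made up for
illustration and are not the actual interface from this series; the point
is just that a long-running fence gets tagged, the dma-resv attach path
refuses it, and callbacks can only be added through a helper whose caller
signs off on the fence possibly never completing in bounded time.

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

/* Hypothetical driver-private flag in the fence's user bits. */
#define LR_FENCE_FLAG		DMA_FENCE_FLAG_USER_BITS

static inline bool lr_fence_is_lr(struct dma_fence *fence)
{
	return test_bit(LR_FENCE_FLAG, &fence->flags);
}

/* A checked dma-resv attach path would reject LR fences outright. */
static int lr_resv_add_fence(struct dma_resv *resv, struct dma_fence *fence,
			     enum dma_resv_usage usage)
{
	if (WARN_ON(lr_fence_is_lr(fence)))
		return -EINVAL;

	dma_resv_add_fence(resv, fence, usage);
	return 0;
}

/*
 * Only this helper may attach callbacks to an LR fence; using it is the
 * caller's sign-off that the fence may not signal anytime soon. The real
 * series presumably also annotates/checks here; this only shows the shape.
 */
static int lr_fence_add_callback(struct dma_fence *fence,
				 struct dma_fence_cb *cb,
				 dma_fence_func_t func)
{
	return dma_fence_add_callback(fence, cb, func);
}

Anything else that sees an LR fence in the cross-driver paths then trips a
WARN / lockdep splat instead of silently relying on it.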
> The problem is if you mix things up. Like for resets you need all the
> schedulers on an engine/set-of-engines to quiesce or things get
> potentially hilarious. If you now have a scheduler in forever limbo, the
> dma_fence guarantees are right out the window.

Right, a GT reset on Xe is:

Stop all schedulers
Do a reset
Ban any schedulers which we think caused the GT reset
Resubmit all schedulers which we think were good
Restart all schedulers

None of this flow depends on LR dma-fences; all of it uses the DRM sched
infrastructure and works very well compared to the i915. Rewriting all of
this with a driver-specific implementation is what we are trying to
avoid.

Similarly, if an LR entity hangs on its own (not a GT reset, rather the
firmware does the reset for us) we use all the DRM scheduler
infrastructure to handle it. Again, this works rather well...

> But the issue you're having is fairly specific if it's just about
> ringspace. I think the dumbest fix is to just block in submit if you run
> out of per-ctx ringspace, and call it a day.

How does that not break the dma-fence rules? A job can publish its
finished fence after ARM; if the finished fence then waits on ring space
that may not free up in a reasonable amount of time, we have now broken
the dma-fence rules. My understanding is that any dma-fence must only
depend on other dma-fences; Christian seems to agree and NAK'd just
blocking if no space is available [1]. IMO this series ensures we don't
break the dma-fence rules by restricting how the finished fence can be
used.
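
In annotation terms it looks like this (a toy example with a made-up ring
structure, not code from this series): once the finished fence is
published, the rest of the submission path is a dma-fence signalling
critical section, and a wait on ring space inside it is exactly the
pattern the existing lockdep machinery is there to catch.

#include <linux/dma-fence.h>
#include <linux/wait.h>

/* Hypothetical ring with a waitqueue woken when space is reclaimed. */
struct toy_ring {
	wait_queue_head_t space_wq;
	int space;
};

static void toy_submit(struct toy_ring *ring, int needed)
{
	/*
	 * The job's finished fence is assumed to be published already,
	 * so everything below is a signalling critical section.
	 */
	bool cookie = dma_fence_begin_signalling();

	/*
	 * This wait is the problem: ring space is only freed when other
	 * fences signal, so fence completion now depends on something
	 * that is not itself an ordered dma-fence, and the annotation
	 * will (rightly) complain.
	 */
	wait_event(ring->space_wq, READ_ONCE(ring->space) >= needed);

	dma_fence_end_signalling(cookie);
}

Whether the driver could instead block earlier, before any fence is
published, is the part being debated above.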
> This notion that somehow the kernel is supposed to provide a bottomless
> queue of anything userspace submits simply doesn't hold up in reality
> (as much as userspace standards committees would like it to), and as
> long as it doesn't have a real-world perf impact it doesn't really
> matter why we end up blocking in the submit ioctl. It might also be a
> simple memory allocation that hits a snag in page reclaim.
>
> > > > > > And the reasons why it was rejected haven't changed.
> > > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > >
> > > > > Yes, TBH this was mostly to get discussion going on how we'd best
> > > > > tackle this problem while being able to reuse the scheduler for
> > > > > long-running workloads.
> > > > >
> > > > > I couldn't see any clear decision on your series, though, but one
> > > > > main difference I see is that this is intended for driver-internal
> > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > driver-private use.) This is by no means a way to try to tackle
> > > > > the indefinite fence problem.
> > > >
> > > > Well this was just my latest try to tackle this, but essentially
> > > > the problems are the same as with your approach: When we express
> > > > such operations as dma_fence there is always the chance that we
> > > > leak that somewhere.
> > > >
> > > > My approach of adding a flag noting that this operation is
> > > > dangerous and can't be synced with something memory management
> > > > depends on tried to contain this as much as possible, but Daniel
> > > > still pretty clearly rejected it (for good reasons I think).
> > > >
> > > > > We could ofc invent a completely different data-type that
> > > > > abstracts the synchronization the scheduler needs in the
> > > > > long-running case, or each driver could hack something up, like
> > > > > sleeping in the prepare_job() or run_job() callback for
> > > > > throttling, but those waits should still be annotated one way or
> > > > > another (and probably in a similar way across drivers) to make
> > > > > sure we don't do anything bad.
> > > > >
> > > > > So any suggestions as to what would be the better solution here
> > > > > would be appreciated.
> > > >
> > > > Mhm, do we really need the GPU scheduler for that?
> >
> > I think we need to solve this within the DRM scheduler one way or
> > another.
>
> Yeah so if we conclude that the queue really must be bottomless then I
> agree drm-sched should help sort out the mess. Because I'm guessing that
> every driver will have this issue. But that's a big if.
>
> I guess if we teach the drm scheduler that some jobs are fairly endless
> then maybe it wouldn't be too far-fetched to also teach it to wait for a
> previous one to finish (but not with the dma_fence that preempts, which
> we put into the dma_resv for memory management, but some other struct
> completion). The scheduler already has a concept of not stuffing too
> much stuff into the same queue after all, so this should fit?

See above; this is the exact same situation as spinning on flow
controlling the ring, and IMO it absolutely breaks the dma-fence rules.
IMO the correct solution is a DRM scheduler that doesn't export
dma-fences, which is exactly what this series does: if we try to, boom,
lockdep / WARN_ON blow up.

Matt

[1] https://patchwork.freedesktop.org/patch/525461/?series=114772&rev=2

> -Daniel
>
> > > > I mean in the 1 to 1 case you basically just need a component
> > > > which collects the dependencies as dma_fence and if all of them
> > > > are fulfilled schedules a work item.
> > > >
> > > > As long as the work item itself doesn't produce a dma_fence it can
> > > > then still just wait for other non-dma_fence dependencies.
> > > >
> > > > Then the work function could submit the work and wait for the
> > > > result.
> > > >
> > > > The work item would then pretty much represent what you want, you
> > > > can wait for it to finish and pass it along as a long-running
> > > > dependency.
> > > >
> > > > Maybe give it a funky name and wrap it up in a structure, but
> > > > that's basically it.
> > >
> > > This very much sounds like an i915_sw_fence for the dependency
> > > tracking and dma_fence_work for the actual work, although its
> > > completion fence is a dma_fence.
> >
> > Agree, this does sound too i915-ish; as stated below, one of the
> > mandates in Xe was to use the DRM scheduler. Beyond that, as someone
> > who wrote a submission backend in both the i915 and Xe, I love how the
> > DRM scheduler works (single entry point); it makes everything so much
> > easier.
> >
> > Matt
> >
> > > Although that goes against the whole idea that a condition for
> > > merging the xe driver would be that we implement some sort of
> > > minimal scaffolding for long-running workloads in the drm scheduler,
> > > and the thinking behind that is to avoid implementing intel-specific
> > > solutions like those...
> > >
> > > Thanks,
> > >
> > > Thomas
> > >
> > > > Regards,
> > > > Christian.
> > > >
> > > > > Thanks,
> > > > >
> > > > > Thomas
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
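
For completeness, the component Christian describes above could look
roughly like the sketch below; every name in it is made up, and as Thomas
notes, i915_sw_fence plus dma_fence_work are the existing analogues. The
idea: collect dma_fence dependencies, and only once they have all
signalled run a work item that is free to block on non-dma_fence things
and is waited on through a plain completion rather than an exported
fence.

#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct lr_work {
	struct work_struct work;
	atomic_t pending;		/* unresolved fence deps + 1 */
	struct completion done;		/* what LR consumers wait on */
	void (*run)(struct lr_work *);	/* driver submission, may block */
};

struct lr_dep {
	struct dma_fence_cb cb;
	struct lr_work *lw;
};

static void lr_work_fn(struct work_struct *w)
{
	struct lr_work *lw = container_of(w, struct lr_work, work);

	lw->run(lw);		/* may wait on non-dma_fence things */
	complete_all(&lw->done);
}

static void lr_dep_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct lr_dep *dep = container_of(cb, struct lr_dep, cb);

	if (atomic_dec_and_test(&dep->lw->pending))
		queue_work(system_unbound_wq, &dep->lw->work);
	kfree(dep);
}

static void lr_work_init(struct lr_work *lw, void (*run)(struct lr_work *))
{
	INIT_WORK(&lw->work, lr_work_fn);
	atomic_set(&lw->pending, 1);	/* dropped in lr_work_arm() */
	init_completion(&lw->done);
	lw->run = run;
}

static int lr_work_add_dependency(struct lr_work *lw, struct dma_fence *fence)
{
	struct lr_dep *dep = kmalloc(sizeof(*dep), GFP_KERNEL);
	int ret;

	if (!dep)
		return -ENOMEM;
	dep->lw = lw;

	atomic_inc(&lw->pending);
	ret = dma_fence_add_callback(fence, &dep->cb, lr_dep_cb);
	if (ret) {			/* -ENOENT: already signalled */
		atomic_dec(&lw->pending);
		kfree(dep);
	}
	return ret == -ENOENT ? 0 : ret;
}

static void lr_work_arm(struct lr_work *lw)
{
	/* Drop the initial reference; queue if all deps already signalled. */
	if (atomic_dec_and_test(&lw->pending))
		queue_work(system_unbound_wq, &lw->work);
}

Waiting on lw->done (or chaining it into the next lr_work) then replaces
waiting on an exported dma-fence, which is pretty much the "funky name
wrapped in a structure" Christian mentions, minus the drm_sched
integration this thread is arguing about.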