On Thu, Apr 06, 2023 at 08:32:59AM +0200, Daniel Vetter wrote:
> On Wed, Apr 05, 2023 at 11:58:44PM +0000, Matthew Brost wrote:
> > On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > > > >
> > > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > > On 04.04.23 at 14:54, Thomas Hellström wrote:
> > > > > > > > > Hi, Christian,
> > > > > > > > >
> > > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > > On 04.04.23 at 02:22, Matthew Brost wrote:
> > > > > > > > > > > From: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > > > > > > > >
> > > > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > > > completion waits, invent their own synchronization primitives, or
> > > > > > > > > > > internally use dma-fences that do not obey the cross-driver
> > > > > > > > > > > dma-fence protocol, but without any lockdep annotation all these
> > > > > > > > > > > approaches are error prone.
> > > > > > > > > > >
> > > > > > > > > > > So since for example the drm scheduler uses dma-fences, it is
> > > > > > > > > > > desirable for a driver to be able to use it for throttling and
> > > > > > > > > > > error handling also with internal dma-fences that do not obey the
> > > > > > > > > > > cross-driver dma-fence protocol.
> > > > > > > > > > >
> > > > > > > > > > > Introduce long-running completion fences in the form of dma-fences,
> > > > > > > > > > > and add lockdep annotation for them. In particular:
> > > > > > > > > > >
> > > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > > > > > * Introduce a new interface for adding callbacks, making the helper
> > > > > > > > > > > adding a callback sign off on being aware that the dma-fence may
> > > > > > > > > > > not complete anytime soon. Typically this will be the scheduler
> > > > > > > > > > > chaining a new long-running fence on another one.
> > > > > > > > > >
> > > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > > >
> > > > > >
> > > > > > I don't think this is quite the same; this explicitly enforces that we
> > > > > > don't break the dma-fence rules (in the path of memory allocations,
> > > > > > exported in any way), essentially this is just a SW sync point reusing
> > > > > > the dma-fence infrastructure for signaling / callbacks. I believe your
> > > > > > series tried to export these fences to user space (admittedly I haven't
> > > > > > fully read your series).
> > > > > >
> > > > > > In this use case we essentially just want to flow control the ring via
> > > > > > the drm scheduler + maintain a list of pending jobs so the TDR can be
> > > > > > used for cleanup if an LR entity encounters an error. To me this seems
> > > > > > perfectly reasonable but I know dma-fence rules are akin to a holy war.
> > > > > >
> > > > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > > > in the backend regardless of ring space, maintain a list of jobs pending
> > > > > > for cleanup after errors, and write a different cleanup path as now the
> > > > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > > > route, now all drivers are going to invent ways to handle LR jobs with
> > > > > > the DRM scheduler.
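The commit message's rule that a long-running fence must never be attached to a dma-resv object can be illustrated with a small userspace model. This is only a sketch of the idea, not the kernel API: `FENCE_FLAG_LR`, `mock_fence`, `mock_resv`, and `mock_resv_add_fence` are all made-up names for illustration.

```c
/* Userspace model: a fence carrying a hypothetical "long-running"
 * flag, and an attach helper that refuses to add such a fence to a
 * (mock) reservation object. None of these names are the real
 * kernel API. */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define FENCE_FLAG_LR (1u << 0) /* hypothetical long-running flag */

struct mock_fence {
	unsigned int flags;
	bool signaled;
};

struct mock_resv {
	struct mock_fence *fences[16];
	size_t count;
};

/* Long-running fences may never signal, so nothing that memory
 * management waits on may depend on them: reject the attach. */
static int mock_resv_add_fence(struct mock_resv *resv, struct mock_fence *f)
{
	if (f->flags & FENCE_FLAG_LR)
		return -EINVAL;
	if (resv->count >= 16)
		return -ENOMEM;
	resv->fences[resv->count++] = f;
	return 0;
}
```

In the actual series the rejection would presumably be a lockdep splat / WARN rather than a quiet error code, but the containment principle is the same.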
> > > > > >
> > > > > > This solution is pretty clear: mark the scheduler as LR, and don't
> > > > > > export any fences from the scheduler. If you try to export these
> > > > > > fences, a blow up happens.
> > > > >
> > > > > The problem is if you mix things up. Like for resets you need all the
> > > > > schedulers on an engine/set-of-engines to quiesce or things get
> > > > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > > > dma_fence guarantees are right out the window.
> > > > >
> > > >
> > > > Right, a GT reset on Xe is:
> > > >
> > > > Stop all schedulers
> > > > Do a reset
> > > > Ban any schedulers which we think caused the GT reset
> > > > Resubmit all schedulers which we think were good
> > > > Restart all schedulers
> > > >
> > > > None of this flow depends on LR dma-fences; all of this uses the DRM
> > > > sched infrastructure and works very well compared to the i915. Rewriting
> > > > all of this with a driver-specific implementation is what we are trying
> > > > to avoid.
> > > >
> > > > Similarly if an LR entity hangs on its own (not a GT reset, rather the
> > > > firmware does the reset for us) we use all the DRM scheduler
> > > > infrastructure to handle this. Again this works rather well...
> > >
> > > Yeah this is why I don't think duplicating everything that long-running
> > > jobs need makes any sense. iow I agree with you.
> > >
> >
> > Glad we agree.
> >
> > > > > But the issue you're having is fairly specific if it's just about
> > > > > ringspace. I think the dumbest fix is to just block in submit if you run
> > > > > out of per-ctx ringspace, and call it a day. This notion that somehow the
> > > >
> > > > How does that not break the dma-fence rules? A job can publish its
> > > > finished fence after arm; if the finished fence waits on ring space that
> > > > may not free up in a reasonable amount of time, we have now broken the
> > > > dma-fence rules.
> > > > My understanding is any dma-fence must only wait
> > > > on other dma-fences; Christian seems to agree and NAK'd just blocking if
> > > > no space is available [1]. IMO this series ensures we don't break
> > > > dma-fence rules by restricting how the finished fence can be used.
> > >
> > > Oh I meant in the submit ioctl, _before_ you even call
> > > drm_sched_job_arm(). It's ok to block in there indefinitely.
> > >
> >
> > Ok, but how do we determine if there is ring space? Wait on an
> > xe_hw_fence, which is a dma-fence. We just move a wait from the scheduler
> > to the exec IOCTL and I really fail to see the point of that.
>
> Fill in anything you need into the ring at ioctl time, but don't update
> the tail pointers? If there's no space, then EWOULDBLOCK.
>

Ok, I can maybe buy this approach and this is fairly easy to do. I'm going
to do this for LR jobs only though (non-LR jobs will still flow control on
the ring via the scheduler + write the ring in run_job). A bit of duplicate
code but I can live with this.

> > > > > kernel is supposed to provide a bottomless queue of anything userspace
> > > > > submits simply doesn't hold up in reality (as much as userspace standards
> > > > > committees would like it to), and as long as it doesn't have a real-world
> > > > > perf impact it doesn't really matter why we end up blocking in the submit
> > > > > ioctl. It might also be a simple memory allocation that hits a snag in
> > > > > page reclaim.
> > > > >
> > > > > > > > > >
> > > > > > > > > > And the reasons why it was rejected haven't changed.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Christian.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes, TBH this was mostly to get discussion going on how we'd best
> > > > > > > > > tackle this problem while being able to reuse the scheduler for
> > > > > > > > > long-running workloads.
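Daniel's suggestion above (write the ring at ioctl time but only publish the new tail when there is space, otherwise fail with EWOULDBLOCK before any fence exists) can be sketched as a userspace model. The ring layout and all names (`mock_ring`, `mock_submit`, `RING_SIZE`) are made up for illustration; this is not the Xe implementation.

```c
/* Userspace sketch: at submit-ioctl time, copy the job into the ring
 * but only publish the new tail if there is space; otherwise fail
 * with -EWOULDBLOCK. Blocking or failing here is fine because this
 * happens before drm_sched_job_arm(), i.e. before any dma-fence is
 * published. */
#include <errno.h>
#include <stdint.h>

#define RING_SIZE 64u /* bytes, power of two for cheap wrapping */

struct mock_ring {
	uint8_t data[RING_SIZE];
	uint32_t head; /* consumed by "hardware" */
	uint32_t tail; /* published by submission */
};

static uint32_t ring_space(const struct mock_ring *r)
{
	/* One byte is kept unused to distinguish full from empty. */
	return RING_SIZE - 1 - ((r->tail - r->head) & (RING_SIZE - 1));
}

static int mock_submit(struct mock_ring *r, const void *cmds, uint32_t len)
{
	const uint8_t *src = cmds;
	uint32_t i;

	if (ring_space(r) < len)
		return -EWOULDBLOCK; /* nothing written, nothing armed */

	for (i = 0; i < len; i++)
		r->data[(r->tail + i) & (RING_SIZE - 1)] = src[i];
	r->tail = (r->tail + len) & (RING_SIZE - 1); /* publish */
	return 0;
}
```

The key property is that the failure (or a sleep, if blocking is preferred) happens entirely in the ioctl path, so no dma-fence ever depends on ring space freeing up.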
> > > > > > > > >
> > > > > > > > > I couldn't see any clear decision on your series, though, but one
> > > > > > > > > main difference I see is that this is intended for driver-internal
> > > > > > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > > > driver-private use.) This is by no means a way to try to tackle the
> > > > > > > > > indefinite fence problem.
> > > > > > > >
> > > > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > > > problems are the same as with your approach: when we express such
> > > > > > > > operations as dma_fences there is always the chance that we leak them
> > > > > > > > somewhere.
> > > > > > > >
> > > > > > > > My approach of adding a flag noting that this operation is dangerous
> > > > > > > > and can't be synced with something memory management depends on tried
> > > > > > > > to contain this as much as possible, but Daniel still pretty clearly
> > > > > > > > rejected it (for good reasons I think).
> > > > > > > >
> > > > > > > > >
> > > > > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > > > > should still be annotated one way or another
> > > > > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > > > > do anything bad.
> > > > > > > > >
> > > > > > > > > So any suggestions as to what would be the better solution here
> > > > > > > > > would be appreciated.
> > > > > > > >
> > > > > > > > Mhm, do we really need the GPU scheduler for that?
> > > > > > > >
> > > > > >
> > > > > > I think we need to solve this within the DRM scheduler one way or
> > > > > > another.
> > > > >
> > > > > Yeah so if we conclude that the queue really must be bottomless then I
> > > > > agree drm-sched should help sort out the mess. Because I'm guessing
> > > > > that every driver will have this issue. But that's a big if.
> > > > >
> > > > > I guess if we teach the drm scheduler that some jobs are fairly endless
> > > > > then maybe it wouldn't be too far-fetched to also teach it to wait for a
> > > > > previous one to finish (but not with the dma_fence that preempts, which
> > > > > we put into the dma_resv for memory management, but some other struct
> > > > > completion). The scheduler already has a concept of not stuffing too much
> > > > > stuff into the same queue after all, so this should fit?
> > > >
> > > > See above, exact same situation as spinning on flow controlling the
> > > > ring; this IMO absolutely breaks the dma-fence rules. IMO the correct
> > > > solution is to have a DRM scheduler that doesn't export dma-fences,
> > > > which is exactly what this series does: if we try to, boom, lockdep /
> > > > warn-on blow up.
> > >
> > > I don't think it's impossible to do this correctly, but definitely very,
> > > very hard. Which is why neither Christian nor me like the idea :-)
> > >
> > > Essentially you'd have to make sure that any indefinite wait will still
> > > react to drm_sched_job, so that you're not holding up a gt reset or
> > > anything like that, but only ever hold up forward progress for this
> > > specific scheduler/drm_sched_entity. Which you can do as long as (and
> > > again, another hugely tricky detail) you still obey the preempt-ctx
> > > dma_fence and manage to preempt the underlying long-running ctx even when
> > > the drm/sched is stuck waiting for an indefinite fence (like waiting for
> > > ringspace or something like that).
> > >
> > > So I don't think it's impossible, but very far away from "a good idea" :-)
> > >
> > > Hence the proposal to bail out of this entire mess by throwing EWOULDBLOCK
> > > back to userspace directly from the ioctl function, where you still can do
> > > that without breaking any dma_fence rules. Or if it's not a case that
> > > matters in practice, simply block in the ioctl handler instead of
> > > returning EWOULDBLOCK.
> >
> > Returning EWOULDBLOCK on a full ring is reasonable I guess, but again
> > without returning a fence in run_job the TDR can't be used for cleanup
> > on LR entities, which will result in duplicate code open coded by each
> > driver. Same goes for checking ring full in exec.
> >
> > How about this:
> > - We mark xe_hw_fence as LR to ensure it can't be exported, return this
> >   in run_job which gives flow control on the ring + the handy TDR
> >   functionality
> > - When a scheduler is marked as LR, we do not generate finished fences
> >   for jobs
> > - We heavily, heavily scrutinize any usage of the LR fence flag going
> >   forward
> > - We document all of this very loudly
> >
> > Is this reasonable?
>
> I'm not seeing why it's needed? If you're worried about TDR duplication
> then I think we need something else. Because for long-running ctx we never
> have a timeout of the ctx itself (by definition). The only thing we time
> out on is the preempt, so I guess what could be done:
> - have the minimal scaffolding to support the preempt-ctx fence in
>   drm_sched_entity
> - when the preempt ctx fence enables signalling a) callback to the driver
>   to start the preempt (which should signal the fence) b) start a timer,
>   which should catch if the preempt takes too long

The GuC does this for us, no need.

> - if the timeout fires first (importantly we only enable that when the
>   preemption is triggered, not by default), kick off the normal drm/sched
>   tdr flow.
>   Maybe needs some adjustments in case there's different handling
>   needed for when a preemption times out compared to just a job timing out.

The GuC informs us of this and yeah, we kick the TDR.

> I think this might make sense for sharing timeout handling needs for
> long-running contexts. What you proposed I don't really follow why it
> should exist, because that kind of timeout handling should not ever happen
> for long-running jobs.

We just use the TDR as a single cleanup point for all entities. In the
case of an LR entity this occurs if the GuC issues a reset on the entity
(likely a preempt timeout), the entity takes a non-recoverable page fault,
or the entity is the root cause of a GT reset. The pending job list here is
handy; that's why I wanted to return xe_hw_fence in run_job, to hold the
job in the scheduler's pending list. The TDR won't fire if the pending
list is empty.

Based on what you are saying, my new proposal:

1. Write the ring in exec for LR jobs, return -EWOULDBLOCK if no space
   in the ring
2. Return NULL in run_job (or alternatively a signaled fence)
3. Have a special-cased cleanup flow for LR entities (not the TDR, rather
   likely a different worker owned by the xe_engine)
4. Document this so that this is how drivers are expected to do LR
   workloads plus the DRM scheduler

1 & 3 are pretty clear duplicates of code but I can live with that if I
can get an ack on the plan + move on. The coding will not be all that
difficult either; I am just being difficult. It is probably 100ish lines
of code in the end.

What do you think Daniel, does this seem like a reasonable plan?

Matt

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
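[Editor's sketch] The four-step plan above can be modelled in a few lines of userspace C. Everything here is hypothetical and only mirrors the shape of the plan: `lr_entity`, `lr_run_job`, and `lr_cleanup` are invented names, not Xe or drm/sched API.

```c
/* Userspace model of the proposed LR flow: run_job returns no fence
 * for LR jobs (step 2), the entity keeps its own pending list, and a
 * driver-owned cleanup path, not the TDR, tears everything down on
 * error (step 3). All names are made up for illustration. */
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

struct mock_job {
	bool cleaned;
};

struct lr_entity {
	struct mock_job *pending[MAX_PENDING];
	size_t count;
	bool banned;
};

/* Step 2: no scheduler fence for LR jobs, so nothing can be exported;
 * the job is only tracked on the entity's own pending list. */
static void *lr_run_job(struct lr_entity *e, struct mock_job *job)
{
	if (e->count < MAX_PENDING)
		e->pending[e->count++] = job;
	return NULL; /* nothing for the scheduler to wait on */
}

/* Step 3: driver-owned cleanup, kicked on a firmware-initiated reset,
 * an unrecoverable page fault, or a GT reset blamed on this entity. */
static void lr_cleanup(struct lr_entity *e)
{
	size_t i;

	e->banned = true;
	for (i = 0; i < e->count; i++)
		e->pending[i]->cleaned = true;
	e->count = 0;
}
```

The duplication Matt concedes in steps 1 & 3 is visible here: the pending list and the cleanup walk are exactly what the scheduler's TDR path would otherwise provide for free.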