On Thu, Apr 06, 2023 at 08:32:59AM +0200, Daniel Vetter wrote:
> On Wed, Apr 05, 2023 at 11:58:44PM +0000, Matthew Brost wrote:
> > On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > > > >
> > > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > > On 04.04.23 at 14:54, Thomas Hellström wrote:
> > > > > > > > > Hi, Christian,
> > > > > > > > >
> > > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > > On 04.04.23 at 02:22, Matthew Brost wrote:
> > > > > > > > > > > From: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > > > > > > > >
> > > > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > > > completion waits, invent their own synchronization primitives, or
> > > > > > > > > > > internally use dma-fences that do not obey the cross-driver
> > > > > > > > > > > dma-fence protocol, but without any lockdep annotation all these
> > > > > > > > > > > approaches are error prone.
> > > > > > > > > > >
> > > > > > > > > > > So since for example the drm scheduler uses dma-fences, it is
> > > > > > > > > > > desirable for a driver to be able to use it for throttling and
> > > > > > > > > > > error handling also with internal dma-fences that do not obey the
> > > > > > > > > > > cross-driver dma-fence protocol.
> > > > > > > > > > >
> > > > > > > > > > > Introduce long-running completion fences in the form of dma-fences,
> > > > > > > > > > > and add lockdep annotation for them. In particular:
> > > > > > > > > > >
> > > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > > > > > * Introduce a new interface for adding callbacks, making the helper
> > > > > > > > > > > adding a callback sign off on being aware that the dma-fence may
> > > > > > > > > > > not complete anytime soon. Typically this will be the scheduler
> > > > > > > > > > > chaining a new long-running fence on another one.
> > > > > > > > > >
> > > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > > >
> > > > > >
> > > > > > I don't think this is quite the same; this explicitly enforces that we
> > > > > > don't break the dma-fence rules (in the path of memory allocations,
> > > > > > exported in any way), essentially this is just a SW sync point reusing
> > > > > > the dma-fence infrastructure for signaling / callbacks. I believe your
> > > > > > series tried to export these fences to user space (admittedly I haven't
> > > > > > fully read your series).
> > > > > >
> > > > > > In this use case we essentially just want to flow control the ring via
> > > > > > the drm scheduler + maintain a list of pending jobs so the TDR can be
> > > > > > used for cleanup if an LR entity encounters an error. To me this seems
> > > > > > perfectly reasonable but I know dma-fence rules are akin to a holy war.
> > > > > >
> > > > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > > > in the backend regardless of ring space, maintain a list of jobs pending
> > > > > > for cleanup after errors, and write a different cleanup path as now the
> > > > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > > > route, now all drivers are going to invent ways to handle LR jobs with
> > > > > > the DRM scheduler.
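The commit message's rule that a long-running fence must never be attached to a dma-resv object can be illustrated with a small userspace model. This is only a sketch of the idea, not the kernel API: `FENCE_FLAG_LR`, `mock_fence`, `mock_resv`, and `mock_resv_add_fence` are all made-up names for illustration.

```c
/* Userspace model: a fence carrying a hypothetical "long-running"
 * flag, and an attach helper that refuses to add such a fence to a
 * (mock) reservation object. None of these names are the real
 * kernel API. */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define FENCE_FLAG_LR (1u << 0) /* hypothetical long-running flag */

struct mock_fence {
	unsigned int flags;
	bool signaled;
};

struct mock_resv {
	struct mock_fence *fences[16];
	size_t count;
};

/* Long-running fences may never signal, so nothing that memory
 * management waits on may depend on them: reject the attach. */
static int mock_resv_add_fence(struct mock_resv *resv, struct mock_fence *f)
{
	if (f->flags & FENCE_FLAG_LR)
		return -EINVAL;
	if (resv->count >= 16)
		return -ENOMEM;
	resv->fences[resv->count++] = f;
	return 0;
}
```

In the actual series the rejection would presumably be a lockdep splat / WARN rather than a quiet error code, but the containment principle is the same.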
> > > > > >
> > > > > > This solution is pretty clear: mark the scheduler as LR, and don't
> > > > > > export any fences from the scheduler. If you try to export these
> > > > > > fences, a blow up happens.
> > > > >
> > > > > The problem is if you mix things up. Like for resets you need all the
> > > > > schedulers on an engine/set-of-engines to quiesce or things get
> > > > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > > > dma_fence guarantees are right out the window.
> > > > >
> > > >
> > > > Right, a GT reset on Xe is:
> > > >
> > > > Stop all schedulers
> > > > Do a reset
> > > > Ban any schedulers which we think caused the GT reset
> > > > Resubmit all schedulers which we think were good
> > > > Restart all schedulers
> > > >
> > > > None of this flow depends on LR dma-fences; all of this uses the DRM
> > > > sched infrastructure and works very well compared to the i915. Rewriting
> > > > all of this with a driver-specific implementation is what we are trying
> > > > to avoid.
> > > >
> > > > Similarly if an LR entity hangs on its own (not a GT reset, rather the
> > > > firmware does the reset for us) we use all the DRM scheduler
> > > > infrastructure to handle this. Again this works rather well...
> > >
> > > Yeah this is why I don't think duplicating everything that long-running
> > > jobs need makes any sense. iow I agree with you.
> > >
> >
> > Glad we agree.
> >
> > > > > But the issue you're having is fairly specific if it's just about
> > > > > ringspace. I think the dumbest fix is to just block in submit if you run
> > > > > out of per-ctx ringspace, and call it a day. This notion that somehow the
> > > >
> > > > How does that not break the dma-fence rules? A job can publish its
> > > > finished fence after arm; if the finished fence waits on ring space that
> > > > may not free up in a reasonable amount of time, we have now broken the
> > > > dma-fence rules.
> > > > My understanding is any dma-fence must only wait
> > > > on other dma-fences; Christian seems to agree and NAK'd just blocking if
> > > > no space is available [1]. IMO this series ensures we don't break
> > > > dma-fence rules by restricting how the finished fence can be used.
> > >
> > > Oh I meant in the submit ioctl, _before_ you even call
> > > drm_sched_job_arm(). It's ok to block in there indefinitely.
> > >
> >
> > Ok, but how do we determine if there is ring space? Wait on an
> > xe_hw_fence, which is a dma-fence. We just move a wait from the scheduler
> > to the exec IOCTL and I really fail to see the point of that.
>
> Fill in anything you need into the ring at ioctl time, but don't update
> the tail pointers? If there's no space, then EWOULDBLOCK.
>

Ok, I can maybe buy this approach and this is fairly easy to do. I'm going
to do this for LR jobs only though (non-LR jobs will still flow control on
the ring via the scheduler + write the ring in run_job). A bit of duplicate
code but I can live with this.

> > > > > kernel is supposed to provide a bottomless queue of anything userspace
> > > > > submits simply doesn't hold up in reality (as much as userspace standards
> > > > > committees would like it to), and as long as it doesn't have a real-world
> > > > > perf impact it doesn't really matter why we end up blocking in the submit
> > > > > ioctl. It might also be a simple memory allocation that hits a snag in
> > > > > page reclaim.
> > > > >
> > > > > > > > > >
> > > > > > > > > > And the reasons why it was rejected haven't changed.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Christian.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes, TBH this was mostly to get discussion going on how we'd best
> > > > > > > > > tackle this problem while being able to reuse the scheduler for
> > > > > > > > > long-running workloads.
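Daniel's suggestion above (write the ring at ioctl time but only publish the new tail when there is space, otherwise fail with EWOULDBLOCK before any fence exists) can be sketched as a userspace model. The ring layout and all names (`mock_ring`, `mock_submit`, `RING_SIZE`) are made up for illustration; this is not the Xe implementation.

```c
/* Userspace sketch: at submit-ioctl time, copy the job into the ring
 * but only publish the new tail if there is space; otherwise fail
 * with -EWOULDBLOCK. Blocking or failing here is fine because this
 * happens before drm_sched_job_arm(), i.e. before any dma-fence is
 * published. */
#include <errno.h>
#include <stdint.h>

#define RING_SIZE 64u /* bytes, power of two for cheap wrapping */

struct mock_ring {
	uint8_t data[RING_SIZE];
	uint32_t head; /* consumed by "hardware" */
	uint32_t tail; /* published by submission */
};

static uint32_t ring_space(const struct mock_ring *r)
{
	/* One byte is kept unused to distinguish full from empty. */
	return RING_SIZE - 1 - ((r->tail - r->head) & (RING_SIZE - 1));
}

static int mock_submit(struct mock_ring *r, const void *cmds, uint32_t len)
{
	const uint8_t *src = cmds;
	uint32_t i;

	if (ring_space(r) < len)
		return -EWOULDBLOCK; /* nothing written, nothing armed */

	for (i = 0; i < len; i++)
		r->data[(r->tail + i) & (RING_SIZE - 1)] = src[i];
	r->tail = (r->tail + len) & (RING_SIZE - 1); /* publish */
	return 0;
}
```

The key property is that the failure (or a sleep, if blocking is preferred) happens entirely in the ioctl path, so no dma-fence ever depends on ring space freeing up.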
> > > > > > > > >
> > > > > > > > > I couldn't see any clear decision on your series, though, but one
> > > > > > > > > main difference I see is that this is intended for driver-internal
> > > > > > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > > > driver-private use.) This is by no means a way to try to tackle the
> > > > > > > > > indefinite fence problem.
> > > > > > > >
> > > > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > > > problems are the same as with your approach: when we express such
> > > > > > > > operations as dma_fences there is always the chance that we leak them
> > > > > > > > somewhere.
> > > > > > > >
> > > > > > > > My approach of adding a flag noting that this operation is dangerous
> > > > > > > > and can't be synced with something memory management depends on tried
> > > > > > > > to contain this as much as possible, but Daniel still pretty clearly
> > > > > > > > rejected it (for good reasons I think).
> > > > > > > >
> > > > > > > > >
> > > > > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > > > > should still be annotated one way or another
> > > > > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > > > > do anything bad.
> > > > > > > > >
> > > > > > > > > So any suggestions as to what would be the better solution here
> > > > > > > > > would be appreciated.
> > > > > > > >
> > > > > > > > Mhm, do we really need the GPU scheduler for that?
> > > > > > > >
> > > > > >
> > > > > > I think we need to solve this within the DRM scheduler one way or
> > > > > > another.
> > > > >
> > > > > Yeah so if we conclude that the queue really must be bottomless then I
> > > > > agree drm-sched should help sort out the mess. Because I'm guessing
> > > > > that every driver will have this issue. But that's a big if.
> > > > >
> > > > > I guess if we teach the drm scheduler that some jobs are fairly endless
> > > > > then maybe it wouldn't be too far-fetched to also teach it to wait for a
> > > > > previous one to finish (but not with the dma_fence that preempts, which
> > > > > we put into the dma_resv for memory management, but some other struct
> > > > > completion). The scheduler already has a concept of not stuffing too much
> > > > > stuff into the same queue after all, so this should fit?
> > > >
> > > > See above, exact same situation as spinning on flow controlling the
> > > > ring; this IMO absolutely breaks the dma-fence rules. IMO the correct
> > > > solution is to have a DRM scheduler that doesn't export dma-fences,
> > > > which is exactly what this series does: if we try to, boom, lockdep /
> > > > warn-on blow up.
> > >
> > > I don't think it's impossible to do this correctly, but definitely very,
> > > very hard. Which is why neither Christian nor me like the idea :-)
> > >
> > > Essentially you'd have to make sure that any indefinite wait will still
> > > react to drm_sched_job, so that you're not holding up a gt reset or
> > > anything like that, but only ever hold up forward progress for this
> > > specific scheduler/drm_sched_entity. Which you can do as long as (and
> > > again, another hugely tricky detail) you still obey the preempt-ctx
> > > dma_fence and manage to preempt the underlying long-running ctx even when
> > > the drm/sched is stuck waiting for an indefinite fence (like waiting for
> > > ringspace or something like that).
> > >
> > > So I don't think it's impossible, but very far away from "a good idea" :-)
> > >
> > > Hence the proposal to bail out of this entire mess by throwing EWOULDBLOCK
> > > back to userspace directly from the ioctl function, where you still can do
> > > that without breaking any dma_fence rules. Or if it's not a case that
> > > matters in practice, simply block in the ioctl handler instead of
> > > returning EWOULDBLOCK.
> >
> > Returning EWOULDBLOCK on a full ring is reasonable I guess, but again
> > without returning a fence in run_job the TDR can't be used for cleanup
> > on LR entities, which will result in duplicate code open coded by each
> > driver. Same goes for checking ring full in exec.
> >
> > How about this:
> > - We mark xe_hw_fence as LR to ensure it can't be exported, return this
> >   in run_job which gives flow control on the ring + the handy TDR
> >   functionality
> > - When a scheduler is marked as LR, we do not generate finished fences
> >   for jobs
> > - We heavily, heavily scrutinize any usage of the LR fence flag going
> >   forward
> > - We document all of this very loudly
> >
> > Is this reasonable?
>
> I'm not seeing why it's needed? If you're worried about TDR duplication
> then I think we need something else. Because for long-running ctx we never
> have a timeout of the ctx itself (by definition). The only thing we time
> out on is the preempt, so I guess what could be done:
> - have the minimal scaffolding to support the preempt-ctx fence in
>   drm_sched_entity
> - when the preempt ctx fence enables signalling a) callback to the driver
>   to start the preempt (which should signal the fence) b) start a timer,
>   which should catch if the preempt takes too long

The GuC does this for us, no need.

> - if the timeout fires first (importantly we only enable that when the
>   preemption is triggered, not by default), kick off the normal drm/sched
>   tdr flow.
>   Maybe needs some adjustments in case there's different handling
>   needed for when a preemption times out compared to just a job timing out.

The GuC informs us of this and yeah, we kick the TDR.

> I think this might make sense for sharing timeout handling needs for
> long-running contexts. What you proposed I don't really follow why it
> should exist, because that kind of timeout handling should not ever happen
> for long-running jobs.

We just use the TDR as a single cleanup point for all entities. In the
case of an LR entity this occurs if the GuC issues a reset on the entity
(likely a preempt timeout), the entity takes a non-recoverable page fault,
or the entity is the root cause of a GT reset. The pending job list here is
handy; that's why I wanted to return xe_hw_fence in run_job, to hold the
job in the scheduler's pending list. The TDR won't fire if the pending
list is empty.

Based on what you are saying, my new proposal:

1. Write the ring in exec for LR jobs, return -EWOULDBLOCK if no space
   in the ring
2. Return NULL in run_job (or alternatively a signaled fence)
3. Have a special-cased cleanup flow for LR entities (not the TDR, rather
   likely a different worker owned by the xe_engine)
4. Document this so that this is how drivers are expected to do LR
   workloads plus the DRM scheduler

1 & 3 are pretty clear duplicates of code but I can live with that if I
can get an ack on the plan + move on. The coding will not be all that
difficult either; I am just being difficult. It is probably 100ish lines
of code in the end.

What do you think Daniel, does this seem like a reasonable plan?

Matt

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
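[Editor's sketch] The four-step plan above can be modelled in a few lines of userspace C. Everything here is hypothetical and only mirrors the shape of the plan: `lr_entity`, `lr_run_job`, and `lr_cleanup` are invented names, not Xe or drm/sched API.

```c
/* Userspace model of the proposed LR flow: run_job returns no fence
 * for LR jobs (step 2), the entity keeps its own pending list, and a
 * driver-owned cleanup path, not the TDR, tears everything down on
 * error (step 3). All names are made up for illustration. */
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

struct mock_job {
	bool cleaned;
};

struct lr_entity {
	struct mock_job *pending[MAX_PENDING];
	size_t count;
	bool banned;
};

/* Step 2: no scheduler fence for LR jobs, so nothing can be exported;
 * the job is only tracked on the entity's own pending list. */
static void *lr_run_job(struct lr_entity *e, struct mock_job *job)
{
	if (e->count < MAX_PENDING)
		e->pending[e->count++] = job;
	return NULL; /* nothing for the scheduler to wait on */
}

/* Step 3: driver-owned cleanup, kicked on a firmware-initiated reset,
 * an unrecoverable page fault, or a GT reset blamed on this entity. */
static void lr_cleanup(struct lr_entity *e)
{
	size_t i;

	e->banned = true;
	for (i = 0; i < e->count; i++)
		e->pending[i]->cleaned = true;
	e->count = 0;
}
```

The duplication Matt concedes in steps 1 & 3 is visible here: the pending list and the cleanup walk are exactly what the scheduler's TDR path would otherwise provide for free.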