Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated

Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> · Wed, 3 May 2023 11:49:27 +0200

On Wed, 3 May 2023 10:47:43 +0200
Christian König <christian.koenig@xxxxxxx> wrote:

> Adding Luben as well.
> 
> Am 03.05.23 um 10:16 schrieb Boris Brezillon:
> > [SNIP]  
> >> To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().  
> > After the discussion I had with Matthew yesterday on IRC, I
> > realized there was no clear agreement on this. Matthew uses those 3
> > helpers in the Xe driver right now, and given he intends to use a
> > multi-threaded wq for its 1:1 schedulers run queue, there's no way he
> > can get away without calling drm_sched_{start,stop}().
> > drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
> > wondering if it wouldn't be preferable to add a ::resubmit_job() method
> > or extend the ::run_job() one to support the resubmit semantics, which,
> > AFAIU, is just about enforcing the job done fence (the one returned by
> > ::run_job()) doesn't transition from a signaled to an unsignaled state.
> >
> > But probably more important than providing a generic helper, we should
> > document the resubmit semantics (AKA, what should and/or shouldn't be
> > done with pending jobs when a recovery happens). Because forbidding
> > people to use a generic helper function doesn't give any guarantee that
> > they'll do the right thing when coding their own logic, unless we give
> > clues about what's considered right/wrong, and the current state of the
> > doc is pretty unclear in this regard.  
> 
> I should probably talk about the history of the re-submit feature a bit 
> more.
> 
> Basically AMD came up with re-submission as a cheap way of increasing 
> the reliability of GPU resets. Problem is that it turned into an 
> absolutely nightmare. We tried for the last 5 years or so to get that 
> stable and it's still crashing.
> 
> The first and most major problem is that the kernel doesn't even has the 
> information if re-submitting jobs is possible or not. For example a job 
> which has already been pushed to the hw could have grabbed a binary 
> semaphore and re-submitting it will just wait forever for the semaphore 
> to be released.

That's a valid point. Definitely something worth pointing out in the
doc IMHO.

> 
> The second problem is that the dma_fence semantics don't allow to ever 
> transit the state of a fence from signaled back to unsignaled. This 
> means that you can't re-use the hw fence and need to allocate a new one, 
> but since memory allocation is forbidden inside a reset handler as well 
> (YES we need to better document that part) you actually need to keep a 
> bunch of hw fences pre-allocated around to make this work. Amdgpu choose 
> to illegally re-use the hw fence instead which only works with quite 
> extreme hacks.

Hm, maybe I'm missing something, but I don't really see why we'd ever go
from signaled to unsignaled in the first place. If the parent fence is
signaled by the time we reach the resubmit function (because the GPU
keeps executing stuff after we called drm_sched_stop() and before we
actually shut it down), the ::run_job() hook would just return a
signaled fence at re-submission time, and the actual re-submission can
be skipped. If the job hasn't finished before the reset, the fence
should still be unsignaled, so no unsignaled -> signaled state
transition AFAICT.

> 
> The third problem is that the lifetime of the job object was actually 
> defined very well before we tried to use re-submission. Basically it's 
> just an intermediate state used between the IOCTL and pushing things to 
> the hw, introducing this re-submit feature completely messed that up and 
> cause quite a number of use after free errors in the past which are 
> again only solved by quite some hacks.

It's not clear to me what the resubmit logic has to do with that,
but I trust you.

> 
> What we should do in the GPU scheduler instead is the follow:
> 
> 1. Don't support re-submission at all!
>      Instead we can provide help to drivers to query which fences 
> (scheduler or hw) are still not signaled yet.
>      This can then be used to re-create hw state if (and only if!) the 
> driver knows what it's doing and can actually guarantee that this will work.
>      E.g. the case for XE where the kernel driver knows the contexts 
> which were not running at the time and can re-create their queues.

Okay, that basically means any queue that had in-flight jobs at
recovery time becomes unusable after that point, because skipping one
job might put you in an inconsistent state.

> 
> 2. We can provide both a wq to use for single threaded application as 
> well as start/stop semantics.

There's no risk calling stop+start even in the single-threaded wq
case, and the overhead should be negligible. If we want to keep things
simple, let's just say drivers should always call those :-).

>      It's just that the start/stop semantics should never touch what was 
> already submitted, but rather just make sure that we don't get any new 
> submissions.

So, something like that:

1. Call drm_sched_stop()
2. Iterate over all pending jobs, and flag the entity as dirty and/or
   kill it if any of the jobs were unfinished when the GPU was reset
3. Call drm_sched_start()