On Thu, May 04, 2023 at 01:07:29PM +0200, Christian König wrote:
> Am 04.05.23 um 06:54 schrieb Matthew Brost:
> > On Wed, May 03, 2023 at 10:47:43AM +0200, Christian König wrote:
> > > Adding Luben as well.
> > >
> > > Am 03.05.23 um 10:16 schrieb Boris Brezillon:
> > > > [SNIP]
> > > > > To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
> > > > After the discussion I had with Matthew yesterday on IRC, I
> > > > realized there was no clear agreement on this. Matthew uses those 3
> > > > helpers in the Xe driver right now, and given he intends to use a
> > > > multi-threaded wq for its 1:1 schedulers' run queues, there's no way he
> > > > can get away without calling drm_sched_{start,stop}().
> > > > drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
> > > > wondering if it wouldn't be preferable to add a ::resubmit_job() method
> > > > or extend the ::run_job() one to support the resubmit semantics, which,
> > > > AFAIU, is just about enforcing that the job done fence (the one returned
> > > > by ::run_job()) doesn't transition from a signaled to an unsignaled state.
> > > >
> > > > But probably more important than providing a generic helper, we should
> > > > document the resubmit semantics (AKA, what should and/or shouldn't be
> > > > done with pending jobs when a recovery happens). Because forbidding
> > > > people to use a generic helper function doesn't give any guarantee that
> > > > they'll do the right thing when coding their own logic, unless we give
> > > > clues about what's considered right/wrong, and the current state of the
> > > > doc is pretty unclear in this regard.
> > > I should probably talk about the history of the re-submit feature a bit
> > > more.
> > >
> > > Basically AMD came up with re-submission as a cheap way of increasing the
> > > reliability of GPU resets. Problem is that it turned into an absolute
> > > nightmare. We tried for the last 5 years or so to get that stable and it's
> > > still crashing.
> > >
> > > The first and most major problem is that the kernel doesn't even have the
> > > information whether re-submitting jobs is possible or not. For example, a
> > > job which has already been pushed to the hw could have grabbed a binary
> > > semaphore, and re-submitting it will just wait forever for the semaphore
> > > to be released.
> > >
> > > The second problem is that the dma_fence semantics don't allow the state
> > > of a fence to ever transition from signaled back to unsignaled. This means
> > > that you can't re-use the hw fence and need to allocate a new one, but
> > > since memory allocation is forbidden inside a reset handler as well (YES,
> > > we need to better document that part) you actually need to keep a bunch
> > > of hw fences pre-allocated around to make this work. Amdgpu chose to
> > > illegally re-use the hw fence instead, which only works with quite
> > > extreme hacks.
> > >
> > > The third problem is that the lifetime of the job object was actually
> > > defined very well before we tried to use re-submission. Basically it's
> > > just an intermediate state used between the IOCTL and pushing things to
> > > the hw; introducing this re-submit feature completely messed that up and
> > > caused quite a number of use-after-free errors in the past, which were
> > > again only solved by quite some hacks.
> > >
> > > What we should do in the GPU scheduler instead is the following:
> > >
> > > 1. Don't support re-submission at all!
> > >    Instead we can provide help to drivers to query which fences (scheduler
> > >    or hw) are still not signaled yet.
> > >    This can then be used to re-create hw state if (and only if!) the driver
> > >    knows what it's doing and can actually guarantee that this will work.
> > >    E.g. the case for XE, where the kernel driver knows the contexts which
> > >    were not running at the time and can re-create their queues.
> > >
> > > 2. We can provide both a wq for single-threaded use as well as start/stop
> > >    semantics.
> > >    It's just that the start/stop semantics should never touch what was
> > >    already submitted, but rather just make sure that we don't get any new
> > >    submissions.
> > >
> > I pretty much agree with everything Christian has said here and Xe
> > aligns with this. Let me explain what Xe does.
> >
> > 1. Entity hang (TDR timeout of a job on an entity, firmware notifies Xe
> >    that an entity hung, entity IOMMU CAT error, etc...):
> >    - No re-submission at all
> >    - ban the entity
> >    - notify the UMD
> >    - clean up all pending jobs / fences
> > 2. Entire GPU hang (worth mentioning that with good HW + KMD this *should*
> >    never happen):
> >    - stop all schedulers (same as an entity in Xe because 1 to 1)
> >    - clean up odd entity state related to communication with the
> >      firmware
> >    - check if an entity has a job that started but did not finish; if
> >      so, ban it (same mechanism as above)
> >    - resubmit all jobs from good entities
> As long as you can do this without creating new dma_fence objects for the hw
> submissions everything should be fine.
>

Yep, same fence.

> >    - start all schedulers (same as an entity in Xe because 1 to 1)
> >
> > The implementation for this is in the following file [1]. Search for the
> > drm scheduler functions and you should be able to find the implementation
> > easily.
> >
> > If you want to use an ordered work queue to avoid the stop / start dance,
> > great, do that; in Xe the stop / start dance works. I have extensively
> > tested this and the flow is rock solid, and please trust me when I say
> > this, as I worked on reset flows in the i915 and fought bugs for years; I
> > still don't think it is solid in the i915 because we try to do
> > resubmission + crazy races. Because of that I was incredibly paranoid when
> > I implemented this + tested it extensively.
> >
> > I'll post an updated version of my DRM scheduler series [2] on the list
> > soon for the WQ changes, plus whatever else is required for Xe.
> >
> > So my take from this discussion is that maybe, with a little documentation,
> > we are good. Thoughts?
>
> For XE what you describe above basically sounds perfectly fine to me.
>
> I strongly assume when you re-submit things you just tell your hw scheduler
> to pick up at the place it left off? E.g. set a read pointer and write
> pointer of a ring buffer appropriately?
>

Yep.

Matt

> Christian.
> >
> > Matt
> >
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c
> > [2] https://patchwork.freedesktop.org/series/116055/
> > > Regards,
> > > Christian.
> > >
> > > > Regards,
> > > >
> > > > Boris
>
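
A rough sketch of the full-GPU-hang flow described above, for illustration
only: it is not Xe code, every "my_*" name is a made-up placeholder, and only
drm_sched_stop() / drm_sched_start() (with their 2023 signatures) are real
scheduler entry points. The point it tries to capture is that the "resubmit"
step never allocates or re-arms a dma_fence; it only re-points the hw ring and
restarts the schedulers.

#include <drm/gpu_scheduler.h>
#include <linux/list.h>

/* Hypothetical driver structures: one scheduler per exec queue (1:1). */
struct my_exec_queue {
	struct drm_gpu_scheduler sched;
	struct list_head link;
};

struct my_gpu {
	struct list_head queues;
};

/* Placeholder hooks a real driver would implement. */
void my_reset_hw(struct my_gpu *gpu);
bool my_queue_has_started_unfinished_job(struct my_exec_queue *q);
void my_ban_queue(struct my_exec_queue *q);
void my_hw_resume_ring(struct my_exec_queue *q);

static void my_gpu_reset(struct my_gpu *gpu)
{
	struct my_exec_queue *q;

	/* Stop all schedulers so no new jobs reach the hw. */
	list_for_each_entry(q, &gpu->queues, link)
		drm_sched_stop(&q->sched, NULL);

	/* Reset the hw and clean up firmware communication state. */
	my_reset_hw(gpu);

	list_for_each_entry(q, &gpu->queues, link) {
		if (my_queue_has_started_unfinished_job(q)) {
			/* Ban the entity and notify the UMD, as in case 1. */
			my_ban_queue(q);
			continue;
		}

		/*
		 * "Resubmit" without touching any dma_fence: keep the
		 * existing hw fences and just tell the hw scheduler to pick
		 * up where it left off (restore ring read/write pointers).
		 */
		my_hw_resume_ring(q);

		/* Restart the scheduler so new submissions flow again. */
		drm_sched_start(&q->sched, true);
	}
}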