On Wed, May 03, 2023 at 10:47:43AM +0200, Christian König wrote:
> Adding Luben as well.
> 
> On 03.05.23 at 10:16, Boris Brezillon wrote:
> > [SNIP]
> > > To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
> > 
> > After the discussion I had with Matthew yesterday on IRC, I realized
> > there was no clear agreement on this. Matthew uses those 3 helpers in
> > the Xe driver right now, and given he intends to use a multi-threaded
> > wq for its 1:1 schedulers' run queues, there's no way he can get away
> > without calling drm_sched_{start,stop}(). drm_sched_resubmit_jobs() can
> > be open-coded in each driver, but I'm wondering if it wouldn't be
> > preferable to add a ::resubmit_job() method or extend the ::run_job()
> > one to support the resubmit semantics, which, AFAIU, is just about
> > enforcing that the job-done fence (the one returned by ::run_job())
> > doesn't transition from a signaled to an unsignaled state.
> > 
> > But probably more important than providing a generic helper, we should
> > document the resubmit semantics (AKA what should and/or shouldn't be
> > done with pending jobs when a recovery happens), because forbidding
> > people to use a generic helper function doesn't give any guarantee that
> > they'll do the right thing when coding their own logic, unless we give
> > clues about what's considered right/wrong, and the current state of the
> > doc is pretty unclear in this regard.
> 
> I should probably talk about the history of the re-submit feature a bit
> more.
> 
> Basically AMD came up with re-submission as a cheap way of increasing
> the reliability of GPU resets. Problem is that it turned into an
> absolute nightmare. We have tried for the last 5 years or so to get it
> stable and it's still crashing.
> 
> The first and most major problem is that the kernel doesn't even have
> the information whether re-submitting jobs is possible or not. For
> example, a job which has already been pushed to the hw could have
> grabbed a binary semaphore, and re-submitting it will just wait forever
> for the semaphore to be released.
> 
> The second problem is that the dma_fence semantics don't allow a fence
> to ever transition from signaled back to unsignaled. This means that
> you can't re-use the hw fence and need to allocate a new one, but since
> memory allocation is forbidden inside a reset handler as well (YES, we
> need to document that better) you actually need to keep a bunch of hw
> fences pre-allocated to make this work. Amdgpu chose to illegally
> re-use the hw fence instead, which only works with quite extreme hacks.
> 
> The third problem is that the lifetime of the job object was actually
> very well defined before we tried to use re-submission. Basically it's
> just an intermediate state used between the IOCTL and pushing things to
> the hw; introducing this re-submit feature completely messed that up
> and caused quite a number of use-after-free errors in the past, which
> are again only solved by quite some hacks.
> 
> What we should do in the GPU scheduler instead is the following:
> 
> 1. Don't support re-submission at all!
>    Instead we can provide help to drivers to query which fences
>    (scheduler or hw) are still not signaled yet.
>    This can then be used to re-create hw state if (and only if!) the
>    driver knows what it's doing and can actually guarantee that this
>    will work.
>    E.g. the case for XE, where the kernel driver knows which contexts
>    were not running at the time and can re-create their queues.
> 
> 2. We can provide both a wq to use for single-threaded applications as
>    well as start/stop semantics.
>    It's just that the start/stop semantics should never touch what was
>    already submitted, but rather just make sure that we don't get any
>    new submissions.

I pretty much agree with everything Christian has said here, and Xe aligns
with this. Let me explain what Xe does.

1. Entity hang (TDR timeout of a job on an entity, firmware notifies Xe
   that an entity hung, entity IOMMU CAT error, etc.):
   - No re-submission at all
   - Ban the entity
   - Notify the UMD
   - Clean up all pending jobs / fences

2. Entire GPU hang (worth mentioning that with good HW + KMD this *should*
   never happen):
   - Stop all schedulers (same as an entity in Xe because of the 1:1
     relationship)
   - Clean up any odd entity state related to communication with the
     firmware
   - Check whether an entity has a job that started but did not finish;
     if so, ban it (same mechanism as above)
   - Resubmit all jobs from good entities
   - Start all schedulers (same as an entity in Xe because of the 1:1
     relationship)

The implementation of this is in the following file [1]. Search for the
drm scheduler functions and you should be able to find the implementation
easily. (A rough sketch of the device-hang flow is also appended at the
bottom of this mail.)

If you want to use an ordered work queue to avoid the stop / start dance,
great, do that; in Xe the stop / start dance works. I have extensively
tested this and the flow is rock solid. Please trust me when I say this,
as I worked on reset flows in the i915 and fought bugs for years; I still
don't think it is solid in the i915, because we try to do resubmission +
crazy races. Because of that I was incredibly paranoid when I implemented
this and tested it extensively.

I'll post an updated version of my DRM scheduler series [2] on the list
soon with the WQ changes, plus whatever else is required for Xe.

So my take from this discussion is that maybe, with a little
documentation, we are good. Thoughts?

Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c
[2] https://patchwork.freedesktop.org/series/116055/

> Regards,
> Christian.
> 
> > 
> > Regards,
> > Boris
> 
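To make the device-hang flow above a bit more concrete, here is a rough,
untested sketch of what it looks like when written directly against
drm_sched_stop() / drm_sched_resubmit_jobs() / drm_sched_start().
Everything prefixed with my_ is a made-up placeholder for driver-specific
code and is not what [1] actually does; only the drm_sched_* calls are
real scheduler API.

#include <linux/list.h>
#include <drm/gpu_scheduler.h>

struct my_exec_queue {
        struct drm_gpu_scheduler sched;         /* 1:1 with its entity */
        struct list_head link;
};

struct my_device {
        struct list_head queues;                /* all my_exec_queue objects */
};

/* driver-specific helpers, declarations only */
void my_sanitize_fw_state(struct my_device *dev);
bool my_queue_started_but_unfinished(struct my_exec_queue *q);
void my_ban_queue(struct my_exec_queue *q);     /* ban + error fences + notify UMD */

static void my_device_reset(struct my_device *dev)
{
        struct my_exec_queue *q;

        /* 1:1 entity/scheduler, so stop every scheduler first */
        list_for_each_entry(q, &dev->queues, link)
                drm_sched_stop(&q->sched, NULL);

        /* clean up firmware / submission backend state */
        my_sanitize_fw_state(dev);

        list_for_each_entry(q, &dev->queues, link) {
                /*
                 * A job that reached the hardware but never signaled its
                 * hw fence cannot be trusted: ban the queue, error out its
                 * pending fences and notify the UMD.
                 */
                if (my_queue_started_but_unfinished(q)) {
                        my_ban_queue(q);
                        continue;
                }

                /* innocent queue: re-push its pending jobs to the hw */
                drm_sched_resubmit_jobs(&q->sched);
        }

        /* full recovery: re-arm timeouts and let jobs run again */
        list_for_each_entry(q, &dev->queues, link)
                drm_sched_start(&q->sched, true);
}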
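And for Christian's point 1 above (a helper to query which fences are
still not signaled), something like the sketch below could work. This is
completely untested and just meant to illustrate the idea; the callback
typedef is made up, but pending_list / job_list_lock / s_fence are the
existing drm_sched_job bookkeeping.

#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

typedef void (*unsignaled_job_cb)(struct drm_sched_job *job, void *data);

/* call @cb for every pending job whose hw fence has not signaled yet */
static void drm_sched_for_each_unsignaled_job(struct drm_gpu_scheduler *sched,
                                              unsignaled_job_cb cb, void *data)
{
        struct drm_sched_job *job;

        spin_lock(&sched->job_list_lock);
        list_for_each_entry(job, &sched->pending_list, list) {
                struct dma_fence *hw_fence = job->s_fence->parent;

                /* no hw fence yet, or hw fence still pending */
                if (!hw_fence || !dma_fence_is_signaled(hw_fence))
                        cb(job, data);
        }
        spin_unlock(&sched->job_list_lock);
}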