Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated

Christian König <christian.koenig@xxxxxxx> · Thu, 4 May 2023 13:07:29 +0200

Am 04.05.23 um 06:54 schrieb Matthew Brost:
On Wed, May 03, 2023 at 10:47:43AM +0200, Christian König wrote:
Adding Luben as well.

Am 03.05.23 um 10:16 schrieb Boris Brezillon:
[SNIP]
To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
After the discussion I had with Matthew yesterday on IRC, I
realized there was no clear agreement on this. Matthew uses those 3
helpers in the Xe driver right now, and given he intends to use a
multi-threaded wq for its 1:1 schedulers run queue, there's no way he
can get away without calling drm_sched_{start,stop}().
drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
wondering if it wouldn't be preferable to add a ::resubmit_job() method
or extend the ::run_job() one to support the resubmit semantics, which,
AFAIU, is just about enforcing the job done fence (the one returned by
::run_job()) doesn't transition from a signaled to an unsignaled state.

But probably more important than providing a generic helper, we should
document the resubmit semantics (AKA, what should and/or shouldn't be
done with pending jobs when a recovery happens). Because forbidding
people to use a generic helper function doesn't give any guarantee that
they'll do the right thing when coding their own logic, unless we give
clues about what's considered right/wrong, and the current state of the
doc is pretty unclear in this regard.
I should probably talk about the history of the re-submit feature a bit
more.

Basically AMD came up with re-submission as a cheap way of increasing the
reliability of GPU resets. Problem is that it turned into an absolutely
nightmare. We tried for the last 5 years or so to get that stable and it's
still crashing.

The first and most major problem is that the kernel doesn't even has the
information if re-submitting jobs is possible or not. For example a job
which has already been pushed to the hw could have grabbed a binary
semaphore and re-submitting it will just wait forever for the semaphore to
be released.

The second problem is that the dma_fence semantics don't allow to ever
transit the state of a fence from signaled back to unsignaled. This means
that you can't re-use the hw fence and need to allocate a new one, but since
memory allocation is forbidden inside a reset handler as well (YES we need
to better document that part) you actually need to keep a bunch of hw fences
pre-allocated around to make this work. Amdgpu choose to illegally re-use
the hw fence instead which only works with quite extreme hacks.

The third problem is that the lifetime of the job object was actually
defined very well before we tried to use re-submission. Basically it's just
an intermediate state used between the IOCTL and pushing things to the hw,
introducing this re-submit feature completely messed that up and cause quite
a number of use after free errors in the past which are again only solved by
quite some hacks.

What we should do in the GPU scheduler instead is the follow:

1. Don't support re-submission at all!
     Instead we can provide help to drivers to query which fences (scheduler
or hw) are still not signaled yet.
     This can then be used to re-create hw state if (and only if!) the driver
knows what it's doing and can actually guarantee that this will work.
     E.g. the case for XE where the kernel driver knows the contexts which
were not running at the time and can re-create their queues.

2. We can provide both a wq to use for single threaded application as well
as start/stop semantics.
     It's just that the start/stop semantics should never touch what was
already submitted, but rather just make sure that we don't get any new
submissions.

I pretty much agree with everything Christian has said here and Xe
aligns with this. Let me explain what Xe does.

1. Entity hang (TDR timeout of job on a entity, firmware notifies Xe that a
entity hung, entity IOMMU CAT error, etc...):
	- No re-submission at all
	- ban the entity
	- notify the UMD
	- cleanup all pending jobs / fences
2. Entire GPU hang (worth mentioning with good HW + KMD this *should*
never happen):
	- stop all schedulers (same as a entity in Xe because 1 to 1)
	- cleanup odd entity state related to communication with the
	  firmware
	- check if an entity has a job that started but not finished, if
	  so ban it (same mechanism as above)
	- resubmit all jobs from good entities

As long as you can do this without creating new dma_fence objects for 
the hw submissions everything should be fine.

	- start all schedulers (same as a entity in Xe because 1 to 1)

The implementation for this in the following file [1]. Search for the
drm scheduler functions and you should be able to find implementation
easily.

If you want to use an ordered work queue to avoid the stop / start dance
great do that, in Xe the stop / start dance works. I have extensively
tested this and the flow is rock solid and please trust me when I say
this as I worked on reset flows in the i915 and fought bugs for years, I
still don't think it is in the i915 because we try to do resubmission +
crazy races. Because of that I was incredibly paranoid when I
implemented this + tested it extensively.

I'll post an updated version of my DRM scheduler series [2] on the list
soon for the WQ changes, plus whatever else is required for Xe.

So my take from this discussion is maybe with a little documentation, we
are good. Thoughts?

For XE what you describe above basically sounds perfectly fine to me.

I strongly assume when you re-submit things you just tell your hw 
scheduler to pick up at the place it left of? E.g. set a read pointer 
and write pointer of a ring buffer appropriately?

Christian.

Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c
[2] https://patchwork.freedesktop.org/series/116055/

Regards,
Christian.

Regards,

Boris