Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated

Christian König <christian.koenig@xxxxxxx> · Wed, 3 May 2023 17:01:50 +0200

Am 03.05.23 um 15:10 schrieb Lucas Stach:
Am Mittwoch, dem 03.05.2023 um 13:40 +0200 schrieb Christian König:
Hi Lucas,

Am 03.05.23 um 12:28 schrieb Lucas Stach:
Hi Christian,

Am Mittwoch, dem 03.05.2023 um 10:47 +0200 schrieb Christian König:
Adding Luben as well.

Am 03.05.23 um 10:16 schrieb Boris Brezillon:
[SNIP]
To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
After the discussion I had with Matthew yesterday on IRC, I
realized there was no clear agreement on this. Matthew uses those 3
helpers in the Xe driver right now, and given he intends to use a
multi-threaded wq for its 1:1 schedulers run queue, there's no way he
can get away without calling drm_sched_{start,stop}().
drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
wondering if it wouldn't be preferable to add a ::resubmit_job() method
or extend the ::run_job() one to support the resubmit semantics, which,
AFAIU, is just about enforcing the job done fence (the one returned by
::run_job()) doesn't transition from a signaled to an unsignaled state.

But probably more important than providing a generic helper, we should
document the resubmit semantics (AKA, what should and/or shouldn't be
done with pending jobs when a recovery happens). Because forbidding
people to use a generic helper function doesn't give any guarantee that
they'll do the right thing when coding their own logic, unless we give
clues about what's considered right/wrong, and the current state of the
doc is pretty unclear in this regard.
I should probably talk about the history of the re-submit feature a bit
more.

Basically AMD came up with re-submission as a cheap way of increasing
the reliability of GPU resets. Problem is that it turned into an
absolutely nightmare. We tried for the last 5 years or so to get that
stable and it's still crashing.

The first and most major problem is that the kernel doesn't even has the
information if re-submitting jobs is possible or not. For example a job
which has already been pushed to the hw could have grabbed a binary
semaphore and re-submitting it will just wait forever for the semaphore
to be released.

I can follow this argument, but concluding that job resubmission is
impossible is punishing simple GPUs. On Vivante GPUs we have exactly
one job running at a time and all dependencies are visible to the
scheduler, as we don't have/use any hardware synchronization mechanism,
so all synchronization is piped through kernel visible fences.

It's reasonably easy for the etnaviv driver to find the guilty job to
skip but resubmit all other jobs in the current hardware queue. I'm not
really fond of having to make all applications deal with innocent
context resets, while we can solve this via resubmission on simple HW.

I know that more complex hardware and use-cases might still require the
kernel driver for this HW to give up and shoot all contexts active at
the time of the GPU reset, but that's the price you pay for the
hardware being more capable. I don't see why we should also pay that
price on really simple HW.
You can still re-create the hw state inside your driver to continue work
from some point if know that this will work.

As I wrote below the scheduler component can even provide help with with
that in the form of providing all the unsignaled hw or scheduler fences
for example.

But what we absolutely should *not* do is to have this re-submission
feature, because that requires re-using the dma_fence objects. In other
words this dance with detaching the scheduler fence from the hw fence
and attach a new one is what absolutely doesn't work.

The second problem is that the dma_fence semantics don't allow to ever
transit the state of a fence from signaled back to unsignaled. This
means that you can't re-use the hw fence and need to allocate a new one,
but since memory allocation is forbidden inside a reset handler as well
(YES we need to better document that part) you actually need to keep a
bunch of hw fences pre-allocated around to make this work. Amdgpu choose
to illegally re-use the hw fence instead which only works with quite
extreme hacks.

I'm with Boris here. Could you please explain when a fence would be
already signaled in a GPU reset scenario and would need to go back to
unsignaled, so we are on the same page here?
Take a look at how this re-submission feature of the scheduler works.
The approach is basically that you detach the hw fence from the
scheduler fence and then attach a new one.

Right, but this shouldn't be a problem, as long as the old fence isn't
signaled yet, right? It becomes a problem when the GPU reset and fence
signaling are racing each other, due to insufficient hardware/software
state synchronization.

Exactly that.

I'm sure that the necessary synchronization can be hard to get right,
but it's not the act of switching one unsignaled fence to a new one or
reusing the old unsignaled fence that's causing problems, but the
complications of making sure that old fences don't signal after the
timeout detection, right?

Well sort of, switching the fences is the root of the problem.

Basically the initial hw fence for a submission should never signal 
until the hw is done with that submission. Either completed it or 
aborted it.

The concept that the GPU reset aborts the existing fences then re-submit 
which gives you new hw fences is what doesn't work.

Either you re-use the old hw fence (in which case you wouldn't have to 
switch it in the first place) or you allocate a new one (which violates 
the no-memory allocation requirements).

To be clear: I'm not asking those questions because I think I know any
better, but because I'm actually not sure and would like to understand
your line of thought and background information when you say "this
dance with detaching the scheduler fence from the hw fence and attach a
new one is what absolutely doesn't work" above.

And that is really appreciated, we don't have enough people looking into 
this and especially pushing back on bad ideas.

Driver writers need to understand the limitations of the current
resubmission scheduler code to do better in their driver
implementation, otherwise we just end up with (worse) open-coded
variations of that code in each driver.

Yes, completely agree.

Regards,
Christian.

Regards,
Lucas