On 1/15/24 13:17, Christian König wrote:
Am 15.01.24 um 12:37 schrieb Joshua Ashton:
On 1/15/24 09:40, Christian König wrote:
Am 13.01.24 um 15:02 schrieb Joshua Ashton:
We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.
Big NAK to that approach, the karma handling is completely deprecated.
When you want to signal execution errors please use the fence error
code.
The fence error code does not result in ctx's being marked as guilty,
only the karma path does.
See drm_sched_increase_karma.
Are you proposing that we instead mark contexts as guilty with the
fence error ourselves here?
No, I'm proposing to completely abandon the concept of guilty contexts.
Basically what we should do is to return an error from the CS IOCTL
whenever a previous submission resulted in a fatal error as suggested by
Marek.
Oh, I agree that is broken by design, but this is already implemented
with the current guilt system!
The ioctls already will return -ECANCELLED if you are guilty of a hang:
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c#L64
The query merely exists to give more feedback as to the situation, which
is fine.
That we query the context for guilty was just a design decision we
copied over from our closed source drivers which turned out to
absolutely not solving anything.
Marek can probably comment as well why the whole idea of querying the
kernel if something fatal happens instead of just rejecting submissions
is broken by design.
Without this feedback, the application may keep pushing through the
soft
recoveries, continually hanging the system with jobs that timeout.
Well, that is intentional behavior. Marek is voting for making soft
recovered errors fatal as well while Michel is voting for better
ignoring them.
I'm not really sure what to do. If you guys think that soft recovered
hangs should be fatal as well then we can certainly do this.
They have to be!
As Marek and I have pointed out, applications that hang or fault will
just hang or fault again, especially when they use things like draw
indirect, buffer device address, descriptor buffers, etc.
Ok, well then I now have two people (Marek and you) saying that soft
recovery should be fatal while Michel is saying that soft recovery being
non fatal improves stability for him :)
Should we somehow make that configurable or depend it on if that's the
display server or if it's an user application?
I could probably get every RADV developer, and all of the Proton
graphics team to come in and preach the same thing also. :P
If a compositor/display server is guilty of a hang, it can just listen
for DEVICE_LOST from vkQueueSubmit and re-create it's context, re-import
the dmabufs etc (or restart itself).
FWIU, in the email chain, the thing Daenzer was saying was that Firefox
was falling back to software rendering even when it was innocent, but
that seems incredibly broken by design.
- Joshie 🐸✨
Regards,
Christian.
The majority of apps these days have a lot of side effects and
persistence between frames and submissions.
- Joshie 🐸✨
Regards,
Christian.
There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.
With these, I was able to run Counter-Strike 2 and launch an
application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.
Signed-off-by: Joshua Ashton <joshua@xxxxxxxxx>
Cc: Friedrich Vock <friedrich.vock@xxxxxx>
Cc: Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx>
Cc: Christian König <christian.koenig@xxxxxxx>
Cc: André Almeida <andrealmeid@xxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct
amdgpu_ring *ring, struct amdgpu_job *job)
dma_fence_set_error(fence, -ENODATA);
spin_unlock_irqrestore(fence->lock, flags);
+ if (job->vm)
+ drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
while (!dma_fence_is_signaled(fence) &&
ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)