Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Friedrich Vock <friedrich.vock@xxxxxx> · Sat, 13 Jan 2024 15:24:16 +0100

On 13.01.24 15:02, Joshua Ashton wrote:
We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.

Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that time out.

There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton <joshua@xxxxxxxxx>
Tested-by: Friedrich Vock <friedrich.vock@xxxxxx>

Cc: Friedrich Vock <friedrich.vock@xxxxxx>
Cc: Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx>
Cc: Christian König <christian.koenig@xxxxxxx>
Cc: André Almeida <andrealmeid@xxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
  		dma_fence_set_error(fence, -ENODATA);
  	spin_unlock_irqrestore(fence->lock, flags);

+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
  	atomic_inc(&ring->adev->gpu_reset_counter);
  	while (!dma_fence_is_signaled(fence) &&
  	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
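
For readers unfamiliar with the karma/guilty feedback described in the commit message, here is a rough userspace sketch of the idea. It is a toy model only: struct toy_ctx, struct toy_job, and the toy_* functions are hypothetical stand-ins, not the kernel's real structures or the actual drm_sched_increase_karma() implementation, and it leaves out the scheduler's hang limit and entity lookup. The assumption is that bumping a job's karma ends up setting a guilty flag on the submitting context, which userspace can then query and treat as device loss.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a GPU context; not the real amdgpu_ctx. */
struct toy_ctx {
	atomic_int guilty;	/* non-zero once a job from this context hung */
};

/* Hypothetical stand-in for a submitted job; not the real amdgpu_job. */
struct toy_job {
	atomic_int karma;	/* number of hangs attributed to this job */
	struct toy_ctx *ctx;	/* context that submitted the job, may be NULL */
	bool has_vm;		/* models the "job->vm" check in the patch */
};

/* Toy analogue of bumping karma: count the hang and flag the owner. */
void toy_increase_karma(struct toy_job *job)
{
	atomic_fetch_add(&job->karma, 1);
	if (job->ctx != NULL)
		atomic_store(&job->ctx->guilty, 1);
}

/*
 * Toy analogue of the patched soft recovery path: only jobs that have a
 * VM attached are blamed, mirroring the "if (job->vm)" guard in the diff.
 */
void toy_soft_recovery(struct toy_job *job)
{
	if (job->has_vm)
		toy_increase_karma(job);
}

/* What a userspace driver would poll to decide it has lost the device. */
bool toy_ctx_is_guilty(struct toy_ctx *ctx)
{
	return atomic_load(&ctx->guilty) != 0;
}

/* Small walk-through: a job with a VM is blamed, a job without one is not. */
void toy_demo(void)
{
	struct toy_ctx user_ctx = { 0 };
	struct toy_ctx kernel_ctx = { 0 };
	struct toy_job user_job = { .karma = 0, .ctx = &user_ctx, .has_vm = true };
	struct toy_job kernel_job = { .karma = 0, .ctx = &kernel_ctx, .has_vm = false };

	toy_soft_recovery(&user_job);
	toy_soft_recovery(&kernel_job);

	assert(toy_ctx_is_guilty(&user_ctx));
	assert(!toy_ctx_is_guilty(&kernel_ctx));
}
```

In this model, without the toy_increase_karma() call the guilty flag would stay clear after a soft recovery, so an application polling it would see nothing wrong and keep submitting the same hanging work. That is the behaviour the commit message describes.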