Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 1/15/24 13:19, Christian König wrote:
Am 15.01.24 um 12:54 schrieb Joshua Ashton:
[SNIP]

The question here is really if we should handled soft recovered errors as fatal or not. Marek is in pro of that Michel is against it.

Figure out what you want in userspace and I'm happy to implement it :)


(That being said, without my patches, RADV treats *any* reset from the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it was not guilty, so any faulting/hanging application causes every Vulkan app to die e_e. This is fixed in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )

That is actually intended behavior. When something disrupted the GPU execution and the application is affected it is mandatory to forward this error to the application.

No. If said context was entirely unaffected, then it should be completely transparent to the application.

Consider the following:

 - I have Counter-Strike 2 running
 - I have Gamescope running

I then go ahead and start HangApp that hangs the GPU.

Soft recovery happens and that clears out all the work for the specific VMID for HangApp's submissions and signals the submission fence.

In this instance, the Gamescope and Counter-Strike 2 ctxs are completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as there was no impact to their work.

Ok, that is something I totally agree on. But why would the Gamescope and Counter-Strike 2 app report VK_ERROR_DEVICE_LOST for a soft recovery in the first place?

IIRC a soft recovery doesn't increment the reset counter in any way. So they should never be affected.

It does, and without assigning any guilt, amdgpu_ring_soft_recovery calls atomic_inc(&ring->adev->gpu_reset_counter).



Regards,
Christian.


Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem with HangApp, FWIU the way that the clear-out works being vmid specific means that they would be unaffected, right?

- Joshie 🐸✨


Regards,
Christian.


- Joshie 🐸✨


Signed-off-by: Joshua Ashton <joshua@xxxxxxxxx>

Cc: Friedrich Vock <friedrich.vock@xxxxxx>
Cc: Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx>
Cc: Christian König <christian.koenig@xxxxxxx>
Cc: André Almeida <andrealmeid@xxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
          dma_fence_set_error(fence, -ENODATA);
      spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+        drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
      while (!dma_fence_is_signaled(fence) &&
             ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)





- Joshie 🐸✨



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux