On 15.01.24 11:26, Christian König wrote:
Am 14.01.24 um 14:00 schrieb Friedrich Vock:
If the IH ring buffer overflows, it's possible that fence signal events
were lost. Check each ring for progress to prevent job timeouts/GPU
hangs due to the fences staying unsignaled despite the work being done.
That's completely unnecessary and in some cases even harmful.
How is it harmful? The only effect it can have is prevent unnecessary
GPU hangs, no? It's not like it hides any legitimate errors that you'd
otherwise see.
We already have a timeout handler for that and overflows point to
severe system problem so they should never occur in a production system.
IH ring buffer overflows are pretty reliably reproducible if you trigger
a lot of page faults, at least on Deck. Why shouldn't enough page faults
in quick succession be able to overflow the IH ring buffer?
The fence fallback timer as it is now is useless for this because it
only gets triggered once after 0.5s. I guess an alternative approach
would be to make a timer trigger for each work item in flight every
0.5s, but why should that be better than just handling overflow errors
as they occur?
Regards,
Friedrich
Regards,
Christian.
Cc: Joshua Ashton <joshua@xxxxxxxxx>
Cc: Alex Deucher <alexander.deucher@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Friedrich Vock <friedrich.vock@xxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index f3b0aaf3ebc6..2a246db1d3a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -209,6 +209,7 @@ int amdgpu_ih_process(struct amdgpu_device *adev,
struct amdgpu_ih_ring *ih)
{
unsigned int count;
u32 wptr;
+ int i;
if (!ih->enabled || adev->shutdown)
return IRQ_NONE;
@@ -227,6 +228,20 @@ int amdgpu_ih_process(struct amdgpu_device
*adev, struct amdgpu_ih_ring *ih)
ih->rptr &= ih->ptr_mask;
}
+ /* If the ring buffer overflowed, we might have lost some fence
+ * signal interrupts. Check if there was any activity so the signal
+ * doesn't get lost.
+ */
+ if (ih->overflow) {
+ for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+ struct amdgpu_ring *ring = adev->rings[i];
+
+ if (!ring || !ring->fence_drv.initialized)
+ continue;
+ amdgpu_fence_process(ring);
+ }
+ }
+
amdgpu_ih_set_rptr(adev, ih);
wake_up_all(&ih->wait_process);
--
2.43.0