[AMD Official Use Only - General]
Ping.
Hi @Koenig, Christian and @Grodzovsky, Andrey,
We found some reset-related issues during stress tests of this
sequence. Please help take a look and give some comments.
Thanks,
Victor
-----Original Message-----
From: Victor Zhao <Victor.Zhao@xxxxxxx>
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deng, Emily <Emily.Deng@xxxxxxx>; Grodzovsky, Andrey
<Andrey.Grodzovsky@xxxxxxx>; Zhao, Victor <Victor.Zhao@xxxxxxx>
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
[background]
During a GPU recovery caused by a hang on one ring (e.g. compute),
jobs from another ring (e.g. gfx) may continue signaling during the
drm_sched_stop stage, and their signaled bit is not cleared.
At the resubmit stage after recovery, a job whose hw fence already
has the signaled bit set calls job done directly instead of going
through fence processing.
This decreases hw_rq_count while the RCU fence pointer is not yet
cleared.
The fence driver slots then overflow: some fences are skipped, and
their RCU pointers stay uncleared, causing an infinite wait for a
free slot when the next fence is emitted.
That infinite wait produces a job timeout on the emitting job, and
the driver then gets stuck at its scheduler stop step because
kthread_park cannot complete.
[how]
1. Move amdgpu_fence_driver_isr_toggle earlier so the interrupt is
   disabled before drm sched stop.
2. Handle all fences in fence process to avoid skipping any when an
   overflow happens.
Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
 2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_virt_fini_data_exchange(adev);
 	}
 
-	amdgpu_fence_driver_isr_toggle(adev, true);
-
 	/* block all schedulers and reset given job's ring */
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
@@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		    amdgpu_device_ip_need_full_reset(tmp_adev))
 			amdgpu_ras_suspend(tmp_adev);
 
+		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
+
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = tmp_adev->rings[i];
 
@@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		atomic_inc(&tmp_adev->gpu_reset_counter);
 	}
 
-	if (need_emergency_restart)
+	if (need_emergency_restart) {
+		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
+			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+		}
 		goto skip_sched_resume;
+	}
 
 	/*
 	 * Must check guilty signal here since after this point all old
@@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	if (job && dma_fence_is_signaled(&job->hw_fence)) {
 		job_signaled = true;
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
+		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
+			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+		}
 		goto skip_hw_reset;
 	}
 
@@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		if (r && r == -EAGAIN) {
 			set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
 			adev->asic_reset_res = 0;
+			amdgpu_fence_driver_isr_toggle(adev, true);
 			goto retry;
 		}
 	}
@@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
 	set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
 
+	amdgpu_fence_driver_isr_toggle(adev, true);
+
 	adev->no_hw_access = true;
 	r = amdgpu_device_pre_asic_reset(adev, &reset_context);
 	adev->no_hw_access = false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..65a877e1a7fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
 	if (unlikely(seq == last_seq))
 		return false;
 
-	last_seq &= drv->num_fences_mask;
-	seq &= drv->num_fences_mask;
-
 	do {
 		struct dma_fence *fence, **ptr;
 
 		++last_seq;
-		last_seq &= drv->num_fences_mask;
-		ptr = &drv->fences[last_seq];
+		ptr = &drv->fences[last_seq & drv->num_fences_mask];
 
 		/* There is always exactly one thread signaling this fence slot */
 		fence = rcu_dereference_protected(*ptr, 1);