Hi, Is this a preventive fix or you found errors/oops/hangs? If you had found errors/oops/hangs, can you please share the details? BR, Chandan V N >On 2022-06-21 03:25, Christian König wrote: >> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky: >>> Problem: >>> After we start handling timed out jobs we assume there fences won't >>> be signaled but we cannot be sure and sometimes they fire late. We >>> need to prevent concurrent accesses to fence array from >>> amdgpu_fence_driver_clear_job_fences during GPU reset and >>> amdgpu_fence_process from a late EOP interrupt. >>> >>> Fix: >>> Before accessing fence array in GPU disable EOP interrupt and flush >>> all pending interrupt handlers for amdgpu device's interrupt line. >> >>> >>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 26 >>> ++++++++++++++++++++++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 + >>> 3 files changed, 31 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 2b92281dd0c1..c99541685804 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct >>> amdgpu_device *adev, >>> amdgpu_virt_fini_data_exchange(adev); >>> } >>> + amdgpu_fence_driver_isr_toggle(adev, true); >>> + >>> /* block all schedulers and reset given job's ring */ >>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { >>> struct amdgpu_ring *ring = adev->rings[i]; @@ -4620,6 >>> +4622,8 @@ int amdgpu_device_pre_asic_reset(struct >>> amdgpu_device *adev, >>> amdgpu_fence_driver_force_completion(ring); >>> } >>> + amdgpu_fence_driver_isr_toggle(adev, false); >>> + >>> if (job && job->vm) >>> drm_sched_increase_karma(&job->base); >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>> index a9ae3beaa1d3..d6d54ba4c185 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct >>> amdgpu_device *adev) >>> } >>> } >>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, >>> bool stop) >>> +{ >>> + int i; >>> + >>> + for (i = 0; i < AMDGPU_MAX_RINGS; i++) { >>> + struct amdgpu_ring *ring = adev->rings[i]; >>> + >>> + if (!ring || !ring->fence_drv.initialized || >>> !ring->fence_drv.irq_src) >>> + continue; >>> + >>> + if (stop) >>> + amdgpu_irq_put(adev, ring->fence_drv.irq_src, >>> + ring->fence_drv.irq_type); >>> + else >>> + amdgpu_irq_get(adev, ring->fence_drv.irq_src, >>> + ring->fence_drv.irq_type); >> >> That won't work like this. This increments/decrements the reference >> count for the IRQ, but doesn't guarantee in any way that they are >> stopped/started. > > >I understand that, i just assumed that the fence driver is the only holder of this interrupt source (e.g. regCP_INT_CNTL_RING0) ? >I can disable amdgpu interrupt line totally using disable_irq - would this be better ? > > >> >> >>> + } >>> + >>> + /* TODO Only waits for irq handlers on other CPUs, maybe >>> local_irq_save >>> + * local_irq_local_irq_restore are needed here for local >>> interrupts ? >>> + * >>> + */ >> >> Well that comment made me smile. Think for a moment what the local CPU >> would be doing if an interrupt would run :) > > >No, I understand this of course, I am ok to be interrupted by interrupt handler at this point, what i am trying to do is to prevent amdgpu_fence_process to run concurrently with amdgpu_fence_driver_clear_job_fences - that is what this function is trying to prevent - i disable and >flush pending EOP ISR handlers before the call to clear fences and re-enable after. >I guess we can also introduce a spinlock to serialize them ? Yiqing reported seeing a race between them so we have to do something. > >Andrey > > >> >> Cheers, >> Christian. >> >>> + if (stop) >>> + synchronize_irq(adev->irq.irq); } >>> + >>> void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev) >>> { >>> unsigned int i, j; >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>> index 7d89a52091c0..82c178a9033a 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct >>> amdgpu_ring *ring, >>> uint32_t wait_seq, >>> signed long timeout); >>> unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring); >>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool >>> stop); >>> /* >>> * Rings. >>