[Public]

Hi Andrey and Christian,

I just sent out a new patch to address this; I am not sure whether I understood your point correctly, so please review. The patch stops the scheduler in fence_hw_fini and starts it again in fence_hw_init.

Regards,
Guchun

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Sent: Monday, August 23, 2021 10:42 PM
To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>; Alex Deucher <alexdeucher@xxxxxxxxx>; Mike Lothian <mike@xxxxxxxxxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: amd-gfx list <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Gao, Likun <Likun.Gao@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

On 2021-08-23 2:50 a.m., Christian König wrote:
> Good morning guys,
>
> Andrey has a rather valid concern here, but I think we need to
> approach this from a more high-level view.
>
> When hw_fini is called we should make sure that the scheduler can't
> submit any more work to the hardware, because the hw is finalized and
> not expected to respond any more.
>
> As far as I can see the cleanest approach would be to stop the
> scheduler in hw_fini and fully clean it up in sw_fini. That would also
> fit quite nicely with how GPU reset is supposed to work, I think.
>
> The problem is that this is currently done outside of the fence code,
> at least for the reset case, so before we restructure that we need to
> stick with what we have.
>
> Andrey, do you think it would be any problem if we stop the scheduler
> manually in the hot plug case as well?

As long as it's 'parked' inside hw_fini - meaning the thread submitting to HW is done - I think it should cover hot unplug as well.

Andrey

>
> Thanks,
> Christian.
>
> Am 23.08.21 um 08:36 schrieb Chen, Guchun:
>> [Public]
>>
>> Hi Andrey,
>>
>> Thanks for your notice. The reason for moving drm_sched_fini to
>> sw_fini is that it is a SW behavior and part of SW shutdown, so
>> hw_fini should not touch it. But if the race is there - the scheduler
>> on the ring may keep submitting jobs and leave the ring non-empty -
>> we may still need to call drm_sched_fini first in hw_fini to stop job
>> submission.
>>
>> @Koenig, Christian what's your opinion?
>>
>> Regards,
>> Guchun
>>
>> -----Original Message-----
>> From: Alex Deucher <alexdeucher@xxxxxxxxx>
>> Sent: Friday, August 20, 2021 2:13 AM
>> To: Mike Lothian <mike@xxxxxxxxxxxxxx>
>> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-gfx list <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Gao, Likun <Likun.Gao@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>
>> Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)
>>
>> Please go ahead. Thanks!
>>
>> Alex
>>
>> On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian <mike@xxxxxxxxxxxxxx> wrote:
>>> Hi
>>>
>>> Do I need to open a new bug report for this?
>>>
>>> Cheers
>>>
>>> Mike
>>>
>>> On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> wrote:
>>>>
>>>> On 2021-08-02 1:16 a.m., Guchun Chen wrote:
>>>>> In amdgpu_fence_driver_hw_fini, there is no need to call
>>>>> drm_sched_fini to stop the scheduler in the s3 test; otherwise,
>>>>> fence related failures will arrive after resume.
>>>>> To fix this and for a better clean up, move drm_sched_fini from
>>>>> fence_hw_fini to fence_sw_fini, as it's part of driver shutdown,
>>>>> and should never be called in hw_fini.
>>>>>
>>>>> v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
>>>>> to keep sw_init and sw_fini paired.
>>>>>
>>>>> Fixes: cd87a6dcf6af ("drm/amdgpu: adjust fence driver enable sequence")
>>>>> Suggested-by: Christian König <christian.koenig@xxxxxxx>
>>>>> Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++++++-----
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
>>>>>   3 files changed, 11 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index b1d2dc39e8be..9e53ff851496 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>>>
>>>>>   fence_driver_init:
>>>>>   	/* Fence driver */
>>>>> -	r = amdgpu_fence_driver_init(adev);
>>>>> +	r = amdgpu_fence_driver_sw_init(adev);
>>>>>   	if (r) {
>>>>> -		dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
>>>>> +		dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");
>>>>>   		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 0);
>>>>>   		goto failed;
>>>>>   	}
>>>>> @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>>>>>   	}
>>>>>   	amdgpu_fence_driver_hw_init(adev);
>>>>>
>>>>> -
>>>>>   	r = amdgpu_device_ip_late_init(adev);
>>>>>   	if (r)
>>>>>   		return r;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 49c5c7331c53..7495911516c2 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>   }
>>>>>
>>>>>   /**
>>>>> - * amdgpu_fence_driver_init - init the fence driver
>>>>> + * amdgpu_fence_driver_sw_init - init the fence driver
>>>>>   * for all possible rings.
>>>>>   *
>>>>>   * @adev: amdgpu device pointer
>>>>> @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>   * amdgpu_fence_driver_start_ring().
>>>>>   * Returns 0 for success.
>>>>>   */
>>>>> -int amdgpu_fence_driver_init(struct amdgpu_device *adev)
>>>>> +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
>>>>>   {
>>>>>   	return 0;
>>>>>   }
>>>>>
>>>>>   /**
>>>>> - * amdgpu_fence_driver_fini - tear down the fence driver
>>>>> + * amdgpu_fence_driver_hw_fini - tear down the fence driver
>>>>>   * for all possible rings.
>>>>>   *
>>>>>   * @adev: amdgpu device pointer
>>>>> @@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
>>>>>
>>>>>   		if (!ring || !ring->fence_drv.initialized)
>>>>>   			continue;
>>>>> -		if (!ring->no_scheduler)
>>>>> -			drm_sched_fini(&ring->sched);
>>>>> +
>>>>>   		/* You can't wait for HW to signal if it's gone */
>>>>>   		if (!drm_dev_is_unplugged(&adev->ddev))
>>>>>   			r = amdgpu_fence_wait_empty(ring);
>>>>
>>>> Sorry for late notice, missed this patch.
>>>> By moving drm_sched_fini past amdgpu_fence_wait_empty a race is
>>>> created: even after you have waited for all fences on the ring to
>>>> signal, the SW scheduler will keep submitting new jobs on the ring,
>>>> so the ring won't stay empty.
>>>>
>>>> For hot device removal we also want to prevent any access to HW past
>>>> PCI removal, in order not to do any MMIO accesses inside the physical
>>>> MMIO range that no longer belongs to this device after its removal by
>>>> the PCI core. Stopping all the schedulers prevents any MMIO accesses
>>>> done during job submission, and that is why drm_sched_fini was done
>>>> as part of amdgpu_fence_driver_hw_fini and not
>>>> amdgpu_fence_driver_sw_fini.
>>>>
>>>> Andrey
>>>>
>>>>> @@ -560,6 +559,9 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>>>   		if (!ring || !ring->fence_drv.initialized)
>>>>>   			continue;
>>>>>
>>>>> +		if (!ring->no_scheduler)
>>>>> +			drm_sched_fini(&ring->sched);
>>>>> +
>>>>>   		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>>>>>   			dma_fence_put(ring->fence_drv.fences[j]);
>>>>>   		kfree(ring->fence_drv.fences);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> index 27adffa7658d..9c11ced4312c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> @@ -106,7 +106,6 @@ struct amdgpu_fence_driver {
>>>>>   	struct dma_fence **fences;
>>>>>   };
>>>>>
>>>>> -int amdgpu_fence_driver_init(struct amdgpu_device *adev);
>>>>>   void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
>>>>>
>>>>>   int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>> @@ -115,9 +114,10 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>   int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,
>>>>>   				   struct amdgpu_irq_src *irq_src,
>>>>>   				   unsigned irq_type);
>>>>> +void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);
>>>>>   void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev);
>>>>> +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev);
>>>>>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev);
>>>>> -void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);
>>>>>   int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **fence,
>>>>>   		      unsigned flags);
>>>>>   int amdgpu_fence_emit_polling(struct amdgpu_ring *ring, uint32_t *s,
>
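
For reference, below is a minimal sketch of the approach discussed above - stop the scheduler in fence_hw_fini and start it again in fence_hw_init, while leaving drm_sched_fini in sw_fini. This is illustrative only and not necessarily what Guchun's new patch does: it assumes the scheduler thread can simply be parked/unparked via ring->sched.thread (following Andrey's 'parked' wording; the drm scheduler of that era exposed its kthread there), and it omits the fence interrupt enable/disable that the real hw_init/hw_fini also handle.

/* Illustrative sketch only - not the posted patch. */
#include <linux/kthread.h>
#include "amdgpu.h"

void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
{
	int i, r;

	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring || !ring->fence_drv.initialized)
			continue;

		/* Park the scheduler thread so no new job reaches the HW */
		if (!ring->no_scheduler && ring->sched.thread)
			kthread_park(ring->sched.thread);

		/* You can't wait for HW to signal if it's gone */
		if (!drm_dev_is_unplugged(&adev->ddev))
			r = amdgpu_fence_wait_empty(ring);
		else
			r = -ENODEV;

		/* If the wait failed, force-complete whatever is left */
		if (r)
			amdgpu_fence_driver_force_completion(ring);
	}
}

void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
{
	int i;

	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring || !ring->fence_drv.initialized)
			continue;

		/* HW is alive again - let the scheduler submit jobs */
		if (!ring->no_scheduler && ring->sched.thread)
			kthread_unpark(ring->sched.thread);
	}
}

Parking only pauses the submission thread, so the scheduler's job state survives suspend/resume untouched, and no MMIO from job submission can happen after hw_fini, which also covers the hot-unplug concern. The one-time teardown with drm_sched_fini still happens in amdgpu_fence_driver_sw_fini.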