Am 10.01.25 um 08:37 schrieb Gerry Liu:
2025年1月10日 14:51,Christian König <christian.koenig@xxxxxxx> 写道:
Am 10.01.25 um 03:08 schrieb Jiang Liu:
Function detects initialization status by checking sched->ops, so set
sched->ops to non-NULL just before return in function
amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
to avoid possible invalid memory access on error recover path.
Signed-off-by: Jiang Liu <gerry@xxxxxxxxxxxxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f29a4ad3c6d0..3688e9eb949b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
if (r) {
DRM_ERROR("Failed to create scheduler on ring %s.\n",
ring->name);
+ ring->sched.ops = NULL;
return r;
}
r = amdgpu_uvd_entity_init(adev, ring);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 2f24a6aa13bf..b5e87b515139 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
* The natural check would be sched.ready, which is
* set as drm_sched_init() finishes...
*/
- if (ring->sched.ops)
+ if (ring->sched.ops) {
drm_sched_fini(&ring->sched);
+ ring->sched.ops = NULL;
+ }
As said in my previous reply we should really stop checking sched.ops here.
The scheduler should not be cleaned up by this function in the first place.
Hi Christian,
The current workflow is as below:
amdgpu_device_init
amdgpu_fence_driver_sw_init
amdgpu_device_ip_init
amdgpu_device_init_schedulers
drm_sched_init
amdgpu_fence_driver_hw_init
amdgpu_device_fini_hw
amdgpu_fence_driver_hw_fini
amdgpu_device_fini_sw
amdgpu_device_ip_fini
amdgpu_fence_driver_sw_fini
drm_sched_fini
As you have mentioned, we should introduce amdgpu_device_fini_schedulers() and gets it called by amdgpu_device_ip_fini().
But I have no idea about why currently drm_sched_fini() is called by amdgpu_fence_driver_sw_fini() and whether it’s safe to move it into amdgpu_device_ip_fini().
If I remember correctly we just put the drm_sched_fini() with the ops
check as cheap way of cleaning up all schedulers before we had the
ring->no_scheduler flag which indicates if the scheduler is initialized
or not in the first place.
So it should be save to move the drm_sched_fini() into
amdgpu_device_ip_fini() as well. Just check ring->no_scheduler instead
of ring->sched.ops to decide if the scheduler needs to be cleaned up or not.
Regards,
Christian.
Thanks,
Gerry
Regards,
Christian.
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
dma_fence_put(ring->fence_drv.fences[j]);