On 05.03.21 at 09:43, Deng, Emily wrote:
-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: Friday, March 5, 2021 3:55 PM
To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: Fix some unload driver issues
On 05.03.21 at 02:20, Emily Deng wrote:
When unloading the driver after killing some applications, we hit an
SDMA flush-TLB job timeout; the job is submitted from
ttm_bo_delay_delete. To avoid job submission after fence driver fini,
call ttm_bo_lock_delayed_workqueue before fence driver fini, and also
move drm_sched_fini before the fence wait.
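(For context, a minimal sketch of the TTM workqueue API pair involved;
illustrative only, not part of the patch. ttm_bo_unlock_delayed_workqueue
is the usual counterpart, which amdgpu_device_fini() does not need since
the device is being torn down anyway.)

	/*
	 * Illustrative sketch: pausing TTM's delayed-delete workqueue so
	 * that no ttm_bo_delay_delete work (and hence no new SDMA
	 * TLB-flush jobs) can run during the critical section.
	 */
	int resched;

	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);

	/* ... fence driver fini and the rest of device teardown ... */

	/* A path that keeps the device alive would restart the work with: */
	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);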
Good catch, Reviewed-by: Christian König <christian.koenig@xxxxxxx> for
this part.
Set adev->gart.ptr to NULL to fix a NULL pointer issue when
amdgpu_gart_unbind is called from amdgpu_bo_fini, which runs after
gart fini.
For that one I'm wondering if we shouldn't rather rework or double-check
the teardown order.
On the other hand, the hardware should be idle by now (this comes after
the fence_driver_fini, doesn't it?) and it looks cleaner on its own.
Yes, you are right. Leaving memory leaks aside, with the above patch all
the BOs are cleaned up, so there won't be an issue. But if there is a
memory leak, might it still hit an issue in ttm_bo_force_list_clean ->
ttm_mem_evict_first?
Yeah, that is a good argument and part of what I meant by "it looks
cleaner on its own".
Maybe write that into the commit message like this. With that done, the
full patch has my rb.
Regards,
Christian.
Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 5 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   | 1 +
 3 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a11760ec3924..de0597d34588 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3594,6 +3594,7 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 {
 	dev_info(adev->dev, "amdgpu: finishing device.\n");
 	flush_delayed_work(&adev->delayed_init_work);
+	ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
 	adev->shutdown = true;
 
 	kfree(adev->pci_state);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 143a14f4866f..6d16f58ac91e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -531,6 +531,8 @@ void amdgpu_fence_driver_fini(struct amdgpu_device *adev)
 
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
+		if (!ring->no_scheduler)
+			drm_sched_fini(&ring->sched);
 		r = amdgpu_fence_wait_empty(ring);
 		if (r) {
 			/* no need to trigger GPU reset as we are unloading */
@@ -539,8 +541,7 @@ void amdgpu_fence_driver_fini(struct amdgpu_device *adev)
 		if (ring->fence_drv.irq_src)
 			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
 				       ring->fence_drv.irq_type);
-		if (!ring->no_scheduler)
-			drm_sched_fini(&ring->sched);
+
 		del_timer_sync(&ring->fence_drv.fallback_timer);
 		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
 			dma_fence_put(ring->fence_drv.fences[j]);
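(To recap the resulting per-ring order in amdgpu_fence_driver_fini()
after this hunk: a simplified, illustrative sketch with error handling
condensed; amdgpu_fence_driver_force_completion is the existing
fallback from the surrounding code, assumed here from the hunk context.)

	if (!ring->no_scheduler)
		drm_sched_fini(&ring->sched);	/* 1. stop the scheduler: no new submissions */
	r = amdgpu_fence_wait_empty(ring);	/* 2. wait for already-submitted fences */
	if (r)
		/* no need to trigger GPU reset as we are unloading */
		amdgpu_fence_driver_force_completion(ring);
	if (ring->fence_drv.irq_src)		/* 3. only now drop the fence interrupt */
		amdgpu_irq_put(adev, ring->fence_drv.irq_src,
			       ring->fence_drv.irq_type);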
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index 23823a57374f..f1ede4b43d07 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -202,6 +202,7 @@ void amdgpu_gart_table_vram_free(struct amdgpu_device *adev)
 		return;
 	}
 	amdgpu_bo_unref(&adev->gart.bo);
+	adev->gart.ptr = NULL;
 }
 
 /*
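(For reference, this one-liner is enough because amdgpu_gart_unbind()
already skips the CPU-side PTE writes when adev->gart.ptr is NULL. A
simplified sketch of that guard as it appears in kernels of this era;
not verbatim source.)

	for (i = 0; i < pages; i++, p++) {
		page_base = adev->dummy_page_addr;
		if (!adev->gart.ptr)
			continue;	/* table already freed: skip the PTE writes */

		for (j = 0; j < AMDGPU_GPU_PAGES_IN_CPU_PAGE; j++, t++) {
			amdgpu_gmc_set_pte_pde(adev, adev->gart.ptr, t,
					       page_base, flags);
			page_base += AMDGPU_GPU_PAGE_SIZE;
		}
	}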