Am 13.10.23 um 16:34 schrieb Karolina Stolarek:
In rare cases, a delayed destruction of a BO with a system resource
could stay in the workqueue until drain_workqueue() is called
in ttm_device_fini(). An attempt to free a resource from an already
released manager results in NULL pointer dereference. Move the step
of draining and destroying the workqueue so it happens before
the ttm_sys_manager cleanup.
Good catch! But it's not only the drainage of the workqueue which is at
the wrong place here.
Removing the device from the device list should come before destroying
the system domain as well.
So I think you should probably rather move the chunk:
man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
ttm_resource_manager_set_used(man, false);
ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
after the destroy_workqueue() instead of the other way around.
Apart from that looks good to me.
Thanks,
Christian.
Fixes: 9bff18d13473 ("drm/ttm: use per BO cleanup workers")
Signed-off-by: Karolina Stolarek <karolina.stolarek@xxxxxxxxx>
---
Some background: I stumbled upon this issue when testing
ttm_bo_pipeline_gutting() with BO with an active dma_resv fence. In ~2% of
the runs, the delayed destruction of the ghost wouldn't happen until the
drain_queue() step. man->func->free(man, *res) got called via
ttm_bo_cleanup_memtype_use(), the manager and its functions were nowhere to
be seen, resulting in a nulptr deref.
drivers/gpu/drm/ttm/ttm_device.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 7726a72befc5..753126581620 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -232,6 +232,9 @@ void ttm_device_fini(struct ttm_device *bdev)
struct ttm_resource_manager *man;
unsigned i;
+ drain_workqueue(bdev->wq);
+ destroy_workqueue(bdev->wq);
+
man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
ttm_resource_manager_set_used(man, false);
ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
@@ -240,9 +243,6 @@ void ttm_device_fini(struct ttm_device *bdev)
list_del(&bdev->device_list);
mutex_unlock(&ttm_global_mutex);
- drain_workqueue(bdev->wq);
- destroy_workqueue(bdev->wq);
-
spin_lock(&bdev->lru_lock);
for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i)
if (list_empty(&man->lru[0]))