On 16.10.2023 13:03, Christian König wrote:
Am 13.10.23 um 16:34 schrieb Karolina Stolarek:
In rare cases, a delayed destruction of a BO with a system resource
could stay in the workqueue until drain_workqueue() is called in
ttm_device_fini(). An attempt to free a resource from an already
released manager results in NULL pointer dereference. Move the step
of draining and destroying the workqueue so it happens before the
ttm_sys_manager cleanup.
Good catch! But it's not only the drainage of the workqueue which is
at the wrong place here.
Removing the device from the device list should come before
destroying the system domain as well.
I wonder when this would cause problems. But of course, it makes sense
to fix both with one patch.
So I think you should probably rather move the chunk:
man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
ttm_resource_manager_set_used(man, false);
ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
after the destroy_workqueue() instead of the other way around.>
Apart from that looks good to me.
Thanks for taking a look and the suggestion, I'll send a v2 once I
finish testing the change. Also, I'll drop the Fixes tag, as it seems
the order issue has been there since the very beginning.
All the best,
Karolina
Thanks, Christian.
Fixes: 9bff18d13473 ("drm/ttm: use per BO cleanup workers")
Signed-off-by: Karolina Stolarek <karolina.stolarek@xxxxxxxxx> ---
Some background: I stumbled upon this issue when testing
ttm_bo_pipeline_gutting() with BO with an active dma_resv fence.
In ~2% of the runs, the delayed destruction of the ghost wouldn't
happen until the drain_queue() step. man->func->free(man, *res)
got called via ttm_bo_cleanup_memtype_use(), the manager and its
functions were nowhere to be seen, resulting in a nulptr deref.
drivers/gpu/drm/ttm/ttm_device.c | 6 +++--- 1 file changed, 3
insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_device.c
b/drivers/gpu/drm/ttm/ttm_device.c index
7726a72befc5..753126581620 100644 ---
a/drivers/gpu/drm/ttm/ttm_device.c +++
b/drivers/gpu/drm/ttm/ttm_device.c @@ -232,6 +232,9 @@ void
ttm_device_fini(struct ttm_device *bdev) struct
ttm_resource_manager *man; unsigned i; + drain_workqueue(bdev->wq);
+ destroy_workqueue(bdev->wq); + man = ttm_manager_type(bdev,
TTM_PL_SYSTEM); ttm_resource_manager_set_used(man, false);
ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL); @@ -240,9
+243,6 @@ void ttm_device_fini(struct ttm_device *bdev)
list_del(&bdev->device_list); mutex_unlock(&ttm_global_mutex); -
drain_workqueue(bdev->wq); - destroy_workqueue(bdev->wq); -
spin_lock(&bdev->lru_lock); for (i = 0; i < TTM_MAX_BO_PRIORITY;
++i) if (list_empty(&man->lru[0]))