Re: [PATCH] drm/ttm: Schedule delayed_delete worker closer

Christian König <christian.koenig@xxxxxxx> · Wed, 8 Nov 2023 09:36:50 +0100

On 07.11.23 at 20:45, Rajneesh Bhardwaj wrote:
When a TTM BO is getting freed, to optimize the clearing operation on
the workqueue, schedule it closer to a NUMA node where the memory was
allocated. This avoids the cases where the ttm_bo_delayed_delete gets
scheduled on the CPU cores that are across interconnect boundaries such
as xGMI, PCIe etc.

This needs more background and doesn't mention that we now try to 
allocate the memory close to the device.

Something like this should work:

Try to allocate system memory on the NUMA node the device is closest to 
and try to run delayed delete workers on a CPU of this node as well.

The reason for running the delayed delete worker on a NUMA node close
to that of the initial allocation is that the core memory management
might clear the memory on free, and that should probably be done on a
CPU close to it.
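
For illustration only, the node-local selection with a system-wide
fallback can be modelled in plain userspace C. This is a rough sketch
and not the kernel's workqueue code: select_cpu_near(), NO_NODE,
ANY_CPU and the two topology tables are hypothetical stand-ins invented
for the example. It only assumes the behaviour discussed in this
thread, namely that a CPU on the requested node is preferred and any
CPU is used when the node has none available.

```c
#include <assert.h>

#define NR_CPUS 8
#define NO_NODE (-1)	/* hypothetical stand-in for NUMA_NO_NODE */
#define ANY_CPU (-1)	/* hypothetical "no preference, run anywhere" */

/* Hypothetical topology: cpu_node[i] is the NUMA node of CPU i. */
static const int cpu_node[NR_CPUS]   = { 0, 0, 0, 0, 1, 1, 1, 1 };
/* Hypothetical online map: node 1 has no CPU online at all. */
static const int cpu_online[NR_CPUS] = { 1, 1, 0, 1, 0, 0, 0, 0 };

/*
 * Pick a CPU for work that should run close to 'node'.
 * Returns a CPU index, or ANY_CPU if no node-local CPU is usable.
 */
static int select_cpu_near(int node, int current_cpu)
{
	int cpu;

	assert(current_cpu >= 0 && current_cpu < NR_CPUS);

	/* No node preference at all: let the work run anywhere. */
	if (node == NO_NODE)
		return ANY_CPU;

	/* Stay on the current CPU if it already sits on the node. */
	if (cpu_online[current_cpu] && cpu_node[current_cpu] == node)
		return current_cpu;

	/* Otherwise take the first online CPU that belongs to the node. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_online[cpu] && cpu_node[cpu] == node)
			return cpu;

	/* The node has no usable CPU: fall back to any CPU. */
	return ANY_CPU;
}
```

With these made-up tables, a request for node 0 issued from CPU 5
lands on CPU 0, a request for node 0 issued from CPU 3 stays on CPU 3,
and a request for node 1 falls back to ANY_CPU because none of its
CPUs are online.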


This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
APU platforms such as GFXIP9.4.3.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@xxxxxxx>
---
  drivers/gpu/drm/ttm/ttm_bo.c     | 10 +++++++++-
  drivers/gpu/drm/ttm/ttm_device.c |  3 ++-
  2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 5757b9415e37..0d608441a112 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref)
  			spin_unlock(&bo->bdev->lru_lock);
  
  			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
-			queue_work(bdev->wq, &bo->delayed_delete);
+			/* Schedule the worker on the closest NUMA node, if no
+			 * CPUs are available, this falls back to any CPU core
+			 * available system wide.

Mentioning that is superfluous: everybody can look at the
implementation, and it is obvious that a function which doesn't return
an error has a fallback.

  This helps avoid the
+			 * bottleneck to clear memory in cases where the worker
+			 * is scheduled on a CPU which is remote to the node
+			 * where the memory is getting freed.
+			 */

Rather write something like "This improves performance since system 
memory might be cleared on free and that is best done on a CPU core 
close to it."

Regards,
Christian.

+
+			queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
  			return;
  		}
  
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 43e27ab77f95..72b81a2ee6c7 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
  	bdev->funcs = funcs;
  
  	ttm_sys_man_init(bdev);
-	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
+
+	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
  
  	bdev->vma_manager = vma_manager;
  	spin_lock_init(&bdev->lru_lock);