Re: [PATCH v3 2/2] drm/xe/lnl: Offload system clear page activity to GPU

Nirmoy Das <nirmoy.das@xxxxxxxxxxxxxxx> · Thu, 27 Jun 2024 19:04:19 +0200

On 6/27/2024 9:55 AM, Thomas Hellström wrote:
Hi Nirmoy

On Mon, 2024-06-24 at 16:14 +0200, Nirmoy Das wrote:
On LNL because of flat CCS, driver creates a migrate job to clear
CCS meta data. Extend that to also clear system pages using GPU.
Inform TTM to allocate pages without __GFP_ZERO to avoid double page
clearing by clearing out TTM_TT_FLAG_ZERO_ALLOC flag and set
TTM_TT_FLAG_CLEARED_ON_FREE while freeing to skip ttm pool's
clearn-on-free as XE now takes care of clearing pages.

To test the patch, I created a small test that tries to submit a job
after binding various sizes of buffer which shows good gains for
larger
buffer. For lower buffer sizes, the result is not very reliable as
the
results vary a lot.
Some concerns below,

also a big security concern.

The CCS clearing occurs when the bo is moved to TT. But there are
situations in which the bo is created and populated in system. For
example if the bo is created using
DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING, and then mmap'd Then it won't get
cleared. Since we don't have a dma_mapping of the bo at that time we
must revert to cpu clear when / that happens.

Thanks for pointing this out. I completely missed the fact  that BO can 
be accessed only by CPU.



With the patch
sudo  ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-
benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.10.0-rc2-xe+ x86_64)
Using IGT_SRANDOM=1719237905 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 9493 us
Time taken for size SZ_2M: 5503 us
Time taken for size SZ_64M: 13016 us
Time taken for size SZ_128M: 29464 us
Time taken for size SZ_256M: 38408 us
Time taken for size SZ_1G: 148758 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 3889 us
Time taken for size SZ_2M: 6091 us
Time taken for size SZ_64M: 20920 us
Time taken for size SZ_128M: 32394 us
Time taken for size SZ_256M: 61710 us
Time taken for size SZ_1G: 215437 us
Subtest basic-store-benchmark: SUCCESS (0.589s)

With the patch:
sudo  ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-
benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.10.0-rc2-xe+ x86_64)
Using IGT_SRANDOM=1719238062 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 11803 us
Time taken for size SZ_2M: 4237 us
Time taken for size SZ_64M: 8649 us
Time taken for size SZ_128M: 14682 us
Time taken for size SZ_256M: 22156 us
Time taken for size SZ_1G: 74457 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 5129 us
Time taken for size SZ_2M: 12563 us
Time taken for size SZ_64M: 14860 us
Time taken for size SZ_128M: 26064 us
Time taken for size SZ_256M: 47167 us
Time taken for size SZ_1G: 170304 us
Subtest basic-store-benchmark: SUCCESS (0.417s)

With the patch and init_on_alloc=0
sudo  ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-
benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.10.0-rc2-xe+ x86_64)
Using IGT_SRANDOM=1719238219 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 4803 us
Time taken for size SZ_2M: 9212 us
Time taken for size SZ_64M: 9643 us
Time taken for size SZ_128M: 13479 us
Time taken for size SZ_256M: 22429 us
Time taken for size SZ_1G: 83110 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 4003 us
Time taken for size SZ_2M: 4443 us
Time taken for size SZ_64M: 12960 us
Time taken for size SZ_128M: 13741 us
Time taken for size SZ_256M: 26841 us
Time taken for size SZ_1G: 84746 us
Subtest basic-store-benchmark: SUCCESS (0.290s)

v2: Handle regression on dgfx(Himal)
     Update commit message as no ttm API changes needed.
v3: Fix Kunit test.

Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@xxxxxxxxx>
Cc: Matthew Auld <matthew.auld@xxxxxxxxx>
Cc: "Thomas Hellström" <thomas.hellstrom@xxxxxxxxxxxxxxx>
Signed-off-by: Nirmoy Das <nirmoy.das@xxxxxxxxx>
---
  drivers/gpu/drm/xe/xe_bo.c           | 11 +++++++++++
  drivers/gpu/drm/xe/xe_device.c       |  7 +++++++
  drivers/gpu/drm/xe/xe_device_types.h |  2 ++
  drivers/gpu/drm/xe/xe_migrate.c      |  5 +++--
  4 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 65c696966e96..a9ce4347a7d7 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -399,6 +399,7 @@ static struct ttm_tt *xe_ttm_tt_create(struct
ttm_buffer_object *ttm_bo,
  static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct
ttm_tt *tt,
  			      struct ttm_operation_ctx *ctx)
  {
+	struct xe_device *xe = ttm_to_xe_device(ttm_dev);
  	int err;
  
  	/*
@@ -408,6 +409,10 @@ static int xe_ttm_tt_populate(struct ttm_device
*ttm_dev, struct ttm_tt *tt,
  	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
  		return 0;
  
+	/* Clear TTM_TT_FLAG_ZERO_ALLOC when GPU is set to clear
pages */
+	if (xe->mem.gpu_page_clear)
+		tt->page_flags &= ~TTM_TT_FLAG_ZERO_ALLOC;
+
  	err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
  	if (err)
  		return err;
@@ -417,9 +422,15 @@ static int xe_ttm_tt_populate(struct ttm_device
*ttm_dev, struct ttm_tt *tt,
  
  static void xe_ttm_tt_unpopulate(struct ttm_device *ttm_dev, struct
ttm_tt *tt)
  {
+	struct xe_device *xe = ttm_to_xe_device(ttm_dev);
+
  	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
  		return;
  
+	/* Add TTM_TT_FLAG_CLEARED_ON_FREE when GPU is set to clear
pages */
+	if (xe->mem.gpu_page_clear)
+		tt->page_flags |= TTM_TT_FLAG_CLEARED_ON_FREE;
+
  	xe_tt_unmap_sg(tt);
  
  	return ttm_pool_free(&ttm_dev->pool, tt);
diff --git a/drivers/gpu/drm/xe/xe_device.c
b/drivers/gpu/drm/xe/xe_device.c
index ca5e8435485a..c9afff1d0734 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -636,6 +636,13 @@ int xe_device_probe(struct xe_device *xe)
  	if (err)
  		goto err_irq_shutdown;
  
+	/**
+	 * On iGFX device with flat CCS, we clear CCS metadata,
let's extend that
+	 * and use GPU to clear pages as well.
+	 */
+	if (xe_device_has_flat_ccs(xe) && !IS_DGFX(xe))
+		xe->mem.gpu_page_clear = true;
+
  	err = xe_vram_probe(xe);
  	if (err)
  		goto err_irq_shutdown;
diff --git a/drivers/gpu/drm/xe/xe_device_types.h
b/drivers/gpu/drm/xe/xe_device_types.h
index c37be471d11c..ece68c6f3668 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -325,6 +325,8 @@ struct xe_device {
  		struct xe_mem_region vram;
  		/** @mem.sys_mgr: system TTM manager */
  		struct ttm_resource_manager sys_mgr;
+		/** @gpu_page_clear: clear pages offloaded to GPU */
+		bool gpu_page_clear;
  	} mem;
  
  	/** @sriov: device level virtualization data */
diff --git a/drivers/gpu/drm/xe/xe_migrate.c
b/drivers/gpu/drm/xe/xe_migrate.c
index 05f933787860..61bf3d80e896 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1003,6 +1003,7 @@ struct dma_fence *xe_migrate_clear(struct
xe_migrate *m,
  	struct xe_gt *gt = m->tile->primary_gt;
  	struct xe_device *xe = gt_to_xe(gt);
  	bool clear_system_ccs = (xe_bo_needs_ccs_pages(bo) &&
!IS_DGFX(xe)) ? true : false;
+	bool clear_on_create = xe->mem.gpu_page_clear ||
!clear_system_ccs;
Hm. In what situation is clear_on_create false?
On igfx < xe2
Previously we had clear_system_ccs to clear *only* ccs, but now that
situation has changed to clear_also_ccs?

I think the xe_migrate_clear should not bother whether this is an
initial clearing or a clearing for other reasons, but rather be passed
two flags

"clear_bo_data" and "clear_ccs" or something similar, and also we
should avoid variable names that refer to usage by higher layers of
which the migrate code should have no knowledge.

v3 was to  fix kunit test which should be fix with a double param that 
you are suggesting.

Thanks for looking into this. I will resend with above suggestions and 
after some more testing.


Thanks,

Nirmoy


/Thomas


  	struct dma_fence *fence = NULL;
  	u64 size = bo->size;
  	struct xe_res_cursor src_it;
@@ -1032,7 +1033,7 @@ struct dma_fence *xe_migrate_clear(struct
xe_migrate *m,
  		batch_size = 2 +
  			pte_update_size(m, clear_vram, src, &src_it,
  					&clear_L0, &clear_L0_ofs,
&clear_L0_pt,
-					clear_system_ccs ? 0 :
emit_clear_cmd_len(gt), 0,
+					!clear_on_create ? 0 :
emit_clear_cmd_len(gt), 0,
  					avail_pts);
  
  		if (xe_device_has_flat_ccs(xe))
@@ -1060,7 +1061,7 @@ struct dma_fence *xe_migrate_clear(struct
xe_migrate *m,
  		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
  		update_idx = bb->len;
  
-		if (!clear_system_ccs)
+		if (clear_on_create)
  			emit_clear(gt, bb, clear_L0_ofs, clear_L0,
XE_PAGE_SIZE, clear_vram);
  
  		if (xe_device_has_flat_ccs(xe)) {