Re: [PATCH 1/2] dma-fence: Add a single fence fast path for fence merging

Christian König <christian.koenig@xxxxxxx> · Thu, 14 Nov 2024 09:53:49 +0100

Am 13.11.24 um 18:19 schrieb Tvrtko Ursulin:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>

Testing some workloads in two different scenarios, such as games running
under Gamescope on a Steam Deck, or vkcube under a Plasma desktop, shows
that in a significant portion of calls the dma_fence_unwrap_merge helper
is called with just a single unsignalled fence.

Therefore it is worthile to add a fast path for that case and so bypass
the memory allocation and insertion sort attempts.

You should probably re-order the patches since we need to backport the 
second as bug fix while the first is just an improvement.

There is also a bug in this patch which needs to be fixed.

Tested scenarios:

1) Hogwarts Legacy under Gamescope

450 calls per second to __dma_fence_unwrap_merge.

Percentages per number of fences buckets, before and after checking for
signalled status, sorting and flattening:

    N       Before      After
    0       0.85%
    1      69.80%        ->   The new fast path.
   2-9     29.36%        9%   (Ie. 91% of this bucket flattened to 1 fence)
  10-19
  20-40
   50+

2) Cyberpunk 2077 under Gamescope

1050 calls per second.

    N       Before      After
    0       0.71%
    1      52.53%        ->    The new fast path.
   2-9     44.38%      50.60%  (Ie. half resolved to a single fence)
  10-19     2.34%
  20-40     0.06%
   50+

3) vkcube under Plasma

90 calls per second.

    N       Before      After
    0
    1
   2-9      100%         0%   (Ie. all resolved to a single fence)
  10-19
  20-40
   50+

In the case of vkcube all invocations in the 2-9 bucket were actually
just two input fences.

Nice to have some numbers at hand.


Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
Cc: Christian König <christian.koenig@xxxxxxx>
Cc: Friedrich Vock <friedrich.vock@xxxxxx>
---
  drivers/dma-buf/dma-fence-unwrap.c | 8 +++++++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index 628af51c81af..75c3e37fd617 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -64,8 +64,8 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
  					   struct dma_fence **fences,
  					   struct dma_fence_unwrap *iter)
  {
+	struct dma_fence *tmp, *signaled, **array;

I would name that unsignaled instead.

  	struct dma_fence_array *result;
-	struct dma_fence *tmp, **array;
  	ktime_t timestamp;
  	unsigned int i;
  	size_t count;
@@ -75,6 +75,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
  	for (i = 0; i < num_fences; ++i) {
  		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
  			if (!dma_fence_is_signaled(tmp)) {
+				signaled = tmp;

You need to grab a reference to tmp here if you want to keep it.

It's perfectly possible that tmp is garbage collected as soon as you go 
to the next iteration or leave the loop.

Regards,
Christian.

  				++count;
  			} else {
  				ktime_t t = dma_fence_timestamp(tmp);
@@ -88,9 +89,14 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
  	/*
  	 * If we couldn't find a pending fence just return a private signaled
  	 * fence with the timestamp of the last signaled one.
+	 *
+	 * Or if there was a single unsignaled fence left we can return it
+	 * directly and early since that is a major path on many workloads.
  	 */
  	if (count == 0)
  		return dma_fence_allocate_private_stub(timestamp);
+	else if (count == 1)
+		return dma_fence_get(signaled);
  
  	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
  	if (!array)