[PATCH 1/2] dma-fence: Add a single fence fast path for fence merging

Tvrtko Ursulin <tursulin@xxxxxxxxxx> · Wed, 13 Nov 2024 17:19:46 +0000

From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>

Testing some workloads in two different scenarios, such as games running
under Gamescope on a Steam Deck, or vkcube under a Plasma desktop, shows
that in a significant portion of calls the dma_fence_unwrap_merge helper
is called with just a single unsignalled fence.

Therefore it is worthile to add a fast path for that case and so bypass
the memory allocation and insertion sort attempts.

Tested scenarios:

1) Hogwarts Legacy under Gamescope

450 calls per second to __dma_fence_unwrap_merge.

Percentages per number of fences buckets, before and after checking for
signalled status, sorting and flattening:

   N       Before      After
   0       0.85%
   1      69.80%        ->   The new fast path.
  2-9     29.36%        9%   (Ie. 91% of this bucket flattened to 1 fence)
 10-19
 20-40
  50+

2) Cyberpunk 2077 under Gamescope

1050 calls per second.

   N       Before      After
   0       0.71%
   1      52.53%        ->    The new fast path.
  2-9     44.38%      50.60%  (Ie. half resolved to a single fence)
 10-19     2.34%
 20-40     0.06%
  50+

3) vkcube under Plasma

90 calls per second.

   N       Before      After
   0
   1
  2-9      100%         0%   (Ie. all resolved to a single fence)
 10-19
 20-40
  50+

In the case of vkcube all invocations in the 2-9 bucket were actually
just two input fences.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
Cc: Christian König <christian.koenig@xxxxxxx>
Cc: Friedrich Vock <friedrich.vock@xxxxxx>
---
 drivers/dma-buf/dma-fence-unwrap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index 628af51c81af..75c3e37fd617 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -64,8 +64,8 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 					   struct dma_fence **fences,
 					   struct dma_fence_unwrap *iter)
 {
+	struct dma_fence *tmp, *signaled, **array;
 	struct dma_fence_array *result;
-	struct dma_fence *tmp, **array;
 	ktime_t timestamp;
 	unsigned int i;
 	size_t count;
@@ -75,6 +75,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 	for (i = 0; i < num_fences; ++i) {
 		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
 			if (!dma_fence_is_signaled(tmp)) {
+				signaled = tmp;
 				++count;
 			} else {
 				ktime_t t = dma_fence_timestamp(tmp);
@@ -88,9 +89,14 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 	/*
 	 * If we couldn't find a pending fence just return a private signaled
 	 * fence with the timestamp of the last signaled one.
+	 *
+	 * Or if there was a single unsignaled fence left we can return it
+	 * directly and early since that is a major path on many workloads.
 	 */
 	if (count == 0)
 		return dma_fence_allocate_private_stub(timestamp);
+	else if (count == 1)
+		return dma_fence_get(signaled);
 
 	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
 	if (!array)
-- 
2.46.0