Re: [RFC] drm/amdgpu: More efficient ring padding

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Fri, 12 Jul 2024 15:04:14 +0200

Am 12.07.24 um 11:14 schrieb Tvrtko Ursulin:
On 12/07/2024 08:33, Christian König wrote:
Am 11.07.24 um 20:17 schrieb Tvrtko Ursulin:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>

 From the department of questionable optimisations today we have a 
minor
improvement to how padding / filling the rings with nops is done.

Having noticed that typically 200+ nops per submission are filled 
into the
ring, using a rather verbose 
one-nop-at-a-time-plus-ring-buffer-arithmetic
as done in amdgpu_ring_write(), while right next to it there is
amdgpu_ring_write_multiple(), I thought why not pre-cache a block of 
nops
and write them out more efficiently.

The patch is rather quick and dirty, does not deal with all insert_nops
vfuncs, and the cache should probably go one level up so it is not
replicated per amdgpu_ring instance.

Why should that be more effective? We essentially use more cache 
lines than before.

Because:

        for (i = 0; i < count; i++)
            amdgpu_ring_write(ring, ring->funcs->nop);

Expands to quite a lot compared to one memcpy from 
amdgpu_ring_write_multiple, and only one set of ring pointer arithmetic?

Well maybe make another function, e.g. something like amdgpu_ring_fill() 
which just fills in the same dw multiple times.

Or if we only use it here just inline that.


And performance gains are not that amazing for normal workloads. For
instance a game which results in two submissions per frame, each pads
with 222 nops, submission worker thread profile changes from:

Mhm, why the heck are we using so many nops in the first place?

If that was a question for me I cannot offer but a superficially 
obvious answer - because ring->funcs->align_mask is 0xff on many 
platforms? I mean on the face of it it is doing what it wants to do - 
pad to 256 dword boundary before passing the updated ring to the GPU.

???? That would be a 1k byte alignment. I haven't looked into that in 
the last decade, but that used to be no more than 0xf.


Btw another thing could also be more efficient by avoiding the div 
instruction:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 22ec9de62b06..c30206f4cd22 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -151,7 +151,7 @@ void amdgpu_ring_commit(struct amdgpu_ring *ring)
        /* We pad to match fetch size */
        count = ring->funcs->align_mask + 1 -
                (ring->wptr & ring->funcs->align_mask);
-       count %= ring->funcs->align_mask + 1;
+       count &= ring->funcs->align_mask;

Mhm, that's what it originally used it to be (at least in the ring_write 
function).

Regards,
Christian.

ring->funcs->insert_nop(ring, count);

        mb();

Regards,

Tvrtko

+   90.78%     0.67%  kworker/u32:3-e [kernel.kallsyms]  [k] 
process_one_work
+   48.92%     0.12%  kworker/u32:3-e  [kernel.kallsyms]  [k] 
commit_tail
+   41.18%     1.73%  kworker/u32:3-e  [kernel.kallsyms]  [k] 
amdgpu_dm_atomic_commit_tail
-   30.31%     0.67%  kworker/u32:3-e  [kernel.kallsyms]  [k] 
drm_sched_run_job_work
    - 29.63% drm_sched_run_job_work
       + 8.55% dma_fence_add_callback
       - 7.50% amdgpu_job_run
          - 7.43% amdgpu_ib_schedule
             - 2.46% amdgpu_ring_commit
                  1.44% amdgpu_ring_insert_nop

To:

+   89.83%     0.51%  kworker/u32:6-g  [kernel.kallsyms]  [k] 
process_one_work
+   47.65%     0.30%  kworker/u32:6-g  [kernel.kallsyms]  [k] 
commit_tail
+   39.42%     1.97%  kworker/u32:6-g  [kernel.kallsyms]  [k] 
amdgpu_dm_atomic_commit_tail
-   29.57%     1.10%  kworker/u32:6-g  [kernel.kallsyms]  [k] 
drm_sched_run_job_work
    - 28.47% drm_sched_run_job_work
       + 8.52% dma_fence_add_callback
       - 6.33% amdgpu_job_run
          - 6.19% amdgpu_ib_schedule
             - 1.85% amdgpu_ring_commit
                  0.53% amdgpu_ring_insert_nop

Or if we run a more "spammy" workload, which does several orders of
magnitude more submissions second we go from:

+   79.38%     1.66%  kworker/u32:1+e  [kernel.kallsyms]  [k] 
process_one_work
-   63.13%     6.66%  kworker/u32:1+e  [kernel.kallsyms]  [k] 
drm_sched_run_job_work
    - 56.47% drm_sched_run_job_work
       - 25.67% amdgpu_job_run
          - 24.40% amdgpu_ib_schedule
             - 15.29% amdgpu_ring_commit
                  12.06% amdgpu_ring_insert_nop

To:

+   77.76%     1.97%  kworker/u32:6-f  [kernel.kallsyms]  [k] 
process_one_work
-   60.15%     7.04%  kworker/u32:6-f  [kernel.kallsyms]  [k] 
drm_sched_run_job_work
    - 53.11% drm_sched_run_job_work
       - 19.35% amdgpu_job_run
          - 17.85% amdgpu_ib_schedule
             - 7.75% amdgpu_ring_commit
                  3.27% amdgpu_ring_insert_nop

Not bad and "every little helps", or flame-throwers at ready?

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 20 +++++++++++++++-----
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  1 +
  2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index ad49cecb20b8..22ec9de62b06 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -108,10 +108,14 @@ int amdgpu_ring_alloc(struct amdgpu_ring 
*ring, unsigned int ndw)
   */
  void amdgpu_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
  {
-    int i;
+    if (count > 1 && count <= ARRAY_SIZE(ring->nop_cache)) {
+        amdgpu_ring_write_multiple(ring, ring->nop_cache, count);
+    } else {
+        int i;
-    for (i = 0; i < count; i++)
-        amdgpu_ring_write(ring, ring->funcs->nop);
+        for (i = 0; i < count; i++)
+            amdgpu_ring_write(ring, ring->funcs->nop);
+    }
  }
  /**
@@ -124,8 +128,11 @@ void amdgpu_ring_insert_nop(struct amdgpu_ring 
*ring, uint32_t count)
   */
  void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring, struct 
amdgpu_ib *ib)
  {
-    while (ib->length_dw & ring->funcs->align_mask)
-        ib->ptr[ib->length_dw++] = ring->funcs->nop;
+    u32 count = ib->length_dw & ring->funcs->align_mask;
+
+    memcpy(&ib->ptr[ib->length_dw], ring->nop_cache, count * 
sizeof(u32));
+
+    ib->length_dw += count;
  }
  /**
@@ -359,6 +366,9 @@ int amdgpu_ring_init(struct amdgpu_device *adev, 
struct amdgpu_ring *ring,
              &ring->sched;
      }
+    for (r = 0; r < ARRAY_SIZE(ring->nop_cache); r++)
+        ring->nop_cache[r] = ring->funcs->nop;
+
      return 0;
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 582053f1cd56..74ce95b4666a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -246,6 +246,7 @@ struct amdgpu_ring {
      struct amdgpu_bo    *ring_obj;
      volatile uint32_t    *ring;
      unsigned        rptr_offs;
+    u32            nop_cache[256];
      u64            rptr_gpu_addr;
      volatile u32        *rptr_cpu_addr;
      u64            wptr;