Am 14.05.24 um 10:13 schrieb Liang, Prike:
[AMD Official Use Only - AMD Internal Distribution Only]
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Friday, May 10, 2024 5:31 PM
To: Liang, Prike <Prike.Liang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Use the slab allocator to reduce job allocation fragmentation
Am 10.05.24 um 10:11 schrieb Prike Liang:
Using kzalloc() results in about 50% memory fragmentation, therefore
use the slab allocator to reduce memory fragmentation.
Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 26 ++++++++++++++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  1 +
 3 files changed, 23 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index ea14f1c8f430..3de1b42291b6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -3040,6 +3040,7 @@ static void __exit amdgpu_exit(void)
amdgpu_fence_slab_fini();
mmu_notifier_synchronize();
amdgpu_xcp_drv_release();
+ amdgpue_job_slab_fini();
}
module_init(amdgpu_init);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e4742b65032d..8327bf017a0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -31,6 +31,8 @@
#include "amdgpu_trace.h"
#include "amdgpu_reset.h"
+static struct kmem_cache *amdgpu_job_slab;
+
static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
{
	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
@@ -101,10 +103,19 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
if (num_ibs == 0)
return -EINVAL;
- *job = kzalloc(struct_size(*job, ibs, num_ibs), GFP_KERNEL);
- if (!*job)
+ amdgpu_job_slab = kmem_cache_create("amdgpu_job",
+ struct_size(*job, ibs, num_ibs), 0,
+ SLAB_HWCACHE_ALIGN, NULL);
Well, you are declaring a global slab cache for a dynamic job size and then
trying to set it up in the job allocation function, which can be called
concurrently with different numbers of IBs.
To sum it up: this is completely racy and will go boom immediately in testing.
As far as I can see this suggestion is utter nonsense.
Regards,
Christian.
Hi, Christian
The num_ibs is calculated as 1 in amdgpu_cs_p1_ib(), and from amdgpu_cs_pass1() num_ibs is passed as 1 into amdgpu_job_alloc(). Moreover, num_ibs is only set from amdgpu_cs_p1_ib() and shouldn't be overwritten from the user-space driver side. I also checked a few GL and Vulkan applications and didn't find multiple IBs within one amdgpu job submission.
Well, this is just bluntly incorrect. I have no idea where you looked to
come to this conclusion.
Basically UMDs are allowed to submit multiple IBs with each job, so
assuming that it's always 1 just because we use 1 as a simple case
doesn't change that in any way.
See function amdgpu_ring_max_ibs() for the in kernel limit on how many
IBs can be used for each ring type:
/**
 * amdgpu_ring_max_ibs - Return max IBs that fit in a single submission.
 *
 * @type: ring type for which to return the limit.
 */
unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type)
{
	switch (type) {
	case AMDGPU_RING_TYPE_GFX:
		/* Need to keep at least 192 on GFX7+ for old radv. */
		return 192;
	case AMDGPU_RING_TYPE_COMPUTE:
		return 125;
	case AMDGPU_RING_TYPE_VCN_JPEG:
		return 16;
	default:
		return 49;
	}
}
If there are still concerns about the size of the IB array in the allocated amdgpu_job object, we can remove the ibs member and decouple the IBs from the job object. Then we can export and access the IBs as a parameter through a new interface like amdgpu_cs_patch_ibs(struct amdgpu_cs_parser *p, struct amdgpu_job *job, struct amdgpu_ib *ib).
And how should that help? Then we have to allocate the IBs separately
which adds even more overhead.
Regarding this patch: using kmem_cache_zalloc() instead of kzalloc() can save about 448 bytes of memory for each amdgpu_job object allocated. Meanwhile, job object allocation takes almost the same time, so it should have no side effect on performance. If the idea is sensible, I will rework the patch to create the job slab during driver probe.
Well, that initializing a global variable from a function which can be
called from multiple threads at the same time, without any lock, is
completely racy should be obvious and not something I explicitly need to
point out.
Then this patch doesn't even bother to check if the slab was already
allocated, but instead just calls kmem_cache_create() multiple times.
This only works because the kmem_cache code internally merges duplicate
caches with the same parameters. But since you completely messed up the
reference count, this will leak the slab object after driver unload.
Finally, using kzalloc() instead of a slab is completely intentional,
since the allocated object is dynamic in size. This wastes a bit of
memory for allocations where the structure size doesn't end up as a
power of two, but that is obvious.
I'm sorry to say this, but the patch looks like you don't even
understand the basics of what the driver and the code you are trying to
modify are doing. So I have to completely reject the whole idea.
Regards,
Christian.
Thanks,
Prike
+ if (!amdgpu_job_slab) {
+ DRM_ERROR("create amdgpu_job cache failed\n");
return -ENOMEM;
+ }
+ *job = kmem_cache_zalloc(amdgpu_job_slab, GFP_KERNEL);
+ if (!*job) {
+ kmem_cache_destroy(amdgpu_job_slab);
+ return -ENOMEM;
+ }
/*
* Initialize the scheduler to at least some ring so that we always
* have a pointer to adev.
@@ -138,7 +149,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
if (r) {
if (entity)
drm_sched_job_cleanup(&(*job)->base);
- kfree(*job);
+ kmem_cache_free(amdgpu_job_slab, job);
}
return r;
@@ -179,6 +190,11 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
amdgpu_ib_free(ring->adev, &job->ibs[i], f);
}
+void amdgpue_job_slab_fini(void)
+{
+ kmem_cache_destroy(amdgpu_job_slab);
+}
+
static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
{
	struct amdgpu_job *job = to_amdgpu_job(s_job);
@@ -189,7 +205,7 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
/* only put the hw fence if has embedded fence */
if (!job->hw_fence.ops)
- kfree(job);
+ kmem_cache_free(amdgpu_job_slab, job);
else
dma_fence_put(&job->hw_fence);
}
@@ -221,7 +237,7 @@ void amdgpu_job_free(struct amdgpu_job *job)
dma_fence_put(job->gang_submit);
if (!job->hw_fence.ops)
- kfree(job);
+ kmem_cache_free(amdgpu_job_slab, job);
else
dma_fence_put(&job->hw_fence);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
index a963a25ddd62..4491d5f9c74d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
@@ -103,5 +103,6 @@ int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
			     struct dma_fence **fence);
void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched);
+void amdgpue_job_slab_fini(void);
#endif