Re: [Patch v2] drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards

"Bhardwaj, Rajneesh" <rajneesh.bhardwaj@xxxxxxx> · Thu, 8 Feb 2024 15:01:55 -0500

On 2/8/2024 2:41 PM, Felix Kuehling wrote:

On 2024-02-07 23:14, Rajneesh Bhardwaj wrote:
In certain cooperative group dispatch scenarios the default SPI resource
allocation may cause reduced per-CU workgroup occupancy. Set
COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang
scenarions.

Suggested-by: Joseph Greathouse <Joseph.Greathouse@xxxxxxx>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@xxxxxxx>
---
* Found a bug in the previous reviewed version
https://lists.freedesktop.org/archives/amd-gfx/2024-February/104101.html
   since the q->is_gws is unset for keeping the count.
* updated pqm_set_gws to pass minfo holding gws state for the active
   queues and use that to apply the FORCE_SIMD_DIST_MASK.

  drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c        | 4 ++++
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h                  | 1 +
  drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 4 +++-
  3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
index 42d881809dc7..0b71db4c96b5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
@@ -303,6 +303,10 @@ static void update_mqd(struct mqd_manager *mm, 
void *mqd,
          update_cu_mask(mm, mqd, minfo, 0);
      set_priority(m, q);
  +    if (minfo && KFD_GC_VERSION(mm->dev) >= IP_VERSION(9, 4, 2))
+        m->compute_resource_limits = minfo->gws ?
+            COMPUTE_RESOURCE_LIMITS__FORCE_SIMD_DIST_MASK : 0;
+

This looks OK because we don't set anything else in 
m->compute_resource_limits. If that ever changes, we have to be more 
careful here to not wipe out other fields in that register.


Yes, Should I change it to below and send a v3?

 m->compute_resource_limits |= minfo->gws ?
            COMPUTE_RESOURCE_LIMITS__FORCE_SIMD_DIST_MASK : 0;




      q->is_active = QUEUE_IS_ACTIVE(*q);
  }
  diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 677281c0793e..f4b327a2d4a8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -542,6 +542,7 @@ struct mqd_update_info {
          } cu_mask;
      };
      enum mqd_update_flag update_flag;
+    bool gws;

Instead of adding a new bool, can we add a flag to mqd_update_flag?

Maybe, I initially thought about it but then I chose the bool approach 
since  those debug flags are generic KFD non per-Asic flags while this 
bool is per-Asic request so I felt they didn't fit together. On the 
other hand, those flags and this bool are both quirks anyways so maybe 
they can be together.   Please let me know your preference.



Looks good to me otherwise.

Regards,
  Felix


  };
    /**
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 43eff221eae5..5416a110ced9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -95,6 +95,7 @@ void kfd_process_dequeue_from_device(struct 
kfd_process_device *pdd)
  int pqm_set_gws(struct process_queue_manager *pqm, unsigned int qid,
              void *gws)
  {
+    struct mqd_update_info minfo = {0};
      struct kfd_node *dev = NULL;
      struct process_queue_node *pqn;
      struct kfd_process_device *pdd;
@@ -146,9 +147,10 @@ int pqm_set_gws(struct process_queue_manager 
*pqm, unsigned int qid,
      }
        pdd->qpd.num_gws = gws ? dev->adev->gds.gws_size : 0;
+    minfo.gws = !!gws;
        return pqn->q->device->dqm->ops.update_queue(pqn->q->device->dqm,
-                            pqn->q, NULL);
+                            pqn->q, &minfo);
  }
    void kfd_process_dequeue_from_all_devices(struct kfd_process *p)