Re: [PATCH v2 3/7] drm/amdgpu: Add new function to put GPU power profile

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Fri, 25 Aug 2023 16:57:39 +0530

On 8/25/2023 4:48 PM, Yadav, Arvind wrote:

On 8/22/2023 6:16 PM, Lazar, Lijo wrote:


On 8/22/2023 5:41 PM, Yadav, Arvind wrote:
Hi Lijo,

The *_set function will set the GPU power profile and the *_put 
function will  schedule the
smu_delayed_work task after 100ms delay. This smu_delayed_work task 
will clear a GPU
power profile if any new jobs are not scheduled within 100 ms. But if 
any new job  comes within 100ms
then the *_workload_profile_set function  will cancel this work and 
set the GPU power profile based on
preferences.

Please see the below case.

case 1 - only same profile jobs run. It will take 100ms to clear the 
profile once all jobs complete.

                                            wl = VIDEO <100ms>
workload _________|`````````````````````````````````````|____

Jobs (VIDEO) ________|```|__|```|___|````|___________


Case2 - two jobs of two different profile. job1 profile will be set 
but when job2 will arrive it will be moved
         to higher profile.

                                  wl = VIDEO  ->    wl = COMPUTE <100ms>
workload 
___|``````````````````````````````````````````````````````````````````|____ 


Jobs (VIDEO) ___|```|__|```|___|````|___|````|_______

Jobs (COMPUTE) ______________|```|___|````|___|````|_________



Case3 - two jobs of two different profile. job1 profile will be set 
but when job2 will arrive it will not be moved
to lower profile. When compute job2 will complete then only it will 
move to lower profile.

                                              wl = COMPUTE 
->               wl = VIDEO  <100ms>
workload 
_________|``````````````````````````````````````````````````````````````````|____ 


Jobs (COMPUTE)    ____|```|__|```|___|````|___|````|_______

Jobs (VIDEO) 
___________________|```|___|````|___|````|___|````|___________


swsmu layer maintains a workload mask based on priority. So once you 
have set the mask, until you unset it (i.e when refcount = 0), the 
mask will be set in the lower layer. swsmu layer will take care of 
requesting FW the highest priority. I don't think that needs to be 
repeated at this level.

At this layer, all you need is to refcount the requests and make the 
request.

When refcount of a profile becomes non-zero (only one-time), place one 
request for that profile. As swsmu layer maintains the workload mask, 
it will take the new profile also into consideration while requesting 
for the one  with the highest priority.

When refcount of a profile becomes zero, place a request to clear it. 
This is controlled by your idle work. As I see, it keeps an additional 
100ms tolerance before placing a clear request. In that way, there is 
no need to cancel that work.

Inside idle work handler -
Loop through the profiles that are set and clear those profiles whose 
refcount is zero.

Thus if a job starts during the 100ms delay, idle work won't see the 
ref count as zero and then it won't place a request to clear out that 
profile.

Hi Liji,

Thank you for your comment. We would be considering your comment but we 
would retain the same design.


All things aside, the entire idea of switching power profile for every 
job submission on a ring looks like an 'abuse' of the power profile 
design. The goal of power profile is to keep a specific profile for a 
sustained workload - like gaming mode, cinema mode etc. It's not meant 
for like switch profile with every job submission which lasts ms or 
lesser (though you may argue it takes only highest priority profile). 
This design is to keep interrupting FW every now and then thinking 
driver is doing better. For any normal/mixed use scenarios, FW 
algorithms could handle it better with all the activity monitors they have.

If you are going ahead, please also make sure to post the improved 
performance numbers you are getting with this.

Thanks,
Lijo

~Arvind.

On 8/22/2023 10:21 AM, Lazar, Lijo wrote:


On 8/21/2023 12:17 PM, Arvind Yadav wrote:
This patch adds a function which will clear the GPU
power profile after job finished.

This is how it works:
- schedular will set the GPU power profile based on ring_type.
- Schedular will clear the GPU Power profile once job finished.
- Here, the *_workload_profile_set function will set the GPU
   power profile and the *_workload_profile_put function will
   schedule the smu_delayed_work task after 100ms delay. This
   smu_delayed_work task will clear a GPU power profile if any
   new jobs are not scheduled within 100 ms. But if any new job
   comes within 100ms then the *_workload_profile_set function
   will cancel this work and set the GPU power profile based on
   preferences.

v2:
- Splitting workload_profile_set and workload_profile_put
   into two separate patches.
- Addressed review comment.

Cc: Shashank Sharma <shashank.sharma@xxxxxxx>
Cc: Christian Koenig <christian.koenig@xxxxxxx>
Cc: Alex Deucher <alexander.deucher@xxxxxxx>
Signed-off-by: Arvind Yadav <Arvind.Yadav@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c  | 97 
+++++++++++++++++++
  drivers/gpu/drm/amd/include/amdgpu_workload.h |  3 +
  2 files changed, 100 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c
index e661cc5b3d92..6367eb88a44d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c
@@ -24,6 +24,9 @@
    #include "amdgpu.h"
  +/* 100 millsecond timeout */
+#define SMU_IDLE_TIMEOUT    msecs_to_jiffies(100)
+
  static enum PP_SMC_POWER_PROFILE
  ring_to_power_profile(uint32_t ring_type)
  {
@@ -59,6 +62,80 @@ amdgpu_power_profile_set(struct amdgpu_device 
*adev,
      return ret;
  }
  +static int
+amdgpu_power_profile_clear(struct amdgpu_device *adev,
+               enum PP_SMC_POWER_PROFILE profile)
+{
+    int ret = amdgpu_dpm_switch_power_profile(adev, profile, false);
+
+    if (!ret) {
+        /* Clear the bit for the submitted workload profile */
+        adev->smu_workload.submit_workload_status &= ~(1 << profile);
+    }
+
+    return ret;
+}
+
+static void
+amdgpu_power_profile_idle_work_handler(struct work_struct *work)
+{
+
+    struct amdgpu_smu_workload *workload = container_of(work,
+                              struct amdgpu_smu_workload,
+                              smu_delayed_work.work);
+    struct amdgpu_device *adev = workload->adev;
+    bool reschedule = false;
+    int index  = fls(workload->submit_workload_status);
+    int ret;
+
+    mutex_lock(&workload->workload_lock);
+    for (; index > 0; index--) {

Why not use for_each_set_bit?

We are clearing which we have only set it. We will clear first higher 
profile then lower.


You don't need to do take care of this. swsmu layer will take care of 
the priority. It is not the job of this layer to take care of 
priority. swsmu is the layer that could be altered specific to each 
SOC, and that can take care of any priority changes accordingly. This 
layer only needs to ref count the requests and place accordingly.



+        int val = atomic_read(&workload->power_profile_ref[index]);
+
+        if (val) {
+            reschedule = true;

Why do you need to do reschedule? For each put(), a schedule is 
called. If refcount is not zero, that means some other job has 
already set the profile. It is supposed to call put() and at that 
time, this job will be run to clear it anyway, right?

Yes, I have got the comment for this I am going to remove this.
Noted.

+        } else {
+            if (workload->submit_workload_status &
+                (1 << index)) {
+                ret = amdgpu_power_profile_clear(adev, index);
+                if (ret) {
+                    DRM_WARN("Failed to clear workload %s,error = 
%d\n",
+                         amdgpu_workload_mode_name[index], ret);
+                    goto exit;
+                }
+            }
+        }
+    }
+    if (reschedule)
+ schedule_delayed_work(&workload->smu_delayed_work,
+                      SMU_IDLE_TIMEOUT);
+exit:
+    mutex_unlock(&workload->workload_lock);
+}
+
+void amdgpu_workload_profile_put(struct amdgpu_device *adev,
+                 uint32_t ring_type)
+{
+    struct amdgpu_smu_workload *workload = &adev->smu_workload;
+    enum PP_SMC_POWER_PROFILE profile = 
ring_to_power_profile(ring_type);
+
+    if (profile == PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT)
+        return;
+
+    mutex_lock(&workload->workload_lock);
+
+    if (!atomic_read(&workload->power_profile_ref[profile])) {
+        DRM_WARN("Power profile %s ref. count error\n",
+             amdgpu_workload_mode_name[profile]);
+    } else {
+ atomic_dec(&workload->power_profile_ref[profile]);
+ schedule_delayed_work(&workload->smu_delayed_work,
+                      SMU_IDLE_TIMEOUT);
+    }
+
+    mutex_unlock(&workload->workload_lock);
+}
+
  void amdgpu_workload_profile_set(struct amdgpu_device *adev,
                   uint32_t ring_type)
  {
@@ -70,13 +147,30 @@ void amdgpu_workload_profile_set(struct 
amdgpu_device *adev,
          return;
        mutex_lock(&workload->workload_lock);
+ cancel_delayed_work_sync(&workload->smu_delayed_work);

This is a potential deadlock. You already hold the mutex and then 
waiting for idle work to finish. Idle work could now be at the point 
where it is waiting for the same mutex. Suggest not to call cancel 
here and let the mutex take care of the sequence.
We cannot cancel if idle work is running. So we have to wait until 
ideal work is complete. If *put function arrived before ideal work is 
not stated then we can cancel it. but if it is running work thread we 
should wait.

No need to wait, because you already have a mutex. So you will be 
waiting naturally for the mutex lock to be released (if at all idle 
work already grabbed it). If a request comes in at the time while idle 
work is running it is only a timing issue.

Also you have a deadlock here. Here you acquired the mutex first and 
then waiting for the idle work to finish. The idle work function would 
have just started at that point and reached to the place where it is 
going to grab mutex. That is a deadlock. This function is waiting for 
idle work to finish and idle work is waiting to get the mutex.

Nevertheless, this function doesn't even need to take care of such 
fancy things. It only grabs the mutex and increases the refcount, 
places a request if refcount became non-zero.

At whatever point, idle work runs, it will see that the refcount is 
not zero and skips placing a request to clear that profile.


        ret = amdgpu_power_profile_set(adev, profile);
      if (ret) {
          DRM_WARN("Failed to set workload profile to %s, error = 
%d\n",
               amdgpu_workload_mode_name[profile], ret);
+        goto exit;
+    }
+
+    /* Clear the already finished jobs of higher power profile*/
+    for (int index = fls(workload->submit_workload_status);
+         index > profile; index--) {
+        if (!atomic_read(&workload->power_profile_ref[index]) &&
+            workload->submit_workload_status & (1 << index)) {
+            ret = amdgpu_power_profile_clear(adev, index);
+            if (ret) {
+                DRM_WARN("Failed to clear workload %s, err = %d\n",
+                     amdgpu_workload_mode_name[profile], ret);
+                goto exit;
+            }
+        }

If you follow the earlier comment, that will keep this logic only at 
one place - i.e, at idle work handler. Basically just let the idle 
work handle its duty. If some job starts running during the clear 
call, it's just unfortunate timing and let the next set() take the 
lock and request profile again.

So basically for every millisecond  new jobs are coming and 
completing it to the same or different profile . Suppose we are 
running higher profile jobs and  before it completes if a lower job 
arrives, this check will help to move the higher profile to lower 
profile once higher profile finishes it. If we are not checking here 
then it will stuck on higher profile until then other jobs will also 
not complete. Please refer case3 scenario.


As mentioned before, this is not the place to take care of SOC 
specific power profile priorities. We already have swsmu layer doing 
that job. This layer just needs to do a ref count and place requests 
accordingly.

Thanks,
Lijo


Thanks,
Lijo

      }
  +exit:
      mutex_unlock(&workload->workload_lock);
  }
  @@ -87,6 +181,8 @@ void amdgpu_workload_profile_init(struct 
amdgpu_device *adev)
      adev->smu_workload.initialized = true;
mutex_init(&adev->smu_workload.workload_lock);
+ INIT_DELAYED_WORK(&adev->smu_workload.smu_delayed_work,
+              amdgpu_power_profile_idle_work_handler);
  }
    void amdgpu_workload_profile_fini(struct amdgpu_device *adev)
@@ -94,6 +190,7 @@ void amdgpu_workload_profile_fini(struct 
amdgpu_device *adev)
      if (!adev->smu_workload.initialized)
          return;
  + cancel_delayed_work_sync(&adev->smu_workload.smu_delayed_work);
      adev->smu_workload.submit_workload_status = 0;
      adev->smu_workload.initialized = false;
mutex_destroy(&adev->smu_workload.workload_lock);
diff --git a/drivers/gpu/drm/amd/include/amdgpu_workload.h 
b/drivers/gpu/drm/amd/include/amdgpu_workload.h
index 5022f28fc2f9..ee1f87257f2d 100644
--- a/drivers/gpu/drm/amd/include/amdgpu_workload.h
+++ b/drivers/gpu/drm/amd/include/amdgpu_workload.h
@@ -46,6 +46,9 @@ static const char * const 
amdgpu_workload_mode_name[] = {
      "Window3D"
  };
  +void amdgpu_workload_profile_put(struct amdgpu_device *adev,
+                 uint32_t ring_type);
+
  void amdgpu_workload_profile_set(struct amdgpu_device *adev,
                   uint32_t ring_type);