Re: [PATCH v3 5/5] drm/amdgpu: switch workload context to/from compute

"Sharma, Shashank" <shashank.sharma@xxxxxxx> · Fri, 30 Sep 2022 12:09:36 +0200

On 9/30/2022 11:54 AM, Lazar, Lijo wrote:

On 9/30/2022 2:52 PM, Sharma, Shashank wrote:

On 9/30/2022 11:13 AM, Lazar, Lijo wrote:

On 9/30/2022 2:07 PM, Sharma, Shashank wrote:

On 9/30/2022 7:08 AM, Lazar, Lijo wrote:

On 9/30/2022 12:02 AM, Alex Deucher wrote:
On Thu, Sep 29, 2022 at 10:14 AM Lazar, Lijo <lijo.lazar@xxxxxxx> 
wrote:

On 9/29/2022 7:30 PM, Sharma, Shashank wrote:

On 9/29/2022 3:37 PM, Lazar, Lijo wrote:
To be clear your understanding -

Nothing is automatic in PMFW. PMFW picks a priority based on the
actual mask sent by driver.

Assuming lower bits corresponds to highest priority -

If driver sends a mask with Bit3 and Bit 0 set, PMFW will chose
profile that corresponds to Bit0. If driver sends a mask with Bit4
Bit2 set and rest unset, PMFW will chose profile that 
corresponds to
Bit2. However if driver sends a mask only with a single bit 
set, it
chooses the profile regardless of whatever was the previous 
profile. t
doesn't check if the existing profile > newly requested one. 
That is
the behavior.

So if a job send chooses a profile that corresponds to Bit0, 
driver
will send that. Next time if another job chooses a profile that
corresponds to Bit1, PMFW will receive that as the new profile and
switch to that. It trusts the driver to send the proper 
workload mask.

Hope that gives the picture.

Thanks, my understanding is also similar, referring to the core 
power
switch profile function here:
amd_powerplay.c::pp_dpm_switch_power_profile()
*snip code*
hwmgr->workload_mask |= (1 << hwmgr->workload_prority[type]);
          index = fls(hwmgr->workload_mask);
          index = index <= Workload_Policy_Max ? index - 1 : 0;
          workload = hwmgr->workload_setting[index];
*snip_code*
hwmgr->hwmgr_func->set_power_profile_mode(hwmgr, &workload, 0);

Here I can see that the new workload mask is appended into the 
existing
workload mask (not overwritten). So if we keep sending new
workload_modes, they would be appended into the workload flags and
finally the PM will pick the most aggressive one of all these 
flags, as
per its policy.

Actually it's misleading -

The path for sienna is -
set_power_profile_mode -> sienna_cichlid_set_power_profile_mode

This code here is a picking one based on lookup table.

   workload_type = smu_cmn_to_asic_specific_index(smu,

CMN2ASIC_MAPPING_WORKLOAD,

smu->power_profile_mode);

This is that lookup table -

static struct cmn2asic_mapping
sienna_cichlid_workload_map[PP_SMC_POWER_PROFILE_COUNT] = {
          WORKLOAD_MAP(PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT,
WORKLOAD_PPLIB_DEFAULT_BIT),
          WORKLOAD_MAP(PP_SMC_POWER_PROFILE_FULLSCREEN3D,
WORKLOAD_PPLIB_FULL_SCREEN_3D_BIT),
          WORKLOAD_MAP(PP_SMC_POWER_PROFILE_POWERSAVING,
WORKLOAD_PPLIB_POWER_SAVING_BIT),
          WORKLOAD_MAP(PP_SMC_POWER_PROFILE_VIDEO,
WORKLOAD_PPLIB_VIDEO_BIT),
          WORKLOAD_MAP(PP_SMC_POWER_PROFILE_VR,
WORKLOAD_PPLIB_VR_BIT),
          WORKLOAD_MAP(PP_SMC_POWER_PROFILE_COMPUTE,
WORKLOAD_PPLIB_COMPUTE_BIT),
          WORKLOAD_MAP(PP_SMC_POWER_PROFILE_CUSTOM,
WORKLOAD_PPLIB_CUSTOM_BIT),
};

And this is the place of interaction with PMFW. (1 << 
workload_type) is
the mask being sent.

         smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_SetWorkloadMask,
                                      1 << workload_type, NULL);

In the end, driver implementation expects only one bit to be set.

Shashank and I had a discussion about this today.  I think there 
are a
few thing we can do to handle this better:

1. Set a flag that if the user changes the default via sysfs that
overrides any runtime setting via an application since presumably 
that
is what the user wants and we won't change the hint at runtime.
2. Drop the GET API.  There's no need for this, the hint is just a 
hint.

Double checked again based on Felix's comments on API definition. 
Driver decides the priority instead of FW. That way we can still 
keep Get API.

2. Have the driver arbitrate between the available workload profiles
based on the numeric value of the hint (e.g., default < 3D < video <
VR < compute) as the higher values are more aggressive in most cases.
If requests come in for 3D and compute at the same time, the driver
will select compute because it's value is highest.  Each hint type
would be ref counted so we'd know what state to be in every time 
we go
to set the state.  If all of the clients requesting compute go away,
and only 3D requestors remain, we can switch to 3D.  If all refcounts
go to 0, we go back to default.  This will not require any change to
the current workload API in the SMU code.

Since PM layer decides priority, refcount can be kept at powerplay 
and swsmu layer instead of any higher level API.

User API may keep something like req_power_profile (for any 
logging/debug purpose) for the job preference.

No, I think there has been enough confusion around this 
implementation so far, we will implement this just as Alex/Felix 
suggested:
- No change will be done in pm/SMU layer.

Well, a confusion doesn't justify bad implementation. You could just 
keep the refcount in workload_setting.

So far, none of us have any reason to believe its a bad 
implementation. Why is it so, again ?

It's only about keeping track of requests at client layer.

There is absolutely nothing bad or wrong with that, as a matter of fact, 
some driver designs prefer to keep it like this, and let the core API 
minimal and focused on core functionality. This is just about choice.

Another API that uses power profile indirectly also will need to take 
care of refcount and we don't need every other API to do that 
separately without knowing what is the final outcome.

And why ? The dpm_switch_power_profile API was introduced to be used 
by a higher level API, and if a consumer API wants to keep track of 
that, its their own call. This doesn't affect internal PM APIs. The 
whole idea is to manage the PM calls without any change in PM APIs.

Just like per-job-switch-profile is a new usage, there could be other 
new cases as well. Also, there are other APIs which indirectly 
manipulates power profile other than sys.

Understand that there was no reference counting for pm profile change so 
far, as it was probably written considering sysfs interface and never 
considered a multi-client environment.

Like workload context, If there are other current/future clients who 
could also use these APIs, it would be a very important reason to add 
this workload reference counting in central pm structure (rather than 
multiple scattered places), so that every new API/consumer can 
understand and use this, and consider a system-wide scenario of DPM 
power profile, instead of a narrow view of its own thread. The central 
counter will indicate that more than one consumers/clients can change 
the power profile, not only this thread.

- Shashank

All I'm saying is keep the refcount at core layer so that regardless of 
wherever it comes from, it keeps the preference.

So instead of this-
                 smu->workload_mask &= ~(1 << smu->workload_prority[type]);

Have something like this -

     smu->workload[type].reqcount--;
     if (!smu->workload[type].reqcount)
                 smu->workload_mask &= ~(1 << 
smu->workload[type].priority);

I guess, the count was not there because there was no usage of multiple 
clients preferring the same profile at the same time. Now that there is 
a case for this, fix it at where required rather than keeping a track of 
it at client layer.

Thanks,
Lijo

- Shashank

Thanks,
Lijo

- The amdgpu_context_workload layer will keep the ref_counting and 
user_workload_hint management, and it will just call and consume the 
pm_switch_workload profile() like any other client.

- We will add a force flag for calls coming from sysfs() interface, 
and it will take the highest priority. No state machine will be 
managed for sysfs, and it will work as it is working today.

- Shashank

Thanks,
Lijo

Alex

Thanks,
Lijo

Now, when we have a single workload:
-> Job1: requests profile P1 via UAPI, ref count = 1
-> driver sends flags for p1
-> PM FW applies profile P1
-> Job executes in profile P1
-> Job goes to reset function, ref_count = 0,
-> Power profile resets

Now, we have conflicts only when we see multiple workloads (Job1 
and Job 2)
-> Job1: requests profile P1 via UAPI, ref count = 1
-> driver sends flags for p1
-> PM FW applies profile P1
-> Job executes in profile P1
-> Job2: requests profile P2 via UAPI, refcount = 2
-> driver sends flags for (P1|P2)
-> PM FW picks the more aggressive of the two (Say P1, stays in P1)
-> Job1 goes to reset function, ref_count = 1, job1 does not 
reset power
profile
-> Job2 goes to reset function, ref_counter = 2, job 2 resets 
Power profile
-> Power profile resets to None

So this state machine looks like if there is only 1 job, it will be
executed in desired mode. But if there are multiple, the most 
aggressive
profile will be picked, and every job will be executed in 
atleast the
requested power profile mode or higher.

Do you find any problem so far ?

- Shashank

Thanks,
Lijo