[Public]

Sorry for missing this. Busy with new ASIC bring-up. Let's adopt this for
now, and we can revise it later if we do see some corner case.

Reviewed-by: Evan Quan <evan.quan@xxxxxxx>

Thanks,
Evan

> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> Sent: Tuesday, March 8, 2022 1:58 PM
> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
> Subject: RE: [PATCH] drm/amd/pm: Send message when resp status is 0xFC
>
> [Public]
>
> Hi Evan,
>
> Any comments on this?
>
> Thanks,
> Lijo
>
> -----Original Message-----
> From: Lazar, Lijo
> Sent: Friday, February 25, 2022 7:30 PM
> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
> Subject: Re: [PATCH] drm/amd/pm: Send message when resp status is 0xFC
>
> On 2/25/2022 6:33 PM, Quan, Evan wrote:
> > [AMD Official Use Only]
> >
> >> -----Original Message-----
> >> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> >> Sent: Friday, February 25, 2022 3:43 PM
> >> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander
> >> <Alexander.Deucher@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
> >> Subject: Re: [PATCH] drm/amd/pm: Send message when resp status is 0xFC
> >>
> >> On 2/25/2022 1:02 PM, Quan, Evan wrote:
> >>> [AMD Official Use Only]
> >>>
> >>>> -----Original Message-----
> >>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> >>>> Sent: Friday, February 25, 2022 2:03 PM
> >>>> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>>> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander
> >>>> <Alexander.Deucher@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
> >>>> Subject: Re: [PATCH] drm/amd/pm: Send message when resp status is 0xFC
> >>>>
> >>>> On 2/25/2022 11:25 AM, Quan, Evan wrote:
> >>>>> [AMD Official Use Only]
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> >>>>>> Sent: Friday, February 25, 2022 1:47 PM
> >>>>>> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>>>>> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander
> >>>>>> <Alexander.Deucher@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
> >>>>>> Subject: RE: [PATCH] drm/amd/pm: Send message when resp status is 0xFC
> >>>>>>
> >>>>>> [AMD Official Use Only]
> >>>>>>
> >>>>>>> That is, the caller can perform something like issuing the same
> >>>>>>> message again without a prerequisites check on PMFW busy
> >>>>>>
> >>>>>> This patch expects this method. The caller may try to resend the
> >>>>>> message again. As part of message sending, the driver first checks
> >>>>>> the response register. The current logic blocks sending any message
> >>>>>> if it sees 0xFC in the response register; this patch addresses that.
> >>>>> [Quan, Evan] Yes, I know. But the caller here could be another one.
> >>>>> I mean, another caller may step in.
> >>>>>
> >>>> That shouldn't cause an issue for the second caller if it gets the
> >>>> message mutex. The second caller should also be able to send a message
> >>>> if the PMFW has become free by that time. The first caller can retry
> >>>> when it gets the message mutex back. The FW doesn't maintain any state
> >>>> for the 0xFC response; any other message may be sent after that. If
> >>>> the driver keeps state based on two callers, that is a logic problem
> >>>> in the driver. I don't think we have any flow like that.
> >>> [Quan, Evan] Yeah, but there may be some case where the messages issued
> >>> by the two callers have a dependency.
> >>> That means the message issued by the 2nd caller should come only after
> >>> the 1st one. The one I can think of is that the "EnableAllSmuFeatures"
> >>> message should come after the "SetAllowedFeatures" message.
> >>> Although that should not cause any problem, I'm not sure whether there
> >>> is any other similar case.
> >>>
> >>> What I suggest is something like below. We just do it again in
> >>> smu_cmn_send_smc_msg_with_param() on PMFW busy.
> >>>
> >>> int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
> >>>                                     enum smu_message_type msg,
> >>>                                     uint32_t param,
> >>>                                     uint32_t *read_arg)
> >>> {
> >>>         ...
> >>>         ...
> >>>         mutex_lock(&smu->message_lock);
> >>>         reg = __smu_cmn_poll_stat(smu);
> >>>         res = __smu_cmn_reg2errno(smu, reg);
> >>>         if (reg == SMU_RESP_NONE ||
> >>>             reg == SMU_RESP_BUSY_OTHER ||
> >>>             res == -EREMOTEIO) {
> >>>                 __smu_cmn_reg_print_error(smu, reg, index, param, msg);
> >>>                 goto Out;
> >>>         }
> >>> +retry:
> >>>         __smu_cmn_send_msg(smu, (uint16_t) index, param);
> >>>         reg = __smu_cmn_poll_stat(smu);
> >>>         res = __smu_cmn_reg2errno(smu, reg);
> >>> +       if (reg == SMU_RESP_BUSY_OTHER) {
> >>> +               mdelay(1);
> >>> +               goto retry;
> >>> +       }
> >>
> >> I suppose the retry option should be left to the caller. Regardless of
> >> retry or not, the patch is still valid.
> >>
> >> Example situation -
> >>
> >> rocm-smi is trying to get metrics, and another app is trying to set a
> >> performance profile. If the metrics fetch fails, and even retries of
> >> the metrics fetch fail after some loops, rocm-smi is free to fetch the
> >> metrics again after, say, 5s. That also shouldn't prevent the second
> >> app from sending the performance profile message, and that app may also
> >> retry later.
> > [Quan, Evan] I have no doubt the second app may still be able to send
> > the performance profile message.
> > However, the metrics data retrieved by rocm-smi will be different under
> > such a scenario. That is, after the performance profile setting, the
> > clock frequencies may go up.
> > If the first app is sensitive to that (suppose it wants to compare the
> > clock frequencies before and after the performance profile setting),
> > that will be a problem.
>
> Ah, my intention was to describe a different use case; probably a bad
> example.
>
> What I intended is -
>
> rocm-smi is running periodically, collecting metrics data, and another
> app, which doesn't care about the background data collection, is trying
> to do something else. Thus a metrics fetch failure shouldn't prevent the
> other app from changing the performance profile or whatever it intended
> to do. Also, rocm-smi may be able to fetch data during the next polling
> interval, when the PMFW gets relatively free.
>
> > I reconsidered this a bit. Can we add one more parameter to
> > smu_cmn_send_smc_msg_with_param()?
> > That parameter can tell what the caller wants (retry or abort) under
> > the PMFW-busy scenario.
>
> There is one complication to this. The same message could return 0xFC
> under different conditions, and it may be fine for situation A but not
> for situation B. For example, during driver load, or init after a reset,
> we don't expect PMFW to be busy. The driver may send a message to get
> metrics data to check, say, whether smart shift is enabled or not. A
> failure with 0xFC is not a good sign at that time, and we should abort
> there.
>
> The same message sent during runtime (like rocm-smi data collection) may
> fail with 0xFC due to workload conditions.
>
> Adding another parameter will complicate things, as we need to check
> every condition of message usage. In the present case, a message failure
> with 0xFC is treated as a failure during init and we don't proceed. For
> runtime use, apps always have a choice whether to retry the same API or
> not.
>
> Thanks,
> Lijo
>
> > BR
> > Evan
> >>
> >> Thanks,
> >> Lijo
> >>
> >>>         ...
> >>>         ...
> >>> }
> >>>
> >>> BR
> >>> Evan
> >>>>
> >>>> Basically, 0xFC is not a valid pre-condition check for sending any
> >>>> message. As per the PMFW team, it only means that PMFW was busy when
> >>>> a previous message was sent, and PMFW won't change the response
> >>>> status when it becomes free.
> >>>>
> >>>> Thanks,
> >>>> Lijo
> >>>>
> >>>>> BR
> >>>>> Evan
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Lijo
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Quan, Evan <Evan.Quan@xxxxxxx>
> >>>>>> Sent: Friday, February 25, 2022 11:07 AM
> >>>>>> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>>>>> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander
> >>>>>> <Alexander.Deucher@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
> >>>>>> Subject: RE: [PATCH] drm/amd/pm: Send message when resp status is 0xFC
> >>>>>>
> >>>>>> [AMD Official Use Only]
> >>>>>>
> >>>>>> This may introduce some problems for two-caller scenarios. That is,
> >>>>>> the 2nd one will still proceed even if the 1st one was already
> >>>>>> blocked.
> >>>>>> Maybe the logic here should be performed by the caller. That is,
> >>>>>> the caller can perform something like issuing the same message
> >>>>>> again without a prerequisites check on PMFW busy.
> >>>>>> Or we can just update the smu_cmn_send_smc_msg APIs to give it
> >>>>>> another try on PMFW busy.
> >>>>>>
> >>>>>> BR
> >>>>>> Evan
> >>>>>>> -----Original Message-----
> >>>>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> >>>>>>> Sent: Friday, February 25, 2022 12:22 PM
> >>>>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>>>>>> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander
> >>>>>>> <Alexander.Deucher@xxxxxxx>; Wang, Yang(Kevin)
> >>>>>>> <KevinYang.Wang@xxxxxxx>; Quan, Evan <Evan.Quan@xxxxxxx>
> >>>>>>> Subject: [PATCH] drm/amd/pm: Send message when resp status is 0xFC
> >>>>>>>
> >>>>>>> When PMFW is really busy, it will respond with 0xFC. However, it
> >>>>>>> doesn't change the response register state when it becomes free.
> >>>>>>> The driver should retry and proceed to send the message if the
> >>>>>>> response status is 0xFC.
> >>>>>>>
> >>>>>>> Signed-off-by: Lijo Lazar <lijo.lazar@xxxxxxx>
> >>>>>>> ---
> >>>>>>>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 2 --
> >>>>>>>  1 file changed, 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>>> index 590a6ed12d54..92161b9d8c1a 100644
> >>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>>> @@ -297,7 +297,6 @@ int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
> >>>>>>>  	reg = __smu_cmn_poll_stat(smu);
> >>>>>>>  	res = __smu_cmn_reg2errno(smu, reg);
> >>>>>>>  	if (reg == SMU_RESP_NONE ||
> >>>>>>> -	    reg == SMU_RESP_BUSY_OTHER ||
> >>>>>>>  	    res == -EREMOTEIO)
> >>>>>>>  		goto Out;
> >>>>>>>  	__smu_cmn_send_msg(smu, msg_index, param);
> >>>>>>> @@ -391,7 +390,6 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
> >>>>>>>  	reg = __smu_cmn_poll_stat(smu);
> >>>>>>>  	res = __smu_cmn_reg2errno(smu, reg);
> >>>>>>>  	if (reg == SMU_RESP_NONE ||
> >>>>>>> -	    reg == SMU_RESP_BUSY_OTHER ||
> >>>>>>>  	    res == -EREMOTEIO) {
> >>>>>>>  		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
> >>>>>>>  		goto Out;
> >>>>>>> --
> >>>>>>> 2.25.1