Re: [PATCH v4 4/5] arm64: dts: sdm845: Add OPP tables and power-domains for venus

Bjorn Andersson <bjorn.andersson@xxxxxxxxxx> · Wed, 29 Jul 2020 13:38:20 -0700

On Tue 28 Jul 13:11 PDT 2020, Lina Iyer wrote:

> On Tue, Jul 28 2020 at 13:51 -0600, Stephen Boyd wrote:
> > Quoting Lina Iyer (2020-07-28 09:52:12)
> > > On Mon, Jul 27 2020 at 18:45 -0600, Stephen Boyd wrote:
> > > >Quoting Lina Iyer (2020-07-24 09:28:25)
> > > >> On Fri, Jul 24 2020 at 03:03 -0600, Rajendra Nayak wrote:
> > > >> >Hi Maulik/Lina,
> > > >> >
> > > >> >On 7/23/2020 11:36 PM, Stanimir Varbanov wrote:
> > > >> >>Hi Rajendra,
> > > >> >>
> > > >> >>After applying 2,3 and 4/5 patches on linaro-integration v5.8-rc2 I see
> > > >> >>below messages on db845:
> > > >> >>
> > > >> >>qcom-venus aa00000.video-codec: dev_pm_opp_set_rate: failed to find
> > > >> >>current OPP for freq 533000097 (-34)
> > > >> >>
> > > >> >>^^^ This one is new.
> > > >> >>
> > > >> >>qcom_rpmh TCS Busy, retrying RPMH message send: addr=0x30000
> > > >> >>
> > > >> >>^^^ and this message is annoying, can we make it pr_debug in rpmh?
> > > >> >
> > > >> How annoyingly often do you see this message?
> > > >> Usually, this is an indication of bad system state either on remote
> > > >> processors in the SoC or in Linux itself. On a smooth sailing build you
> > > >> should not see this 'warning'.
> > > >>
> > > >> >Would you be fine with moving this message to a pr_debug? It's currently
> > > >> >a pr_info_ratelimited()
> > > >> I would rather not, moving this out of sight will mask a lot of serious
> > > >> issues that otherwise bring attention to the developers.
> > > >>
> > > >
> > > >I removed this warning message in my patch posted to the list[1]. If
> > > >it's a serious problem then I suppose a timeout is more appropriate, on
> > > >the order of several seconds or so and then a pr_warn() and bail out of
> > > >the async call with an error.
> > > >
> > > The warning used to capture issues that happen within a second and it
> > > helps capture system related issues. Timing out after many seconds
> > > overlooks the system issues that generally tend to resolve themselves, but
> > > nevertheless need to be investigated.
> > > 
> > 
> > Is it correct to read "system related issues" as performance problems
> > where the thread is spinning forever trying to send a message and it
> > can't? So the problem is mostly that it's an unbounded amount of time
> > before the message is sent to rpmh and this printk helps identify those
> > situations where that is happening?
> > 
> Yes, but mostly a short period of time like when other processors are in
> the middle of a restart or resource state changes have taken unusual
> amounts of time. The system will generally recover from this without
> crashing in this case. User action is investigation of the situation
> leading to these messages.
> 

Given that these messages show up from time to time and are seemingly
harmless, users such as myself implement the action of ignoring these
printouts.

In the cases where I do see these messages it seems, as you say, to be
related to something happening in the firmware. So it's not something
that a user could typically investigate/debug anyway.

As such I do second Doug's request of not printing what looks like error
messages unless there is a persistent problem - but provide some means
for the few who would find them useful.
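
As a rough illustration of that idea (a userspace sketch, not the actual
rpmh code; the threshold value and every name below are made up for the
example), the retry path could stay quiet until the busy condition looks
persistent, with the per-retry message kept behind a debug switch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical: consecutive busy retries tolerated before warning. */
#define BUSY_WARN_THRESHOLD 3

enum busy_log { BUSY_LOG_NONE, BUSY_LOG_DEBUG, BUSY_LOG_WARN };

/* Decide how loudly to report the Nth consecutive busy retry. */
static enum busy_log busy_log_level(unsigned int retries, bool debug_enabled)
{
	if (retries >= BUSY_WARN_THRESHOLD)
		return BUSY_LOG_WARN;
	return debug_enabled ? BUSY_LOG_DEBUG : BUSY_LOG_NONE;
}

/*
 * Simulated send: the channel reports busy for the first 'busy_for'
 * attempts and then accepts the message.
 * Returns the number of warnings printed.
 */
static unsigned int send_with_retry(unsigned int busy_for, bool debug_enabled)
{
	unsigned int retries = 0, warnings = 0;

	while (retries < busy_for) {
		retries++;
		switch (busy_log_level(retries, debug_enabled)) {
		case BUSY_LOG_WARN:
			/* Warn only once, when the threshold is first crossed. */
			if (retries == BUSY_WARN_THRESHOLD) {
				fprintf(stderr, "warn: busy for %u retries\n",
					retries);
				warnings++;
			}
			break;
		case BUSY_LOG_DEBUG:
			fprintf(stderr, "debug: busy, retry %u\n", retries);
			break;
		case BUSY_LOG_NONE:
			break;
		}
	}
	return warnings;
}
```

With this shape a short busy period prints nothing by default, a long one
prints a single warning, and anyone who wants the per-retry detail can
turn the debug output on.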

Regards,
Bjorn

> > Otherwise as you say above it's a bad system state where the rpmh
> > processor has gotten into a bad state like a crash? Can we recover from
> > that? Or is the only recovery a reboot of the system? Does the rpmh
> > processor reboot the system if it crashes?
> We cannot recover from such a state. The remote processor will reboot if
> it detects a failure at its end. If the system entered a bad state, it
> is possible that RPMH requests start timing out in Linux and remote
> processor may not detect it. Hence, the timeout in rpmh_write() API. The
> advised course of action is a restart as there is no way to recover from
> this state.
> 
> --Lina
> 
>