Re: [PATCH v9 09/17] arm: tegra20: cpuidle: Handle case where secondary CPU hangs on entering LP2

Dmitry Osipenko <digetx@xxxxxxxxx> · Fri, 21 Feb 2020 21:19:59 +0300

21.02.2020 20:36, Daniel Lezcano пишет:
> On Fri, Feb 21, 2020 at 07:56:51PM +0300, Dmitry Osipenko wrote:
>> Hello Daniel,
>>
>> 21.02.2020 18:43, Daniel Lezcano пишет:
>>> On Thu, Feb 13, 2020 at 02:51:26AM +0300, Dmitry Osipenko wrote:
>>>> It is possible that something may go wrong with the secondary CPU, in that
>>>> case it is much nicer to get a dump of the flow-controller state before
>>>> hanging machine.
>>>>
>>>> Acked-by: Peter De Schrijver <pdeschrijver@xxxxxxxxxx>
>>>> Tested-by: Peter Geis <pgwipeout@xxxxxxxxx>
>>>> Tested-by: Jasper Korten <jja2000@xxxxxxxxx>
>>>> Tested-by: David Heidelberg <david@xxxxxxx>
>>>> Signed-off-by: Dmitry Osipenko <digetx@xxxxxxxxx>
>>>> ---
> 
> [ ... ]
> 
>>>> +static int tegra20_wait_for_secondary_cpu_parking(void)
>>>> +{
>>>> +	unsigned int retries = 3;
>>>> +
>>>> +	while (retries--) {
>>>> +		ktime_t timeout = ktime_add_ms(ktime_get(), 500);
>>>
>>> Oops I missed this one. Do not use ktime_get() in this code path, use jiffies.
>>
>> Could you please explain what benefits jiffies have over the ktime_get()?
> 
> ktime_get() is very slow, jiffies is updated every tick.

But how jiffies are supposed to be updated if interrupts are disabled?

Aren't jiffies actually slower than ktime_get() because jiffies are
updating every 10/1ms (depending on CONFIG_HZ)?

We're kinda interesting here in getting into deep-idling state as quick
as possible. I was checking how much time takes the busy-loop below and
it takes ~40-150us in average, which is good enough.

>>>> +
>>>> +		/*
>>>> +		 * The primary CPU0 core shall wait for the secondaries
>>>> +		 * shutdown in order to power-off CPU's cluster safely.
>>>> +		 * The timeout value depends on the current CPU frequency,
>>>> +		 * it takes about 40-150us  in average and over 1000us in
>>>> +		 * a worst case scenario.
>>>> +		 */
>>>> +		do {
>>>> +			if (tegra_cpu_rail_off_ready())
>>>> +				return 0;
>>>> +
>>>> +		} while (ktime_before(ktime_get(), timeout));
>>>
>>> So this loop will aggresively call tegra_cpu_rail_off_ready() and retry 3
>>> times. The tegra_cpu_rail_off_ready() function can be called thoushand of times
>>> here but the function will hang 1.5s :/
>>>
>>> I suggest something like:
>>>
>>> 	while (retries--i && !tegra_cpu_rail_off_ready()) 
>>> 		udelay(100);
>>>
>>> So <retries> calls to tegra_cpu_rail_off_ready() and 100us x <retries> maximum
>>> impact.
>> But udelay() also results into CPU spinning in a busy-loop, and thus,
>> what's the difference?
> 
> busy looping instead of register reads with all the hardware things involved behind.

Please notice that this code runs only on an older Cortex-A9/A15, which
doesn't support WFE for the delaying, and thus, CPU always busy-loops
inside udelay().

What about if I'll add cpu_relax() to the loop? Do you think it it could
have any positive effect?