Re: [PATCH 4/4] arm64: dts: rockchip: Add OPP data for CPU cores on RK3588

Hi Alexey,

On 27/01/2024 20:41, Alexey Charkov wrote:
On Sat, Jan 27, 2024 at 12:33 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:

On 2024-01-26 14:44, Alexey Charkov wrote:
On Fri, Jan 26, 2024 at 4:56 PM Daniel Lezcano
<daniel.lezcano@xxxxxxxxxx> wrote:
On 26/01/2024 08:49, Dragan Simic wrote:
On 2024-01-26 08:30, Alexey Charkov wrote:
On Fri, Jan 26, 2024 at 11:05 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
On 2024-01-26 07:44, Alexey Charkov wrote:
On Fri, Jan 26, 2024 at 10:32 AM Dragan Simic <dsimic@xxxxxxxxxxx>
wrote:
On 2024-01-25 10:30, Daniel Lezcano wrote:
On 24/01/2024 21:30, Alexey Charkov wrote:
By default the CPUs on RK3588 start up in a conservative performance
mode. Add frequency and voltage mappings to the device tree to enable

[ ... ]

Throttling would also lower the voltage at some point, which cools it
down much faster!

Of course, but the key is not to cool (and slow down) the CPU cores too
much, but just enough to stay within the available thermal envelope,
which is where the same-voltage, lower-frequency OPPs should shine.

That implies the resulting power is sustainable, which I doubt is the
case.

It is the voltage scaling that makes the cooling effect efficient, not
the frequency.

For example:
         opp5 = opp(2GHz, 1V) => 2 BogoWatt
         opp4 = opp(1.9GHz, 1V) => 1.9 BogoWatt
         opp3 = opp(1.8GHz, 0.9V) => 1.458 BogoWatt
         [ other states but we focus on these 3 ]

opp5->opp4 => -5% compute capacity, -5% power, ratio=1
opp4->opp3 => -5% compute capacity, -23.1% power, ratio=0.216

opp5->opp3 => -10% compute capacity, -27.1% power, ratio=0.369

(ratio = compute capacity lost / power saved)
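
The BogoWatt figures in this example follow from power ~ frequency * voltage^2. Here is a minimal sketch that recomputes them, reading the ratio as capacity lost divided by power saved (an assumption about how the ratios were derived). The names `opps`, `bogowatt` and `transition` are made up for illustration. The exact values differ slightly from the rounded percentages in the example.

```python
# Toy model from the example: power ~ frequency * voltage^2 ("BogoWatt").
# Values are (frequency in GHz, voltage in V), taken from the example.
opps = {
    "opp5": (2.0, 1.0),
    "opp4": (1.9, 1.0),
    "opp3": (1.8, 0.9),
}


def bogowatt(freq_ghz, volt):
    """Relative dynamic power for one OPP."""
    return freq_ghz * volt ** 2


def transition(src, dst):
    """Return (capacity lost, power saved, ratio) as fractions."""
    f0, v0 = opps[src]
    f1, v1 = opps[dst]
    p0, p1 = bogowatt(f0, v0), bogowatt(f1, v1)
    capacity_lost = (f0 - f1) / f0
    power_saved = (p0 - p1) / p0
    # Assumed definition: a lower ratio means less capacity is given up
    # per unit of power saved.
    return capacity_lost, power_saved, capacity_lost / power_saved


for src, dst in [("opp5", "opp4"), ("opp4", "opp3"), ("opp5", "opp3")]:
    cap, power, ratio = transition(src, dst)
    print(f"{src}->{dst}: -{cap:.1%} capacity, -{power:.1%} power, "
          f"ratio={ratio:.3f}")
```

Under this model only the step that also lowers the voltage saves more power than it costs in compute capacity.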

In burst operation (no thermal throttling), opp4 is pointless; we agree
on that.

IMO the following will happen: in burst operation with thermal
throttling we hit the trip point and then the step-wise governor
reduces opp5 -> opp4. We get a slight power reduction but the
temperature does not decrease, so at the next iteration it throttles to
opp3. In the end we have opp4 <-> opp3 back and forth instead of
opp5 <-> opp3.

It is probable we end up with an equivalent frequency average (or
compute capacity avg).

opp4 <-> opp3 (longer duration in states, fewer transitions)
opp5 <-> opp3 (shorter duration in states, more transitions)

Some platforms had their higher OPPs with the same voltage and they
failed to cool down the CPU in the long run.

Anyway, there is only one way to check it out :)

Alexey, is it possible to compare the compute duration for 'dhrystone'
with and without these same-voltage OPPs (with a cool-down period
between the tests in order to start from the same thermal condition)?

Sure, let me try that - would be interesting to see the results. In my
previous tinkering there were cases when the system stayed at 2.35GHz
for all big cores for non-trivial time (using the step-wise thermal
governor), and that's an example of "same voltage, lower frequency".
Other times though it throttled one cluster down to 1.8GHz and kept
the other at 2.4GHz, and was also stationary at those parameters for
extended time. This probably indicates that both of those states use
sustainable power in my cooling setup.

IMHO, there are simply too many factors at play, including different
possible cooling setups, so providing additional CPU throttling
granularity can only be helpful.  Of course, testing and recording
data is the way to move forward, but I think we should use a few
different tests.

Soooo, benchmarking these turned out a bit trickier than I had hoped
for. Apparently, dhrystone uses an unsigned int rather than an
unsigned long for the loops count (or something of that sort), which
means that I can't get it to run enough loops to heat up my chip from
a stable idle state to the throttling state (due to counter
wraparound). So I ended up with a couple of crutches, namely:
  - run dhrystone continuously on 6 out of 8 cores to make the chip
warm enough (`taskset -c 0-5 ./dhrystone -t 6 -r 6000` - note that on
my machine cores 6-7 are usually the first ones to get throttled, due
to whatever thermal peculiarities)
  - wait for the temperature to stabilize (which happens at 79.5C)
  - then run timed dhrystone on the remaining 2 out of 8 cores (big
ones) to see how throttling with different OPP tables affects overall
performance.

Thanks for taking the time to test.

In the end, here's what I got with the 'original' OPP table (including
"same voltage - different frequencies" states):
alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
duration: 0 seconds
number of threads: 2
number of loops: 4000000000000000
delay between starting threads: 0 seconds

Dhrystone(1.1) time for 1233977344 passes = 29.7
This machine benchmarks at 41481539 dhrystones/second
                            23609 DMIPS
Dhrystone(1.1) time for 1233977344 passes = 29.8
This machine benchmarks at 41476618 dhrystones/second
                            23606 DMIPS

Total dhrystone run time: 30.864492 seconds.
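
The pass count in the output above is consistent with the counter wraparound mentioned earlier. This is a minimal sketch, assuming the printed loop count is truncated to a 32-bit unsigned integer and then split evenly between the two threads. Both assumptions are inferred from the output and were not checked against the dhrystone source.

```python
# Assumption: the requested loop count ends up in a 32-bit unsigned int,
# and the wrapped value is divided evenly between the threads.
requested_loops = 4_000_000_000_000_000  # "number of loops" as printed
threads = 2

wrapped = requested_loops % 2**32        # value left after 32-bit truncation
passes_per_thread = wrapped // threads

print(wrapped, passes_per_thread)
```

This gives 1233977344 passes per thread, which matches the "passes" figure in the log.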

And here's what I got with the 'reduced' OPP table (keeping only the
highest frequency state for each voltage):
alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
duration: 0 seconds
number of threads: 2
number of loops: 4000000000000000
delay between starting threads: 0 seconds

Dhrystone(1.1) time for 1233977344 passes = 30.9
This machine benchmarks at 39968549 dhrystones/second
                           22748 DMIPS
Dhrystone(1.1) time for 1233977344 passes = 31.0
This machine benchmarks at 39817431 dhrystones/second
                           22662 DMIPS

Total dhrystone run time: 31.995136 seconds.

Bottom line: removing the lower-frequency OPPs led to a 3.8% drop in
performance in this setup. This is probably far from a reliable
estimate, but I guess it indeed indicates that having lower-frequency
states might be beneficial in some load scenarios.
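
For reference, the 3.8% figure can be reproduced from the dhrystones/second values in the two logs. This small sketch uses only numbers printed above. Comparing throughput averaged across the two threads is an assumption about how the figure was obtained.

```python
# dhrystones/second per thread, copied from the two runs above
original = [41481539, 41476618]   # full OPP table
reduced = [39968549, 39817431]    # only the highest frequency per voltage

# Total run times in seconds, also from the logs
time_original = 30.864492
time_reduced = 31.995136


def mean(values):
    return sum(values) / len(values)


throughput_drop = 1 - mean(reduced) / mean(original)
runtime_increase = time_reduced / time_original - 1

print(f"throughput drop:  {throughput_drop:.1%}")
print(f"runtime increase: {runtime_increase:.1%}")
```

The total run time gives a slightly smaller difference (about 3.7%), so the two measures roughly agree.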

What is the duration between these two tests?

I would be curious whether it is repeatable when the order is inverted (reduced OPP table first, then the original OPP table).

BTW: I used -l 10000 for a roughly 30-second workload on the rk3399; maybe -l 20000 will be OK for the rk3588.

--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
