On Sat, Jan 27, 2024 at 12:33 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
>
> On 2024-01-26 14:44, Alexey Charkov wrote:
> > On Fri, Jan 26, 2024 at 4:56 PM Daniel Lezcano
> > <daniel.lezcano@xxxxxxxxxx> wrote:
> >> On 26/01/2024 08:49, Dragan Simic wrote:
> >> > On 2024-01-26 08:30, Alexey Charkov wrote:
> >> >> On Fri, Jan 26, 2024 at 11:05 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
> >> >>> On 2024-01-26 07:44, Alexey Charkov wrote:
> >> >>> > On Fri, Jan 26, 2024 at 10:32 AM Dragan Simic <dsimic@xxxxxxxxxxx>
> >> >>> > wrote:
> >> >>> >> On 2024-01-25 10:30, Daniel Lezcano wrote:
> >> >>> >> > On 24/01/2024 21:30, Alexey Charkov wrote:
> >> >>> >> >> By default the CPUs on RK3588 start up in a conservative performance
> >> >>> >> >> mode. Add frequency and voltage mappings to the device tree to enable
> >>
> >> [ ... ]
> >>
> >> >> Throttling would also lower the voltage at some point, which cools it
> >> >> down much faster!
> >> >
> >> > Of course, but the key is not to cool (and slow down) the CPU cores too
> >> > much, but just enough to stay within the available thermal envelope,
> >> > which is where the same-voltage, lower-frequency OPPs should shine.
> >>
> >> That implies the resulting power is sustainable which I doubt it is
> >> the case.
> >>
> >> The voltage scaling makes the cooling effect efficient not the
> >> frequency.
> >>
> >> For example:
> >>         opp5 = opp(2GHz, 1V) => 2 BogoWatt
> >>         opp4 = opp(1.9GHz, 1V) => 1.9 BogoWatt
> >>         opp3 = opp(1.8GHz, 0.9V) => 1.458 BogoWatt
> >>         [ other states but we focus on these 3 ]
> >>
> >> opp5->opp4 => -5% compute capacity, -5% power, ratio=1
> >> opp4->opp3 => -5% compute capacity, -23.1% power, ratio=21,6
> >>
> >> opp5->opp3 => -10% compute capacity, -27.1% power, ratio=36.9
> >>
> >> In burst operation (no thermal throttling), opp4 is pointless we agree
> >> on that.
> >>
> >> IMO the following will happen: in burst operation with thermal
> >> throttling we hit the trip point and then the step wise governor
> >> reduces opp5 -> opp4. We have slight power reduction but the
> >> temperature does not decrease, so at the next iteration, it is
> >> throttle at opp3. And at the end we have opp4 <-> opp3 back and
> >> forth instead of opp5 <-> opp3.
> >>
> >> It is probable we end up with an equivalent frequency average (or
> >> compute capacity avg).
> >>
> >> opp4 <-> opp3 (longer duration in states, less transitions)
> >> opp5 <-> opp3 (shorter duration in states, more transitions)
> >>
> >> Some platforms had their higher OPPs with the same voltage and they
> >> failed to cool down the CPU in the long run.
> >>
> >> Anyway, there is only one way to check it out :)
> >>
> >> Alexey, is it possible to compare the compute duration for 'dhrystone'
> >> with these voltage OPP and without ? (with a period of cool down
> >> between the test in order to start at the same thermal condition) ?
> >
> > Sure, let me try that - would be interesting to see the results. In my
> > previous tinkering there were cases when the system stayed at 2.35GHz
> > for all big cores for non-trivial time (using the step-wise thermal
> > governor), and that's an example of "same voltage, lower frequency".
> > Other times though it throttled one cluster down to 1.8GHz and kept
> > the other at 2.4GHz, and was also stationary at those parameters for
> > extended time. This probably indicates that both of those states use
> > sustainable power in my cooling setup.
>
> IMHO, there are simply too many factors at play, including different
> possible cooling setups, so providing additional CPU throttling
> granularity can only be helpful. Of course, testing and recording
> data is the way to move forward, but I think we should use a few
> different tests.

Soooo, benchmarking these turned out a bit trickier than I had hoped for.
Apparently, dhrystone uses an unsigned int rather than an unsigned long
for the loop count (or something of that sort), which means that I
can't get it to run enough loops to heat up my chip from a stable idle
state to the throttling state (due to counter wraparound). So I ended
up with a couple of crutches, namely:

 - run dhrystone continuously on 6 out of 8 cores to make the chip warm
   enough (`taskset -c 0-5 ./dhrystone -t 6 -r 6000` - note that on my
   machine cores 6-7 are usually the first ones to get throttled, due
   to whatever thermal peculiarities)
 - wait for the temperature to stabilize (which happens at 79.5C)
 - then run timed dhrystone on the remaining 2 out of 8 cores (big
   ones) to see how throttling with different OPP tables affects
   overall performance.

In the end, here's what I got with the 'original' OPP table (including
"same voltage - different frequencies" states):

alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
duration: 0 seconds
number of threads: 2
number of loops: 4000000000000000
delay between starting threads: 0 seconds

Dhrystone(1.1) time for 1233977344 passes = 29.7
This machine benchmarks at 41481539 dhrystones/second
                           23609 DMIPS
Dhrystone(1.1) time for 1233977344 passes = 29.8
This machine benchmarks at 41476618 dhrystones/second
                           23606 DMIPS

Total dhrystone run time: 30.864492 seconds.

And here's what I got with the 'reduced' OPP table (keeping only the
highest frequency state for each voltage):

alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
duration: 0 seconds
number of threads: 2
number of loops: 4000000000000000
delay between starting threads: 0 seconds

Dhrystone(1.1) time for 1233977344 passes = 30.9
This machine benchmarks at 39968549 dhrystones/second
                           22748 DMIPS
Dhrystone(1.1) time for 1233977344 passes = 31.0
This machine benchmarks at 39817431 dhrystones/second
                           22662 DMIPS

Total dhrystone run time: 31.995136 seconds.
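
As a quick sanity check of the two runs above, here is a minimal Python
sketch that computes the relative difference from the reported figures.
All numbers are copied from the output above; the variable and function
names are made up for illustration. The last part rests on an
assumption that is not confirmed by the dhrystone source: that the
reported loop count is truncated to 32 bits and then split evenly
between the two threads.

```python
# Figures copied from the two dhrystone runs quoted above.
# Each list holds the per-thread "dhrystones/second" values of one run.
ORIGINAL_DPS = [41481539, 41476618]   # 'original' OPP table
REDUCED_DPS = [39968549, 39817431]    # 'reduced' OPP table
ORIGINAL_RUNTIME_S = 30.864492
REDUCED_RUNTIME_S = 31.995136


def relative_drop(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100.0


# Combined throughput of both threads, original vs. reduced table.
throughput_drop = relative_drop(sum(ORIGINAL_DPS), sum(REDUCED_DPS))

# Wall-clock time went up, so negate the "drop" to get an increase.
runtime_increase = -relative_drop(ORIGINAL_RUNTIME_S, REDUCED_RUNTIME_S)

# ASSUMPTION (not verified against the dhrystone source): the reported
# loop count wraps at 32 bits and is divided evenly between the threads.
REPORTED_LOOPS = 4000000000000000
THREADS = 2
passes_per_thread = (REPORTED_LOOPS % 2**32) // THREADS

print(f"throughput drop:   {throughput_drop:.2f}%")
print(f"runtime increase:  {runtime_increase:.2f}%")
print(f"passes per thread: {passes_per_thread}")
```

Under that assumption the computed per-thread pass count comes out as
1233977344, the same value dhrystone printed, which is consistent with
the 32-bit wraparound guess. The throughput and runtime figures differ
slightly from each other because the total run time also includes
whatever dhrystone does outside the timed loops.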
Bottom line: removing the lower-frequency OPPs led to a 3.8% drop in
performance in this setup. This is probably far from a reliable
estimate, but I guess it indeed indicates that having lower-frequency
states might be beneficial in some load scenarios.

Note though that several seconds after hitting the throttling
threshold, cores 6-7 were oscillating between 1.608GHz and 1.8GHz in
both runs, which suggests that the whole difference in performance was
due to the different speed of initial throttling (i.e. it might be a
peculiarity of the step-wise thermal governor operation when it has to
go through more cooling states to reach the "steady-state" one). Given
that neither 1.608GHz nor 1.8GHz has lower-frequency same-voltage
siblings in either of the OPP tables, under prolonged constant load
there should be no performance difference at all.

Best regards,
Alexey