On Sun, Jan 28, 2024 at 7:06 PM Daniel Lezcano <daniel.lezcano@xxxxxxxxxx> wrote:
>
>
> Hi Alexey,

Hi Daniel,

> On 27/01/2024 20:41, Alexey Charkov wrote:
> > On Sat, Jan 27, 2024 at 12:33 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
> >>
> >> On 2024-01-26 14:44, Alexey Charkov wrote:
> >>> On Fri, Jan 26, 2024 at 4:56 PM Daniel Lezcano
> >>> <daniel.lezcano@xxxxxxxxxx> wrote:
> >>>> On 26/01/2024 08:49, Dragan Simic wrote:
> >>>>> On 2024-01-26 08:30, Alexey Charkov wrote:
> >>>>>> On Fri, Jan 26, 2024 at 11:05 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
> >>>>>>> On 2024-01-26 07:44, Alexey Charkov wrote:
> >>>>>>>> On Fri, Jan 26, 2024 at 10:32 AM Dragan Simic <dsimic@xxxxxxxxxxx>
> >>>>>>>> wrote:
> >>>>>>>>> On 2024-01-25 10:30, Daniel Lezcano wrote:
> >>>>>>>>>> On 24/01/2024 21:30, Alexey Charkov wrote:
> >>>>>>>>>>> By default the CPUs on RK3588 start up in a conservative performance
> >>>>>>>>>>> mode. Add frequency and voltage mappings to the device tree to enable
> >>>>
> >>>> [ ... ]
> >>>>
> >>>>>> Throttling would also lower the voltage at some point, which cools it
> >>>>>> down much faster!
> >>>>>
> >>>>> Of course, but the key is not to cool (and slow down) the CPU cores too
> >>>>> much, but just enough to stay within the available thermal envelope,
> >>>>> which is where the same-voltage, lower-frequency OPPs should shine.
> >>>>
> >>>> That implies the resulting power is sustainable which I doubt it is
> >>>> the case.
> >>>>
> >>>> The voltage scaling makes the cooling effect efficient not the
> >>>> frequency.
> >>>>
> >>>> For example:
> >>>> opp5 = opp(2GHz, 1V) => 2 BogoWatt
> >>>> opp4 = opp(1.9GHz, 1V) => 1.9 BogoWatt
> >>>> opp3 = opp(1.8GHz, 0.9V) => 1.458 BogoWatt
> >>>> [ other states but we focus on these 3 ]
> >>>>
> >>>> opp5->opp4 => -5% compute capacity, -5% power, ratio=1
> >>>> opp4->opp3 => -5% compute capacity, -23.1% power, ratio=21,6
> >>>>
> >>>> opp5->opp3 => -10% compute capacity, -27.1% power, ratio=36.9
> >>>>
> >>>> In burst operation (no thermal throttling), opp4 is pointless we agree
> >>>> on that.
> >>>>
> >>>> IMO the following will happen: in burst operation with thermal
> >>>> throttling we hit the trip point and then the step wise governor
> >>>> reduces opp5 -> opp4. We have slight power reduction but the
> >>>> temperature does not decrease, so at the next iteration, it is
> >>>> throttle at opp3. And at the end we have opp4 <-> opp3 back and forth
> >>>> instead of opp5 <-> opp3.
> >>>>
> >>>> It is probable we end up with an equivalent frequency average (or
> >>>> compute capacity avg).
> >>>>
> >>>> opp4 <-> opp3 (longer duration in states, less transitions)
> >>>> opp5 <-> opp3 (shorter duration in states, more transitions)
> >>>>
> >>>> Some platforms had their higher OPPs with the same voltage and they
> >>>> failed to cool down the CPU in the long run.
> >>>>
> >>>> Anyway, there is only one way to check it out :)
> >>>>
> >>>> Alexey, is it possible to compare the compute duration for 'dhrystone'
> >>>> with these voltage OPP and without ? (with a period of cool down
> >>>> between the test in order to start at the same thermal condition) ?
> >>>
> >>> Sure, let me try that - would be interesting to see the results. In my
> >>> previous tinkering there were cases when the system stayed at 2.35GHz
> >>> for all big cores for non-trivial time (using the step-wise thermal
> >>> governor), and that's an example of "same voltage, lower frequency".
> >>> Other times though it throttled one cluster down to 1.8GHz and kept
> >>> the other at 2.4GHz, and was also stationary at those parameters for
> >>> extended time. This probably indicates that both of those states use
> >>> sustainable power in my cooling setup.
> >>
> >> IMHO, there are simply too many factors at play, including different
> >> possible cooling setups, so providing additional CPU throttling
> >> granularity can only be helpful. Of course, testing and recording
> >> data is the way to move forward, but I think we should use a few
> >> different tests.
> >
> > Soooo, benchmarking these turned out a bit trickier than I had hoped
> > for. Apparently, dhrystone uses an unsigned int rather than an
> > unsigned long for the loops count (or something of that sort), which
> > means that I can't get it to run enough loops to heat up my chip from
> > a stable idle state to the throttling state (due to counter
> > wraparound). So I ended up with a couple of crutches, namely:
> > - run dhrystone continuously on 6 out of 8 cores to make the chip
> >   warm enough (`taskset -c 0-5 ./dhrystone -t 6 -r 6000` - note that on
> >   my machine cores 6-7 are usually the first ones to get throttled, due
> >   to whatever thermal peculiarities)
> > - wait for the temperature to stabilize (which happens at 79.5C)
> > - then run timed dhrystone on the remaining 2 out of 8 cores (big
> >   ones) to see how throttling with different OPP tables affects overall
> >   performance.
>
> Thanks for taking the time to test.
>
> > In the end, here's what I got with the 'original' OPP table (including
> > "same voltage - different frequencies" states):
> > alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
> > duration: 0 seconds
> > number of threads: 2
> > number of loops: 4000000000000000
> > delay between starting threads: 0 seconds
> >
> > Dhrystone(1.1) time for 1233977344 passes = 29.7
> > This machine benchmarks at 41481539 dhrystones/second
> > 23609 DMIPS
> > Dhrystone(1.1) time for 1233977344 passes = 29.8
> > This machine benchmarks at 41476618 dhrystones/second
> > 23606 DMIPS
> >
> > Total dhrystone run time: 30.864492 seconds.
> >
> > And here's what I got with the 'reduced' OPP table (keeping only the
> > highest frequency state for each voltage):
> > alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
> > duration: 0 seconds
> > number of threads: 2
> > number of loops: 4000000000000000
> > delay between starting threads: 0 seconds
> >
> > Dhrystone(1.1) time for 1233977344 passes = 30.9
> > This machine benchmarks at 39968549 dhrystones/second
> > 22748 DMIPS
> > Dhrystone(1.1) time for 1233977344 passes = 31.0
> > This machine benchmarks at 39817431 dhrystones/second
> > 22662 DMIPS
> >
> > Total dhrystone run time: 31.995136 seconds.
> >
> > Bottomline: removing the lower-frequency OPPs led to a 3.8% drop in
> > performance in this setup. This is probably far from a reliable
> > estimate, but I guess it indeed indicates that having lower-frequency
> > states might be beneficial in some load scenarios.
>
> What is the duration between these two tests?

Several hours and a couple of reboots. I did the first one, recorded
the results and the temperatures, then rebuilt the dtb the next day,
rebooted with it and did everything again with the other OPP table.

> I would be curious if it is repeatable by inverting the setup (reduced
> OPP table and then original OPP table).

Frankly, I can't see how ordering could have mattered, given that I
let the system cool down completely, and also rebooted it to use a
different dtb, so there shouldn't have been any caching effects.

Maybe there is some outside randomness in the results though - perhaps
5-10 repetitions in each case would have been more statistically
meaningful. But then again, to make it statistically meaningful I'd
have to peg the other (non-benchmarked) cores to a static OPP to ensure
the thermal governor doesn't play with them when not asked to - and it
all starts to sound like a rabbit hole :)

> BTW: I used -l 10000 for a ~30 seconds workload more or less on the
> rk3399, may be -l 20000 will be ok for the rk3588.

-l 20000 with two threads also gives me about 30 seconds of runtime...
While -l 200000 completed in 25 seconds *facepalm*

Best regards,
Alexey
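
The pass counts and runtimes above look consistent with a 32-bit counter wraparound. The sketch below is only a guess at the mechanism, not something taken from the dhrystone source: it assumes the `-l` argument is multiplied by 10^6 (because `-l 4000000000` was reported as 4000000000000000 loops), that the product is truncated to an unsigned 32-bit value, and that the result is split evenly across threads. The names `effective_passes` and `scale` are hypothetical.

```python
UINT32_MODULUS = 2**32  # assumption: the loop counter is a 32-bit unsigned int


def effective_passes(loops_arg, threads, scale=1_000_000):
    """Passes per thread if the requested loop total wraps at 32 bits.

    `scale` is an assumption inferred from the benchmark output
    (-l 4000000000 was printed as 4000000000000000 loops).
    """
    total = (loops_arg * scale) % UINT32_MODULUS  # emulate the wraparound
    return total // threads                       # assume an even split


if __name__ == "__main__":
    # The three -l values mentioned in this thread, each run with 2 threads.
    for loops_arg in (4_000_000_000, 20_000, 200_000):
        print(f"-l {loops_arg}: {effective_passes(loops_arg, threads=2)} passes per thread")
```

Under these assumptions, `-l 4000000000` on two threads works out to 1233977344 passes per thread, which is the figure dhrystone printed in both runs, and `-l 200000` wraps to fewer passes than `-l 20000`, which would explain why the larger argument finished sooner.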