On 3.01.2025 3:49 PM, Neil Armstrong wrote:
> On 03/01/2025 15:43, Konrad Dybcio wrote:
>> On 3.01.2025 3:38 PM, Neil Armstrong wrote:
>>> On the SM8650 platform, the dynamic clock and voltage scaling (DCVS) for
>>> the CPUs and GPU is handled by hardware & firmware using factory and
>>> form-factor determined parameters in order to maximize frequency while
>>> keeping the temperature way below the junction temperature where the SoC
>>> would experience a thermal shutdown if not permanent damage.
>>>
>>> On the other side, the High Level Operating System (HLOS), like Linux,
>>> is able to adjust the CPU and GPU frequency using the internal SoC
>>> temperature sensors (here tsens) and its UP/LOW interrupts, but it
>>> effectively does the same work twice in a less effective manner.
>>>
>>> Let's take the Hardware & Firmware action into account and design the
>>> thermal zones trip points and cooling devices mapping to use the HLOS
>>> as a safety warrant in case the platform experiences a temperature surge
>>> to help avoid a thermal shutdown and handle the scenario gracefully.
>>>
>>> On the CPU side, the LMh hardware does the DCVS control loop, so
>>> let's set higher trip point temperatures closer to the junction
>>> and thermal shutdown temperatures and add some idle injection cooling
>>> device with 100% duty cycle for each CPU that would act as emergency
>>> action to avoid the thermal shutdown.
>>>
>>> On the GPU side, the GPU Management Unit (GMU) acts as the DCVS
>>> control loop, but since we can't perform idle injection, let's
>>> also set higher trip point temperatures closer to the junction
>>> and thermal shutdown temperatures to reduce the GPU frequency only
>>> as an emergency action before the thermal shutdown.

We could probably work out some mechanism for drm to say "gpu is too hot /
too busy" and stall the userspace's requests, if that doesn't exist
already (+RobC)

>>>
>>> Those 2 changes optimize the thermal management design by avoiding
>>> concurrent thermal management, calculations & avoidable interrupts
>>> by moving the HLOS management to a last resort emergency if the
>>> Hardware & Firmware fail to avoid a thermal shutdown.
>>>
>>> Signed-off-by: Neil Armstrong <neil.armstrong@xxxxxxxxxx>
>>> ---
>>
>> Got any numbers to back this?
>
> To back which part? Yes, I've been running loads with different
> scenarios and effectively the hardware work is much better, with
> a more linear correction and slightly better performance because
> it sets slightly higher OPPs while maintaining the core closer to
> the target temperature range. Which is kind of expected.
>
> I don't have easy numbers to share, sorry...

Ok, what you said above sounds good already.

Konrad