Re: [PATCH 0/2] arm64: dts: qcom: sm8650: rework CPU & GPU thermal zones

Neil Armstrong <neil.armstrong@xxxxxxxxxx> · Fri, 3 Jan 2025 15:49:39 +0100

On 03/01/2025 15:43, Konrad Dybcio wrote:
On 3.01.2025 3:38 PM, Neil Armstrong wrote:
On the SM8650 platform, the dynamic clock and voltage scaling (DCVS) for
the CPUs and GPU is handled by hardware & firmware using factory and
form-factor determined parameters in order to maximize frequency while
keeping the temperature way below the junction temperature where the SoC
would experience a thermal shutdown if not permanent damages.

On the other side, the High Level Ooperating System (HLOS), like Linux,
is able to adjust the CPU and GPU frequency using the internal SoC
temperature sensors (here tsens) and it's UP/LOW interrupts, but it
effectly does the same work twice in an less effective manner.

Let's take the Hardware & Firmware action in account and design the
thermal zones trip points and cooling devices mapping to use the HLOS
as a safety warant in case the platform experiences a temperature surge
to helpfully avoid a thermal shutdown and handle the scenario gracefully.

On the CPU side, the LMh hardware does the DCVS control loop, so
let's set higher trip points temperatures closer to the junction
and thermal shutdown temperatures and add some idle injection cooling
device with 100% duty cycle for each CPU that would act as emergency
action to avoid the thermal shutdown.

On the GPU side, the GPU Management Unit (GMU) acts as the DCVS
control loop, but since we can't perform idle injection, let's
also set higher trip points temperatures closer to the junction
and thermal shutdown temperatures to reduce the GPU frequency only
as an emergency action before the thermal shutdown.

Those 2 changes optimizes the thermal management design by avoiding
concurrent thermal management, calculations & avoidable interrupts
by moving the HLOS management to a last resort emergency if the
Hardware & Firmwares fails to avoid a thermal shutdown.

Signed-off-by: Neil Armstrong <neil.armstrong@xxxxxxxxxx>
---

Got any numbers to back this?

To back which part ? Yes I've been running loads with difference
scenarios and effectively the hardware work is much better with
a more linear correction and slighly better performances because
it sets slighly higger OPPs while maintaining the core closer to
the target temperature range. Which is kind of expected.

I don't have easy numbers to share, sorry...

So yes I consider avoiding the concurrent effort is better, but
since we also take the firmware design in account in the whole platform
representation in DT (DSPs, SCM, GMU, ...) we should also extend this
to thermal.

Neil

Konrad