Re: [PATCH net-next 3/3] net: phy: realtek: add hwmon support for temp sensor on RTL822x

Heiner Kallweit <hkallweit1@xxxxxxxxx> · Sat, 11 Jan 2025 18:32:35 +0100

On 11.01.2025 18:00, Andrew Lunn wrote:
>> According to Guenter's feedback the alarm attribute must not be written
>> and is expected to be self-clearing on read.
>> If we cleared the alarm in the chip on alarm attribute read, then
>> we could have the following ugly scenario:
>>
>> 1. Temperature threshold is exceeded and chip reduces speed to 1Gbps
>> 2. Temperature is falling below alarm threshold
>> 3. User uses "sensors" to check the current temperature
>> 4. The implicit alarm attribute read causes the chip to clear the
>>    alarm and re-enable 2.5Gbps speed, resulting in the temperature
>>    alarm threshold being exceeded very soon again.
>>
>> What isn't nice here is that it's not transparent to the user that
>> a read-only command from his perspective causes the protective measure
>> of the chip to be cancelled.
>>
>> There's no existing hwmon attribute meant to be used by the user
>> to clear a hw alarm once he took measures to protect the chip
>> from overheating.
> 
> It is generally not the kernel's job to implement policy. User space
> should be doing that.
> 
> I see two different possible policies, and there are maybe others:
> 
> 1) The user is happy with one second outages every so often as the
> chip cycles between too hot and down shifting, and cool enough to
> upshift back to the higher speeds.
> 
> 2) The user prefers to have reliable, slower connectivity and needs to
> explicitly do something like down/up the interface to get it back to
> the higher speed.
> 
This seems to be exactly how I do it currently.

> I personally would say, from a user support view, 2) is better. A one
> time 1 second break in connectivity and a kernel message is going to
> cause fewer issues.
> 
> Maybe the solution is that the hwmon alarm attribute is not directly
> the hardware bit, but a software interpretation of the system state.
> When the alarm fires, copy it into a software alarm state, but leave
> the hardware alarm alone. A hwmon read clears the software state, but
> leaves the hardware alone. A down/up of the interface will then clear
> both the software and hardware alarm state.
> 
Not clearing the alarm on read is better from a user perspective IMO
(at least for this specific PHY).
As long as the alarm is active, the chip forces a downshift. 
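
For reference, the software-alarm scheme quoted above can be sketched as a
small state model. This is only an illustration in plain C, not driver code:
struct therm_model and all function names are made up here, and no real PHY
registers or hwmon callbacks are involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names exist in the realtek driver. */
struct therm_model {
	bool hw_alarm;	/* latched in the chip; forces the downshift while set */
	bool sw_alarm;	/* what the hwmon alarm attribute would report */
};

/* Over-temp event: latch the hardware bit and copy it into software state. */
void therm_model_overtemp(struct therm_model *m)
{
	assert(m != NULL);
	m->hw_alarm = true;
	m->sw_alarm = true;
}

/* hwmon alarm read: report and clear the software state only. */
bool therm_model_read_alarm(struct therm_model *m)
{
	bool alarm;

	assert(m != NULL);
	alarm = m->sw_alarm;
	m->sw_alarm = false;
	return alarm;
}

/* Interface down/up: the explicit user action that clears both states. */
void therm_model_ifdown_up(struct therm_model *m)
{
	assert(m != NULL);
	m->hw_alarm = false;
	m->sw_alarm = false;
}

/* The chip stays downshifted as long as the hardware alarm is latched. */
bool therm_model_downshifted(const struct therm_model *m)
{
	assert(m != NULL);
	return m->hw_alarm;
}
```

In this model a "sensors" run sees the alarm once and clears only the
software copy, so the downshift stays in place until the interface is
brought down and up again.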

> Anybody wanting policy 1) would then need a daemon polling the state
> and taking action. 2) would be the default.
> 
> How easy is it for you to get into the alarm state? Did you need an
> environment chamber/oven, or is it happening for you with just lots of
> continuous traffic at typical room temperature? Are we talking about
> cheap USB dongles in a sealed plastic case with poor thermal design
> that are going to be doing this all the time?
> 
I have an M.2 card with RTL8126 (w/o heat sink) and an external RJ45 port.
This card sits in a slot underneath the mainboard of a mini PC. At 2.5Gbps
it makes a big difference whether EEE is active. With EEE it reaches 54°C,
w/o EEE temperature quickly goes over 70°C. For tests I add a PHY write
to the code which sets the over-temp threshold to 60°C. Then I can
easily trigger overheating by disabling EEE.

On my system the over-temp threshold set by the BIOS (?) is 120°C.
Even w/o heat sink I can hardly imagine that this threshold is ever
reached.

> 	Andrew

Heiner