On 11.01.2025 18:00, Andrew Lunn wrote:
>> According to Guenters feedback the alarm attribute must not be written
>> and is expected to be self-clearing on read.
>> If we would clear the alarm in the chip on alarm attribute read, then
>> we can have the following ugly scenario:
>>
>> 1. Temperature threshold is exceeded and chip reduces speed to 1Gbps
>> 2. Temperature is falling below alarm threshold
>> 3. User uses "sensors" to check the current temperature
>> 4. The implicit alarm attribute read causes the chip to clear the
>>    alarm and re-enable 2.5Gbps speed, resulting in the temperature
>>    alarm threshold being exceeded very soon again.
>>
>> What isn't nice here is that it's not transparent to the user that
>> a read-only command from his perspective causes the protective measure
>> of the chip to be cancelled.
>>
>> There's no existing hwmon attribute meant to be used by the user
>> to clear a hw alarm once he took measures to protect the chip
>> from overheating.
>
> It is generally not the kernels job to implement policy. User space
> should be doing that.
>
> I see two different possible policies, and there are maybe others:
>
> 1) The user is happy with one second outages every so often as the
> chip cycles between too hot and down shifting, and cool enough to
> upshift back to the higher speeds.
>
> 2) The user prefers to have reliable, slower connectivity and needs to
> explicitly do something like down/up the interface to get it back to
> the higher speed.
>
This seems to be exactly how I do it currently.

> I personally would say, from a user support view, 2) is better. A one
> time 1 second break in connectivity and a kernel message is going to
> cause less issues.
>
> Maybe the solution is that the hwmon alarm attribute is not directly
> the hardware bit, but a software interpretation of the system state.
> When the alarm fires, copy it into a software alarm state, but leave
> the hardware alarm alone. A hwmon read clears the software state, but
> leaves the hardware alone. A down/up of the interface will then clear
> both the software and hardware alarm state.
>
Not clearing the alarm on read is better from a user perspective IMO
(at least for this specific PHY). As long as the alarm is active,
the chip forces a downshift.

> Anybody wanting policy 1) would then need a daemon polling the state
> and taking action. 2) would be the default.
>
> How easy is it for you to get into the alarm state? Did you need an
> environment chamber/oven, or is it happening for you with just lots of
> continuous traffic at typical room temperature? Are we talking about
> cheap USB dangles in a sealed plastic case with poor thermal design
> are going to be doing this all the time?
>
I have a M.2 card with RTL8126 (w/o heat sink) and an external RJ45 port.
This card sits in a slot underneath the mainboard of a mini PC.
At 2.5Gbps it makes a big difference whether EEE is active.
With EEE it reaches 54°C, w/o EEE temperature quickly goes over 70°C.
For tests I add a PHY write to the code which sets the over-temp threshold
to 60°C. Then I can easily trigger overheating by disabling EEE.

On my system the over-temp threshold set by the BIOS (?) is 120°C.
Even w/o heat sink I can hardly imagine that this threshold is ever reached.

> 	Andrew

Heiner
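
For reference, the split between a software and a hardware alarm state that Andrew proposes above can be modelled in a few lines. This is only a user-space sketch to illustrate the state transitions, not driver code: the class name, method names and fields are all hypothetical, and it assumes the hardware alarm bit alone decides whether the chip stays downshifted.

```python
from dataclasses import dataclass


@dataclass
class ThermalAlarmModel:
    """Hypothetical model of the proposed software/hardware alarm split."""

    hw_alarm: bool = False  # assumed: chip forces the downshift while this is set
    sw_alarm: bool = False  # software copy that the hwmon alarm attribute reports

    def over_temp_event(self) -> None:
        # Threshold exceeded: hardware latches the alarm, software copies it.
        self.hw_alarm = True
        self.sw_alarm = True

    def read_hwmon_alarm(self) -> bool:
        # Self-clearing on read, but only the software copy is cleared.
        value = self.sw_alarm
        self.sw_alarm = False
        return value

    def interface_down_up(self) -> None:
        # Explicit user action clears both states and lifts the downshift.
        self.hw_alarm = False
        self.sw_alarm = False

    @property
    def downshifted(self) -> bool:
        return self.hw_alarm


model = ThermalAlarmModel()
model.over_temp_event()
first_read = model.read_hwmon_alarm()    # alarm is reported once
second_read = model.read_hwmon_alarm()   # software copy already cleared
still_downshifted = model.downshifted    # hardware state untouched by reads
model.interface_down_up()
after_reset = model.downshifted
```

In this model a "sensors" run never lifts the downshift; only the explicit down/up does, which corresponds to policy 2) being the default.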