On 8/5/22 15:07, Jean Delvare wrote:
The wdat_wdt driver is misusing the min_hw_heartbeat_ms field. This field should only be used when the hardware watchdog device should not be pinged more frequently than a specific period. The ACPI WDAT "Minimum Count" field, on the other hand, specifies the minimum timeout value that can be set. This corresponds to the min_timeout field in Linux's watchdog infrastructure. Setting min_hw_heartbeat_ms instead can cause pings to the hardware to be delayed when there is no reason for that, eventually leading to unexpected firing of the watchdog timer (and thus unexpected reboot). I'm also changing max_hw_heartbeat_ms to max_timeout for symmetry, although the use of this one isn't fundamentally wrong, but there is also no reason to enable the software-driven ping mechanism for the wdat_wdt driver.
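To illustrate the distinction drawn above, here is a minimal user-space sketch. It is not the kernel's struct watchdog_device; the struct wd_limits type and both helper functions are hypothetical names made up for this example. It only models the two semantics as described: min_timeout validates a requested timeout, while min_hw_heartbeat_ms defers a ping that arrives too early.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the two limits discussed above.
 * This is an illustration only, not the kernel structure.
 */
struct wd_limits {
	unsigned int min_timeout;         /* seconds: smallest timeout that may be set */
	unsigned int min_hw_heartbeat_ms; /* ms: minimum gap between two hardware pings */
};

/* min_timeout semantics: only validates a requested timeout value. */
bool timeout_is_valid(const struct wd_limits *l, unsigned int timeout_s)
{
	return timeout_s >= l->min_timeout;
}

/*
 * min_hw_heartbeat_ms semantics: a ping requested before the minimum gap
 * has elapsed is deferred. Returns the time at which the hardware would
 * actually be pinged.
 */
uint64_t hw_ping_time_ms(const struct wd_limits *l,
			 uint64_t last_hw_ping_ms, uint64_t request_ms)
{
	uint64_t earliest = last_hw_ping_ms + l->min_hw_heartbeat_ms;

	return request_ms < earliest ? earliest : request_ms;
}
```

With a "Minimum Count" equivalent to 5 seconds stored in min_hw_heartbeat_ms, a ping requested 1 second after the previous one would be held back until the 5 second mark in this model. Stored in min_timeout, the same value only rejects timeouts below 5 seconds and pings go through immediately.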
Normally I would reject this because it is not only unnecessary and unrelated to the problem at hand (remember: one logical change per patch), but it is hidden in an unrelated patch, it will only make life harder later on if/when full milli-second timeouts are introduced, and it may result in unexpected limitations on the maximum timeout. However, Mike accepted it, so who am I to complain.
Signed-off-by: Jean Delvare <jdelvare@xxxxxxx>
Fixes: 058dfc767008 ("ACPI / watchdog: Add support for WDAT hardware watchdog")
Cc: Wim Van Sebroeck <wim@xxxxxxxxxxxxxxxxxx>
Cc: Guenter Roeck <linux@xxxxxxxxxxxx>
Cc: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx>
Cc: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
---
Untested, as I have no supported hardware at hand.

Note to the watchdog subsystem maintainers:

I must say I find the whole thing pretty confusing. First of all, the name symmetry between min_hw_heartbeat_ms and max_hw_heartbeat_ms, while these properties are completely unrelated, is heavily misleading. max_hw_heartbeat_ms is really max_hw_timeout and should be renamed to that IMHO, if we keep it at all.
Variable names are hardly ever perfect. I resist renaming variables to avoid rename wars. Feel free to submit patches to improve the documentation if you like.
Secondly, the coexistence of max_timeout and max_hw_heartbeat_ms is also making the code pretty hard to understand and get right. Historically, max_timeout was already supposed to be the maximum hardware timeout value. I don't understand why a new field with that meaning was introduced, subsequently changing the original meaning of max_timeout to become a software-only limit... but only if max_hw_heartbeat_ms is set.
Code is hardly ever perfect. Feel free to submit patches to help improve understanding if you like.
To be honest, I'm not sold on the idea of a software-emulated maximum timeout value above what the hardware can do, but if doing that makes sense in certain situations, then I believe it should be implemented as a boolean flag (named emulate_large_timeout, for example) to complement max_timeout instead of a separate time value. Is there a reason I'm missing, why it was not done that way?
There are watchdogs with very low maximum timeout values, sometimes less than 3 seconds. gpio-wdt is one example - some have a maximum value of 2.5 seconds. rzn1_wd is even more extreme with a maximum of 1 second. With such low values, accuracy is important, second-based limits are insufficient, and there is an actual need for software timeout handling on top of hardware. At the same time, there is actually a need to make timeouts milli-second based instead of second-based, for uses such as medical devices where timeouts need to be short and accurate. The only reason for not implementing this is that the proposals I have seen so far (including mine) were too messy for my liking, and I never had the time to clean it up. Reverting milli-second support would be the completely wrong direction.
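As a rough illustration of software timeout handling on top of a short hardware limit, the following user-space sketch simulates a worker that keeps pinging the hardware until the requested timeout is about to lapse. It is a simplified model under stated assumptions, not the kernel's implementation: the names emu_result and emulate_timeout are hypothetical, the worker is assumed to ping at half the hardware limit, and the last userspace ping is taken to happen at time 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct emu_result {
	uint64_t expiry_ms;        /* when the hardware would fire, relative to time 0 */
	unsigned int worker_pings; /* pings issued by the software worker */
};

/*
 * Simulate a requested timeout on hardware that cannot count further
 * than hw_max_ms. The worker stops pinging at (timeout_ms - hw_max_ms)
 * so that the hardware fires when the requested timeout has elapsed.
 */
struct emu_result emulate_timeout(uint64_t timeout_ms, uint64_t hw_max_ms)
{
	struct emu_result r = { 0, 0 };
	uint64_t interval, last_allowed, t = 0;

	assert(hw_max_ms >= 2); /* keeps the ping interval non-zero */

	/* The hardware can handle the timeout on its own. */
	if (timeout_ms <= hw_max_ms) {
		r.expiry_ms = timeout_ms;
		return r;
	}

	interval = hw_max_ms / 2;              /* assumed worker ping period */
	last_allowed = timeout_ms - hw_max_ms; /* latest useful worker ping */

	while (t < last_allowed) {
		uint64_t next = t + interval;

		if (next > last_allowed)
			next = last_allowed;
		t = next;
		r.worker_pings++;
	}

	r.expiry_ms = t + hw_max_ms;
	return r;
}
```

For a hardware limit of 2.5 seconds and a requested timeout of 10 seconds, this model issues six worker pings and the hardware fires at the 10 second mark. With second-based limits, the 2.5 second maximum could not even be expressed exactly, which is the accuracy concern raised above.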
Currently, a comment in watchdog.h claims that max_timeout is ignored when max_hw_heartbeat_ms is set. However in watchdog_dev.c, sysfs attribute max_timeout is created unconditionally, and max_hw_heartbeat_ms doesn't have a sysfs attribute. So userspace has no way to know if max_timeout is the hardware limit, or whether software emulation will kick in for a specified timeout value. Also, there is no complaint if both max_hw_heartbeat_ms and max_timeout are set.
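For reference, this is roughly all userspace can do today: read the single max_timeout value. The sketch below is a generic helper for reading one unsigned value from an attribute file; the function name read_uint_attr is hypothetical, and the path in the usage note assumes the device is watchdog0 and that the watchdog sysfs interface is enabled.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_uint_attr(const char *path, unsigned int *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	ok = (fscanf(f, "%u", out) == 1);
	fclose(f);

	return ok ? 0 : -1;
}
```

A caller would use something like read_uint_attr("/sys/class/watchdog/watchdog0/max_timeout", &val). The returned number alone does not tell whether it is the hardware limit or a software-extended one, which is the gap described above.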
As mentioned before, code is hardly ever perfect. Patches to improve the situation are welcome.

Thanks,
Guenter