Hi Jonathan,

thanks for your bug report.

On Tuesday 17 May 2011 09:36:00 Jean Delvare wrote:
> On Mon, 16 May 2011 19:32:52 -0700, Guenter Roeck wrote:
> > On Mon, May 16, 2011 at 02:23:44PM -0400, Jan Wagner wrote:
> > > we got a bugreport[1] against our nagios-plugins package. Unfortunately
> > > we are unsure about what "FAULT" means.
> > > In case this is a hardware problem of a sensor in form of it got
> > > damaged, we would report "CRITICAL", as a problem occured.
> > > If this means there is a problem detecting the sensor or something
> > > software like problem, we would report "UNKNOWN" as this not means a
> > > hardware problem happened.
> >
> > It is supposed to indicate a HW problem. Here is the text describing the
> > sysfs attribute:
> >
> > "Each input channel may have an associated fault file. This can be used
> > to notify open diodes, unconnected fans etc. where the hardware
> > supports it. When this boolean has value 1, the measurement for that
> > channel should not be trusted."
> >
> > Note that "critical" in the hwmon ABI means that a critical limit has
> > been reached. You would get a "critical" alarm in this case. You might
> > have a terminology problem if you use "critical" for a hardware failure.
> >
> > An undetected sensor should not show up in the first place.
>
> In fact, FAULT can happen in two different cases. First case (most
> common) is unused channel by the manufacturer and the channel should
> indeed be ignored. Second case is thermal diode dying or fan stalling,
> and reporting this makes sense. So I would:
> * Ignore sensor channels which report FAULT when you start monitoring.
> * Report FAULT as an actual problem if it happens later during
>   monitoring, for a channel which reported real values before. The
>   terminology is up to you.

(Jean: many thanks for your clarification)

This means that FAULT can be reported both when the hardware is fine
(an unused channel) and when the hardware is actually failing.
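To make Jean's two rules concrete, here is a rough sketch in Python. It is
only an illustration, not code from check_sensors or lm-sensors: the class
name FaultTracker, the channel names, and the convention that a reading of
None stands for FAULT are all assumptions made up for this example.

```python
from typing import Dict, Optional, Set

# Assumption for this sketch: a reading is a float, or None when the
# channel reports FAULT.
Reading = Optional[float]


class FaultTracker:
    """Hypothetical helper that remembers which channels ever gave a real value."""

    def __init__(self) -> None:
        # Channels that have reported at least one real value so far.
        self.seen_ok: Set[str] = set()

    def update(self, readings: Dict[str, Reading]) -> Set[str]:
        """Return the channels whose FAULT should be treated as a real problem."""
        problems: Set[str] = set()
        for name, value in readings.items():
            if value is None:
                # FAULT only counts if the channel worked before; a channel
                # that has faulted since the start is assumed to be unused.
                if name in self.seen_ok:
                    problems.add(name)
            else:
                self.seen_ok.add(name)
        return problems


tracker = FaultTracker()
first = tracker.update({"temp1": 42.0, "temp2": None})   # temp2 faults from the start
later = tracker.update({"temp1": None, "temp2": None})   # temp1 faults after working
```

In this sketch temp2 is never reported because it never delivered a value,
while temp1 is reported once it faults after having worked. A check that is
started fresh on every run would have to store the set of known-good
channels somewhere between runs for this to work.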
On Saturday 26 February 2011 00:07:22 Jonathan Wiltshire wrote:
> The attached patch causes check_sensors to return a critical status if
> faulty sensors are detected.

For nagios-plugins this means that we cannot tell whether there is an
actual problem. check_sensors should therefore report "UNKNOWN" when a
sensor reports "FAULT". Since the cause may not be a problem with the
hardware at all, something like an --ignore-fault option needs to be
implemented too.

With kind regards, Jan.
-- 
Never write mail to <waja@xxxxxxxxxxxxxx>, you have been warned!
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GIT d-- s+: a C+++ UL++++ P+ L+++ E--- W+++ N+++ o++ K++ w--- O M V- PS PE
Y++ PGP++ t-- 5 X R tv- b+ DI D+ G++ e++ h---- r+++ y++++
------END GEEK CODE BLOCK------