Re: Ticket #2382

Jean Delvare <khali@xxxxxxxxxxxx> · Tue, 19 Nov 2013 18:18:57 +0100

Hi Guenter, Mike,

On Tue, 19 Nov 2013 08:38:40 -0800, Guenter Roeck wrote:
> On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
> > 
> > Guenter,
> > 
> > We're evaluating the new card in an open chassis. It is on the test
> > bench with a table fan for cooling. I turned off the fan and got:
> > 
> >     ENTER show_temp
> >     cpu 0 (0)
> >     status_reg @ 19C
> >     eax = 885E0000 edx = 0
> >     temp = 1770 valid = 1
> >     EXIT show_temp
> > 
> > It seems like you've seen this before. What's going on?
> 
> No, I was just throwing darts at a wall with my eyes closed.

Oh, you thought that was a wall? :D

> Seriously, it was just a wild guess. Idea was that the valid bit may be 0
> if the temperature is too low to be even remotely close to the maximum.

That was my theory in ticket #2382, indeed. It was never tested until
today, I think. Thanks, Mike, for doing that.
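
For reference, here is a minimal sketch of how the two fields are
picked out of the low word (eax) of the status register, as far as I
remember the driver logic: bit 31 is the valid flag and bits 22:16 are
the readout in degrees C below TjMax. The struct, macro and function
names are made up for illustration; they are not the driver's own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields in the low 32 bits of the thermal status register (19C). */
#define THERM_STATUS_VALID      (UINT32_C(1) << 31)   /* reading-valid flag */
#define THERM_STATUS_READOUT(x) (((x) >> 16) & 0x7f)  /* bits 22:16 */

/* Hypothetical container for the decoded fields. */
struct therm_reading {
	bool valid;               /* true if the valid bit is set */
	unsigned int below_tjmax; /* degrees C below TjMax */
};

/* Split the raw register value into the valid flag and the readout. */
static struct therm_reading decode_therm_status(uint32_t eax)
{
	struct therm_reading r;

	r.valid = (eax & THERM_STATUS_VALID) != 0;
	r.below_tjmax = THERM_STATUS_READOUT(eax);
	return r;
}
```

With the value from Mike's dump (eax = 885E0000), this gives valid = 1
and a readout of 0x5E, i.e. 94 degrees C below TjMax.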

> For this chip, just to give you an example, the datasheet says that any
> reported temperature below 50 degrees C only means that the temperature
> is below 50 degrees C.

That's a start... I didn't know it was documented. Is it documented for
all CPU models? If we can gather the values at least for all affected
Atom CPU models (as I suppose the value will vary per model) we could
tweak something in the driver.

> Jean, any idea what we can do about this ? Report X degrees C (some constant
> below TjMax) if valid is 0 ?

Well well, we don't really have a sane way to transmit the information
("temperature is below X") down to the monitoring applications. The
sysfs interface has no provision for it, libsensors wouldn't handle it
and "sensors" wouldn't either, of course.

We could hard-code an arbitrarily low temperature as you suggest,
however I'm not sure if we want to do it for all CPU models or only the
ones listed in ticket #2382. My concern is that the Intel specification
doesn't limit "valid = 0" to too low temperature values. They don't
give any detail, so assuming that "too low" is the only reason seems
weird. I remember we saw transient errors on coretemp readings in the
past, but I can't remember if that was on these Atom models (i.e. just
another incarnation of ticket #2382) or other CPU models. I'm afraid we
may start reporting temperature values instead of actual errors if the
fix-up is too broad.
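
To make the narrow variant concrete, here is a rough sketch, assuming a
per-model table of floor values. The table, the placeholder model
number in it and the function name are all hypothetical; the 50 degrees
C floor is only Guenter's example, and returning -EAGAIN for unlisted
models is my assumption about what "keep reporting an error" would look
like.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define THERM_STATUS_VALID (UINT32_C(1) << 31)

/* Placeholder, NOT a real CPU model number. */
#define PLACEHOLDER_MODEL 0xFFFF01u

/* Hypothetical table: models where valid = 0 is taken to mean "cold". */
struct cold_floor {
	unsigned int model;
	long floor_mc;          /* value to report, in millidegrees C */
};

static const struct cold_floor cold_floors[] = {
	{ PLACEHOLDER_MODEL, 50000 },   /* example value only */
};

/*
 * Returns 0 and stores a temperature in *temp_mc, or -EAGAIN if the
 * reading is invalid and the model is not in the table.
 */
static int read_temp_mc(unsigned int model, uint32_t eax,
			long tjmax_mc, long *temp_mc)
{
	size_t i;

	if (eax & THERM_STATUS_VALID) {
		/* Normal case: TjMax minus the readout in bits 22:16. */
		*temp_mc = tjmax_mc - (long)((eax >> 16) & 0x7f) * 1000;
		return 0;
	}

	/* Invalid reading: substitute the floor only for listed models. */
	for (i = 0; i < sizeof(cold_floors) / sizeof(cold_floors[0]); i++) {
		if (cold_floors[i].model == model) {
			*temp_mc = cold_floors[i].floor_mc;
			return 0;
		}
	}

	return -EAGAIN;
}
```

The point of the table is that CPU models not listed keep returning an
error when valid = 0, so transient failures elsewhere are not papered
over.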

Either way, the current situation is rather bad, as "N/A" looks more
like "it's broken" than "it's cold". So I have no objection to crafting
"something" into the driver to make it look better, if you are
motivated to give it a try.

If you are even more motivated and want to extend the sysfs to properly
report the situation to user-space, feel free to do that as well. I
volunteer to review any kernel patch related to this, and to write the
user-space code to deal with it. I'm just not sure it's worth the
effort for just 3 CPU models.

Thanks,
-- 
Jean Delvare
