Re: Ticket #2382

On 11/19/2013 12:53 PM, Guenter Roeck wrote:
On Tue, Nov 19, 2013 at 06:18:57PM +0100, Jean Delvare wrote:
Hi Guenter, Mike,

On Tue, 19 Nov 2013 08:38:40 -0800, Guenter Roeck wrote:
On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
Guenter,

We're evaluating the new card in an open chassis. It is on the test
bench with a table fan for cooling. I turned off the fan and got:

     ENTER show_temp
     cpu 0 (0)
     status_reg @ 19C
     eax = 885E0000 edx = 0
     temp = 1770 valid = 1
     EXIT show_temp

It seems like you've seen this before. What's going on?
No, I was just throwing darts at a wall with my eyes closed.
Oh, you thought that was a wall? :D

Seriously, it was just a wild guess. Idea was that the valid bit may be 0
if the temperature is too low to be even remotely close to the maximum.
That was my theory in ticket #2382, indeed. It was never tested until
today I think, thanks Mike for doing that.

For this chip, just to give you an example, the datasheet says that any
reported temperature below 50 degrees C only means that the temperature
is below 50 degrees C.
That's a start... I didn't know it was documented. Is it documented for
all CPU models? If we can gather the values at least for all affected
Uuh ... I didn't say it was documented. If it is, I don't know about it.
As I said, it was just a wild guess.... even without reading your comment
on the ticket.

Atom CPU models (as I suppose the value will vary per model) we could
tweak something in the driver.

Jean, any idea what we can do about this? Report X degrees C (some constant
below TjMax) if valid is 0?
Well well, we don't really have a sane way to transmit the information
("temperature is below X") down to the monitoring applications. The
sysfs interface has no provision for it, libsensors wouldn't handle it
and "sensors" wouldn't either, of course.

We could hard-code an arbitrarily low temperature as you suggest,
however I'm not sure if we want to do it for all CPU models or only the
ones listed in ticket #2382. My concern is that the Intel specification
doesn't limit "valid = 0" to too low temperature values. They don't
give any detail, so assuming that "too low" is the only reason seems
weird. I remember we saw transient errors on coretemp readings in the
past, but I can't remember if that was on these Atom models (i.e. just
another incarnation of ticket #2382) or other CPU models. I'm afraid we
may start reporting temperature values instead of actual errors if the
fix-up is too broad.

Either way, the current situation is rather bad, as "N/A" looks more
like "it's broken" than "it's cold". So I have no objection to crafting
"something" into the driver to make it look better, if you are
motivated to give it a try.

If you are even more motivated and want to extend the sysfs to properly
report the situation to user-space, feel free to do that as well. I
volunteer to review any kernel patch related to this, and to write the
user-space code to deal with it. I'm just not sure it's worth the
effort for just 3 CPU models.

I'd rather go with an exception table, or rather extend the existing tables.
It is probably somewhat safe to assume that the problem applies to all CPUs
with the same model/mask. Based on that we could declare a "tjmin" and
report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
numbers, would then be 36 degrees C (100 - 64).

If you are ok with that I'll submit a patch for it.

Guenter
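
A rough sketch of the exception-table idea described above, not the actual
patch. All names here (tjmin_entry, tjmin_table, find_tjmin, read_temp_mc)
are hypothetical and do not come from the coretemp driver. The sketch assumes
the usual thermal status layout (bit 31 = valid, bits 22:16 = degrees below
TjMax) and takes the single table entry from the numbers quoted in this
thread (model 0x1c, mask 10, 100 - 64 = 36 degrees C).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one "tjmin" per CPU model/mask pair. */
struct tjmin_entry {
	uint8_t model;   /* CPU model, e.g. 0x1c */
	uint8_t mask;    /* stepping/mask, e.g. 10 */
	int tjmin_mc;    /* value to report when valid == 0, in millidegrees C */
};

/* Only the D5xx entry from the thread; other models would need their own data. */
const struct tjmin_entry tjmin_table[] = {
	{ 0x1c, 10, 36000 },   /* 100 - 64 = 36 degrees C */
};

/* Return the matching table entry, or NULL if this CPU has no tjmin defined. */
const struct tjmin_entry *find_tjmin(uint8_t model, uint8_t mask)
{
	size_t i;

	for (i = 0; i < sizeof(tjmin_table) / sizeof(tjmin_table[0]); i++) {
		if (tjmin_table[i].model == model && tjmin_table[i].mask == mask)
			return &tjmin_table[i];
	}
	return NULL;
}

/*
 * Convert a raw status value (eax) to millidegrees C.
 * Returns false when valid == 0 and no tjmin is defined, so that case is
 * still reported as an error rather than as a made-up temperature.
 */
bool read_temp_mc(uint32_t eax, int tjmax_mc, uint8_t model, uint8_t mask,
		  int *temp_mc)
{
	const struct tjmin_entry *e;

	if (eax & 0x80000000u) {
		/* Assumed layout: bits 22:16 hold degrees below TjMax. */
		*temp_mc = tjmax_mc - (int)((eax >> 16) & 0x7f) * 1000;
		return true;
	}

	e = find_tjmin(model, mask);
	if (e == NULL)
		return false;

	*temp_mc = e->tjmin_mc;
	return true;
}
```

Keeping the fallback behind a per-model lookup leaves every other CPU on the
existing error path, which is the concern raised earlier about making the
fix-up too broad.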

I plotted out the data and I think a fair approximation formula is:

Celsius == (((60/100) * return-value) + 40);

So temperatures below 40 are reported as 40, and temperatures over 100 cause a thermal shut-down, so they don't matter.

Have fun,
Mike
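
For reference, a minimal sketch of the approximation formula above as a C
function. The name approx_celsius is made up for illustration, and the range
of "return-value" is not stated in the thread; the sketch simply applies the
formula to whatever integer it is given. Note that writing (60/100) literally
in C would be integer division and evaluate to 0, so the multiplication is
done first.

```c
/*
 * Approximation from the thread: Celsius = (60/100) * value + 40.
 * Multiply before dividing, because 60 / 100 is 0 in integer arithmetic.
 */
int approx_celsius(int value)
{
	return (60 * value) / 100 + 40;
}
```

With this form a raw value of 0 maps to 40 degrees C and a raw value of 100
maps to 100 degrees C, matching the two end points mentioned above.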

