Re: Ticket #2382

On Tue, Nov 19, 2013 at 08:41:01PM +0100, Jean Delvare wrote:
> On Tue, 19 Nov 2013 09:53:51 -0800, Guenter Roeck wrote:
> > On Tue, Nov 19, 2013 at 06:18:57PM +0100, Jean Delvare wrote:
> > > Hi Guenter, Mike,
> > > 
> > > On Tue, 19 Nov 2013 08:38:40 -0800, Guenter Roeck wrote:
> > > > On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
> > > > > 
> > > > > Guenter,
> > > > > 
> > > > > > We're evaluating the new card in an open chassis. It is on the test
> > > > > bench with a table fan for cooling. I turned off the fan and got:
> > > > > 
> > > > >     ENTER show_temp
> > > > >     cpu 0 (0)
> > > > >     status_reg @ 19C
> > > > >     eax = 885E0000 edx = 0
> > > > >     temp = 1770 valid = 1
> > > > >     EXIT show_temp
> > > > > 
> > > > > It seems like you've seen this before. What's going on?
> > > > 
> > > > No, I was just throwing darts at a wall with my eyes closed.
> > > 
> > > Oh, you thought that was a wall? :D
> > > 
> > > > Seriously, it was just a wild guess. Idea was that the valid bit may be 0
> > > > if the temperature is too low to be even remotely close to the maximum.
> > > 
> > > That was my theory in ticket #2382, indeed. It was never tested until
> > > today I think, thanks Mike for doing that.
> > > 
> > > > For this chip, just to give you an example, the datasheet says that any
> > > > reported temperature below 50 degrees C only means that the temperature
> > > > is below 50 degrees C.
> > > 
> > > That's a start... I didn't know it was documented. Is it documented for
> > > all CPU models? If we can gather the values at least for all affected
> > 
> > Uuh ... I didn't say it was documented. If it is, I don't know about it.
> > As I said, it was just a wild guess.... even without reading your comment
> > on the ticket.
> 
> I must have misread you. What were you talking about when you said
> "For this chip, just to give you an example, the datasheet says that
> any reported temperature below 50 degrees C only means that the
> temperature is below 50 degrees C"?
> 
It does. Sorry, I thought you were referring to the valid bit. My bad.

The exact wording is "Any DTS reading below 50°C should be considered
to indicate only a temperature below 50°C and not a specific temperature".
This is from Intel® Atom™ Processor D400 and D500 Series Datasheet,
Volume 1, "7.1.3 Digital Thermal Sensor".

Just for fun, I also checked the datasheets for Z5xx, Z6xx, N400/N500, and
D2000/N2000. The D2000/N2000 datasheet says "Any temperature below 25 ...",
the others are silent on the subject.

> > > Atom CPU models (as I suppose the value will vary per model) we could
> > > tweak something in the driver.
> > > 
> > > > Jean, any idea what we can do about this ? Report X degrees C (some constant
> > > > below TjMax) if valid is 0 ?
> > > 
> > > Well well, we don't really have a sane way to transmit the information
> > > ("temperature is below X") down to the monitoring applications. The
> > > sysfs interface has no provision for it, libsensors wouldn't handle it
> > > and "sensors" wouldn't either, of course.
> > > 
> > > We could hard-code an arbitrarily low temperature as you suggest,
> > > however I'm not sure if we want to do it for all CPU models or only the
> > > ones listed in ticket #2382. My concern is that the Intel specification
> > > doesn't limit "valid = 0" to too low temperature values. They don't
> > > give any detail, so assuming that "too low" is the only reason seems
> > > weird. I remember we saw transient errors on coretemp readings in the
> > > past, but I can't remember if that was on these Atom models (i.e. just
> > > another incarnation of ticket #2382) or other CPU models. I'm afraid we
> > > may start reporting temperature values instead of actual errors if the
> > > fix-up is too broad.
> > > 
> > > Either way, the current situation is rather bad, as "N/A" looks more
> > > like "it's broken" than "it's cold". So I have no objection to crafting
> > > "something" into the driver to make it look better, if you are
> > > motivated to give it a try.
> > > 
> > > If you are even more motivated and want to extend the sysfs to properly
> > > report the situation to user-space, feel free to do that as well. I
> > > volunteer to review any kernel patch related to this, and to write the
> > > user-space code to deal with it. I'm just not sure it's worth the
> > > effort for just 3 CPU models.
> > 
> > I'd rather go with an exception table, or rather extend the existing tables.
> > It is probably somewhat safe to assume that the problem applies to all CPUs
> > with the same model/mask. Based on that we could declare a "tjmin" and
> > report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
> > temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
> > numbers, would then be 36 degrees C (100 - 64).
> 
> Not sure where you drew the "36" from. From Mike's table it seems the
> valid flag wears off when the reported temperature would be < 6°C. This
> correlates with my findings in the ticket where the valid flag would be
> 0 for 1°C and 4°C.
> 
You are right. No idea myself; maybe it was too early and I didn't have enough coffee.

How about this: define tjmin at X degrees C, and report that temperature if
valid==0 or if the reported temperature is lower. Would that make sense ?

Only question remains what X should be for model 0x1c/10. 25 ? 30 ?
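
A minimal sketch of that idea in plain C, not actual driver code. It assumes
the status register layout coretemp already relies on (valid flag in bit 31,
readout in bits 22:16, temperature = TjMax - readout), and it reads Mike's
"temp = 1770" as hex, i.e. 6000 millidegrees with TjMax at 100 degrees C.
The struct, its field names, and the 25 degrees C floor are hypothetical
placeholders for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define THERM_STATUS_VALID        (1u << 31)
#define THERM_STATUS_READOUT(eax) (((eax) >> 16) & 0x7f)

/* Hypothetical per-model limits, in millidegrees C. */
struct tj_limits {
    int  tjmax_mc;
    bool has_tjmin;   /* true only for models in the exception table */
    int  tjmin_mc;    /* floor to report when the reading is unusable */
};

/*
 * Decode a status register value into millidegrees C.
 * Returns false (an error, shown as "N/A") only if the reading is
 * invalid and no floor is defined for this model.
 */
static bool decode_temp(uint32_t eax, const struct tj_limits *lim, int *temp_mc)
{
    bool valid = (eax & THERM_STATUS_VALID) != 0;
    int raw_mc = lim->tjmax_mc - (int)THERM_STATUS_READOUT(eax) * 1000;

    if (!lim->has_tjmin) {
        /* Unchanged behavior for models without a floor. */
        if (!valid)
            return false;
        *temp_mc = raw_mc;
        return true;
    }

    /* valid == 0, or a reading below the floor: report the floor. */
    if (!valid || raw_mc < lim->tjmin_mc)
        *temp_mc = lim->tjmin_mc;
    else
        *temp_mc = raw_mc;
    return true;
}
```

With Mike's value eax = 885E0000 and a floor of 25 degrees C this would report
25000 instead of 6000. Models without a floor keep returning an error when
valid is 0, which addresses Jean's concern about hiding real errors.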

Thanks,
Guenter

_______________________________________________
lm-sensors mailing list
lm-sensors@xxxxxxxxxxxxxx
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors




