On Tue, Nov 19, 2013 at 08:41:01PM +0100, Jean Delvare wrote:
> On Tue, 19 Nov 2013 09:53:51 -0800, Guenter Roeck wrote:
> > On Tue, Nov 19, 2013 at 06:18:57PM +0100, Jean Delvare wrote:
> > > Hi Guenter, Mike,
> > >
> > > On Tue, 19 Nov 2013 08:38:40 -0800, Guenter Roeck wrote:
> > > > On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
> > > > >
> > > > > Guenter,
> > > > >
> > > > > We're evaluating the new card in an open chassis. It is on the test
> > > > > bench with a table fan for cooling. I turned off the fan and got:
> > > > >
> > > > > ENTER show_temp
> > > > > cpu 0 (0)
> > > > > status_reg @ 19C
> > > > > eax = 885E0000 edx = 0
> > > > > temp = 1770 valid = 1
> > > > > EXIT show_temp
> > > > >
> > > > > It seems like you've seen this before. What's going on?
> > > >
> > > > No, I was just throwing darts at a wall with my eyes closed.
> > >
> > > Oh, you thought that was a wall? :D
> > >
> > > > Seriously, it was just a wild guess. Idea was that the valid bit may be 0
> > > > if the temperature is too low to be even remotely close to the maximum.
> > >
> > > That was my theory in ticket #2382, indeed. It was never tested until
> > > today I think, thanks Mike for doing that.
> > >
> > > > For this chip, just to give you an example, the datasheet says that any
> > > > reported temperature below 50 degrees C only means that the temperature
> > > > is below 50 degrees C.
> > >
> > > That's a start... I didn't know it was documented. Is it documented for
> > > all CPU models? If we can gather the values at least for all affected
> >
> > Uuh ... I didn't say it was documented. If it is, I don't know about it.
> > As I said, it was just a wild guess.... even without reading your comment
> > on the ticket.
>
> I must have misread you. What were you talking about when you said
> "For this chip, just to give you an example, the datasheet says that
> any reported temperature below 50 degrees C only means that the
> temperature is below 50 degrees C"?
>

It does. Sorry, I thought you were referring to the valid bit. My bad.

The exact wording is "Any DTS reading below 50°C should be considered to
indicate only a temperature below 50°C and not a specific temperature".
This is from the Intel® Atom™ Processor D400 and D500 Series Datasheet,
Volume 1, "7.1.3 Digital Thermal Sensor".

Just for fun, I also checked the datasheets for Z5xx, Z6xx, N400/N500, and
D2000/N2000. The D2000/N2000 datasheet says "Any temperature below 25 ...";
the others are silent on the subject.

> > > Atom CPU models (as I suppose the value will vary per model) we could
> > > tweak something in the driver.
> > >
> > > > Jean, any idea what we can do about this ? Report X degrees C (some constant
> > > > below TjMax) if valid is 0 ?
> > >
> > > Well well, we don't really have a sane way to transmit the information
> > > ("temperature is below X") down to the monitoring applications. The
> > > sysfs interface has no provision for it, libsensors wouldn't handle it
> > > and "sensors" wouldn't either, of course.
> > >
> > > We could hard-code an arbitrarily low temperature as you suggest,
> > > however I'm not sure if we want to do it for all CPU models or only the
> > > ones listed in ticket #2382. My concern is that the Intel specification
> > > doesn't limit "valid = 0" to too low temperature values. They don't
> > > give any detail, so assuming that "too low" is the only reason seems
> > > weird. I remember we saw transient errors on coretemp readings in the
> > > past, but I can't remember if that was on these Atom models (i.e. just
> > > another incarnation of ticket #2382) or other CPU models. I'm afraid we
> > > may start reporting temperature values instead of actual errors if the
> > > fix-up is too broad.
> > >
> > > Either way, the current situation is rather bad, as "N/A" looks more
> > > like "it's broken" than "it's cold".
> > > So I have no objection to crafting
> > > "something" into the driver to make it look better, if you are
> > > motivated to give it a try.
> > >
> > > If you are even more motivated and want to extend the sysfs to properly
> > > report the situation to user-space, feel free to do that as well. I
> > > volunteer to review any kernel patch related to this, and to write the
> > > user-space code to deal with it. I'm just not sure it's worth the
> > > effort for just 3 CPU models.
> >
> > I'd rather go with an exception table, or rather extend the existing tables.
> > It is probably somewhat safe to assume that the problem applies to all CPUs
> > with the same model/mask. Based on that we could declare a "tjmin" and
> > report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
> > temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
> > numbers, would then be 36 degrees C (100 - 64).
>
> Not sure where you drew the "36" from. From Mike's table it seems the
> valid flag wears off when the reported temperature would be < 6°C. This
> correlates with my findings in the ticket where the valid flag would be
> 0 for 1°C and 4°C.
>

You are right. No idea myself; maybe it was too early and I didn't have
enough coffee.

How about this: define tjmin at X degrees C, and report that temperature
if valid==0 or if the reported temperature is lower. Would that make sense?

The only remaining question is what X should be for model 0x1c/10. 25? 30?

Thanks,
Guenter

_______________________________________________
lm-sensors mailing list
lm-sensors@xxxxxxxxxxxxxx
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
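
For illustration, here is a minimal sketch of the "tjmin" proposal from
this thread. It is written in Python as a model of the logic only; it is
not the coretemp driver's C code. The function names, the None-means-N/A
convention and the 25 degrees C floor are hypothetical (the thread leaves
X open). TjMax = 100 degrees C is taken from the "(100 - 64)" arithmetic
above. The register layout assumed is the usual one for the thermal status
MSR at 0x19C: bit 31 is the valid flag and bits 22:16 are the readout in
degrees below TjMax.

```python
# Illustrative model only; names and the tjmin value are assumptions.
VALID_BIT = 1 << 31     # bit 31 of the thermal status MSR (0x19C): reading valid
READOUT_SHIFT = 16      # bits 22:16: digital readout, in degrees C below TjMax
READOUT_MASK = 0x7F


def raw_temp_mc(eax, tjmax_mc):
    """Decode the low MSR word into (millidegrees C, valid flag)."""
    valid = bool(eax & VALID_BIT)
    temp = tjmax_mc - ((eax >> READOUT_SHIFT) & READOUT_MASK) * 1000
    return temp, valid


def reported_temp_mc(eax, tjmax_mc, tjmin_mc=None):
    """Apply the proposed floor.

    tjmin_mc is a hypothetical per-model table entry; None means the
    model has no entry, so an invalid reading stays "no value" (N/A).
    """
    temp, valid = raw_temp_mc(eax, tjmax_mc)
    if tjmin_mc is None:
        return temp if valid else None
    if not valid or temp < tjmin_mc:
        return tjmin_mc   # "temperature is at or below tjmin"
    return temp


# Mike's reading from the top of the thread: eax = 885E0000, TjMax = 100 C
print(raw_temp_mc(0x885E0000, 100_000))               # (6000, True); 6000 == 0x1770
print(reported_temp_mc(0x885E0000, 100_000, 25_000))  # 25000
```

Decoded this way, Mike's "temp = 1770" is hexadecimal for 6000 millidegrees,
i.e. 6 degrees C, which matches the threshold Jean mentions. Keeping the
floor tied to a per-model entry leaves "valid = 0" an error on every other
CPU, which addresses the concern about a fix-up that is too broad.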