On Wed, 2011-06-01 at 05:40 -0400, Jean Delvare wrote: > Hi Fenghua, > > On Tue, 31 May 2011 15:56:11 -0700, Yu, Fenghua wrote: > > > On Tue, 2011-05-31 at 17:39 -0400, Yu, Fenghua wrote: > > > > I check the SDM about the MSR_TEMPERATURE_TARGET register. My > > > > understanding is if the MSR is implemented, its bits 16..23 have the > > > > Temperature Target. Otherwise it's a hardware bug. > > This doesn't seem to be the case in practice, see below. > > > > > There is no spec for the Temperature Target range in SDM (can you > > > > find one?). Whatever range we put in coretemp is just speculation. > > > > Strictly speaking, we shouldn't have the range check. > > I tend to agree. > > > > I'd be somewhat hesitant to make such a change right away. Maybe we can > > > replace the check with a check against 0 and display a warning (or even > > > an error) if it reads 0. Would that make sense ? > > Almost, except that I don't think we should complain about value 0, see > below. > > > After a CPU model comes out, the number of Temperature Target is fixed for whatever value. If that value doesn't meet design spec, it could be: > > 1. a design issue which should be detected and fixed during internal design validation; > > 2. a manufacture issue which should be detected and fixed during manufacture process. > > > > There is no spec to claim 0 is wrong for bits 16..23. Check against 0 is still a speculation. If checking against 0, why not check against 1, 2, etc? > > 0 is different because it is what unused/unimplemented MSR bits return. > I have 3 supported CPUs here for my testing: family 6 model 0xe (Core > Duo) in a laptop, family 6 model 0xf (Core2 Duo) in a desktop and > family 6 model 0x1a in a workstation (Xeon). > > On model 0xe, reading from MSR 0x1a2 return an I/O error. On model 0xf, > the MSR exists, but reading from it returns 0 for bits 23:16. On model > 0x1a, it returns a valid temperature limit in bits 23:16 (97ÂC). > > > > > Without any spec guidline, a sanity checking on the bits is trying to capture either a hardware design bug or a hardware manufacture bug. This is not based on any spec. And the effort is very likely fruitless. > > > > A lot of similar places in kernel don't do sanity checking, e.g. bits 0..15 in MSR_IA32_PERF_STATUS. > > I would be perfectly happy to not have any sanity check on the value, > but this requires that we know upfront which CPU models have bits 23:16 > of MSR 0x1a2 implemented, and which do not. As my experiment above > shows, some models have the MSR implemented but not bits 23:16, and > checking for value 0 is the only way to handle them in a generic way. > > According to the SDM, MSR 0x1a2 is supposed to only exist in Nehalem > (i.e. family 6 model 0x1a) and Sandy Bridge CPUs (family and model > unknown to me) but I guess it will also exist in future CPUs. > > I see two possible implementations: > > 1* Go by the SDM. First check that we are on a model which supports > MSR_TEMPERATURE_TARGET before calling it, and always trust the result. > For other models, use the heuristic we had before (adjust_tjmax). The > major drawback is that we will have to update the list of CPU models > supporting MSR_TEMPERATURE_TARGET as they are released by Intel. We > just went away from having to continuously maintain a device list in > the coretemp driver and it seems wrong to go back to this situation > again. > > 2* Go by my experimental result. Read MSR_TEMPERATURE_TARGET > unconditionally, but only use the result if there was no I/O error and > the value isn't 0. In other cases, don't emit any warning message and > simply fallback to the heuristic we had before (adjust_tjmax). This is > closer to what we have at the moment, and has the advantage that it > won't require further updates. > > Implementation 2 clearly has my favors. > Definitely. 1* doesn't look like very compelling. > For stable kernel series, I'd rather just apply the patch I submitted, > because it is the least intrusive option and we aren't aware of any > actual system requiring a more intrusive fix. > Do we really want/need this in -stable ? Just wondering. We do need the second part of 2*, ie unconditionally calling adjust_tjmax(). Thanks, Guenter _______________________________________________ lm-sensors mailing list lm-sensors@xxxxxxxxxxxxxx http://lists.lm-sensors.org/mailman/listinfo/lm-sensors