Hi Fenghua, On Tue, 31 May 2011 15:56:11 -0700, Yu, Fenghua wrote: > > On Tue, 2011-05-31 at 17:39 -0400, Yu, Fenghua wrote: > > > I check the SDM about the MSR_TEMPERATURE_TARGET register. My > > > understanding is if the MSR is implemented, its bits 16..23 have the > > > Temperature Target. Otherwise it's a hardware bug. This doesn't seem to be the case in practice, see below. > > > There is no spec for the Temperature Target range in SDM (can you > > > find one?). Whatever range we put in coretemp is just speculation. > > > Strictly speaking, we shouldn't have the range check. I tend to agree. > > I'd be somewhat hesitant to make such a change right away. Maybe we can > > replace the check with a check against 0 and display a warning (or even > > an error) if it reads 0. Would that make sense ? Almost, except that I don't think we should complain about value 0, see below. > After a CPU model comes out, the number of Temperature Target is fixed for whatever value. If that value doesn't meet design spec, it could be: > 1. a design issue which should be detected and fixed during internal design validation; > 2. a manufacture issue which should be detected and fixed during manufacture process. > > There is no spec to claim 0 is wrong for bits 16..23. Check against 0 is still a speculation. If checking against 0, why not check against 1, 2, etc? 0 is different because it is what unused/unimplemented MSR bits return. I have 3 supported CPUs here for my testing: family 6 model 0xe (Core Duo) in a laptop, family 6 model 0xf (Core2 Duo) in a desktop and family 6 model 0x1a in a workstation (Xeon). On model 0xe, reading from MSR 0x1a2 return an I/O error. On model 0xf, the MSR exists, but reading from it returns 0 for bits 23:16. On model 0x1a, it returns a valid temperature limit in bits 23:16 (97°C). > > Without any spec guidline, a sanity checking on the bits is trying to capture either a hardware design bug or a hardware manufacture bug. This is not based on any spec. And the effort is very likely fruitless. > > A lot of similar places in kernel don't do sanity checking, e.g. bits 0..15 in MSR_IA32_PERF_STATUS. I would be perfectly happy to not have any sanity check on the value, but this requires that we know upfront which CPU models have bits 23:16 of MSR 0x1a2 implemented, and which do not. As my experiment above shows, some models have the MSR implemented but not bits 23:16, and checking for value 0 is the only way to handle them in a generic way. According to the SDM, MSR 0x1a2 is supposed to only exist in Nehalem (i.e. family 6 model 0x1a) and Sandy Bridge CPUs (family and model unknown to me) but I guess it will also exist in future CPUs. I see two possible implementations: 1* Go by the SDM. First check that we are on a model which supports MSR_TEMPERATURE_TARGET before calling it, and always trust the result. For other models, use the heuristic we had before (adjust_tjmax). The major drawback is that we will have to update the list of CPU models supporting MSR_TEMPERATURE_TARGET as they are released by Intel. We just went away from having to continuously maintain a device list in the coretemp driver and it seems wrong to go back to this situation again. 2* Go by my experimental result. Read MSR_TEMPERATURE_TARGET unconditionally, but only use the result if there was no I/O error and the value isn't 0. In other cases, don't emit any warning message and simply fallback to the heuristic we had before (adjust_tjmax). This is closer to what we have at the moment, and has the advantage that it won't require further updates. Implementation 2 clearly has my favors. For stable kernel series, I'd rather just apply the patch I submitted, because it is the least intrusive option and we aren't aware of any actual system requiring a more intrusive fix. -- Jean Delvare _______________________________________________ lm-sensors mailing list lm-sensors@xxxxxxxxxxxxxx http://lists.lm-sensors.org/mailman/listinfo/lm-sensors