[PATCH 1/2] k8temp warn about errata

jean-marc at spaggiari.org (Jean-Marc Spaggiari) · Sun, 26 Oct 2008 21:18:42 -0400

2008/10/25 Jean Delvare <khali at linux-fr.org>:
> Hi Rudolf,
>
> On Thu, 02 Oct 2008 00:09:09 +0200, Rudolf Marek wrote:
>> Following patch adds warning about wrong CPU temperature readouts on all fam f
>> rev f revision of CPUs.
>>
>> Used switch statement, more code changes follows.
>>
>> Signed-off-by: Rudolf Marek <r.marek at assembler.cz>
>>
>> It should "fix" http://bugzilla.kernel.org/show_bug.cgi?id=8866
>
> Thanks for working on this. Review:
>
>> Index: linux-2.6.27-rc7/drivers/hwmon/k8temp.c
>> ===================================================================
>> --- linux-2.6.27-rc7.orig/drivers/hwmon/k8temp.c      2008-09-28 11:01:45.855284456 +0200
>> +++ linux-2.6.27-rc7/drivers/hwmon/k8temp.c   2008-09-28 11:13:42.396117790 +0200
>> @@ -34,6 +34,7 @@
>>
>>  #define TEMP_FROM_REG(val)   (((((val) >> 16) & 0xff) - 49) * 1000)
>>  #define REG_TEMP     0xe4
>> +#define REG_CPUID    0xfc
>>  #define SEL_PLACE    0x40
>>  #define SEL_CORE     0x04
>>
>> @@ -47,6 +48,7 @@
>>       /* registers values */
>>       u8 sensorsp;            /* sensor presence bits - SEL_CORE & SEL_PLACE */
>>       u32 temp[2][2];         /* core, place */
>> +     u8 fam;
>>  };
>>
>>  static struct k8temp_data *k8temp_update_device(struct device *dev)
>> @@ -141,6 +143,7 @@
>>       int err;
>>       u8 scfg;
>>       u32 temp;
>> +
>>       struct k8temp_data *data;
>>       u32 cpuid = cpuid_eax(1);
>>
>
> Unrelated whitespace change, please revert.
>
>> @@ -155,6 +158,18 @@
>>               goto exit;
>>       }
>>
>> +     /* get real PCI based cpuid, prior revF of fam 0Fh, this reg is 0 */
>
> That's tricky. It took me some time to figure out why the warning
> wouldn't be displayed on all family 0Fh CPUs ;)
>
>> +     pci_read_config_dword(pdev, REG_CPUID, &cpuid);
>> +
>> +     data->fam = (cpuid & 0x00000f00) >> 8;
>> +     data->fam += (cpuid & 0x00f00000) >> 20;
>
> The extended family code is an 8-bit value, so this should be:
>
> +       data->fam += (cpuid & 0x0ff00000) >> 20;
>
> I don't know if the result is guaranteed to fit in a u8.
>
>> +
>> +     switch (data->fam) {
>> +             case 0xf:
>
> Coding style: case is normally aligned on switch (see Chapter 1:
> Indentation of Documentation/CodingStyle.)
>
>> +                     dev_warn(&pdev->dev, "Temperature readouts might be wrong"
>> +                                             " - check errata #141\n");
>> +     }
>> +
>>       pci_read_config_byte(pdev, REG_TEMP, &scfg);
>>       scfg &= ~(SEL_PLACE | SEL_CORE);                /* Select sensor 0, core0 */
>>       pci_write_config_byte(pdev, REG_TEMP, scfg);
>
> This is a good start, but I don't think this is enough. I don't expect
> desktop users to search the kernel logs for that kind of information.
> The situation is made worse by the fact that the k8temp driver is one
> of the few hwmon drivers which auto-loads, so many users see the
> reported temperature, not just those who ran sensors-detect. So I'd
> rather implement a more "active" mechanism. I have several ideas in
> mind and would welcome comments on them.
>
> If all revision F and later CPUs are affected by the errata and the
> temperature reported is never correct, we should simply blacklist these
> CPUs. I guess this isn't the case, otherwise you'd have proposed that
> we do that long ago.
>
> If most but not all of these CPUs are affected, then it would make
> sense to disable them by default but give the user a chance to still
> enable them (using a module parameter.)
>
> If there are more revision F and later CPUs with working thermal
> diodes, and working ones can be told from non-working ones based on the
> exact revision, we could implement blacklisting and/or whitelisting
> base on the revision.
>
> Does anyone have an idea about the ratio of working/broken revision F
> and later CPU models?
>
> Does the status (broken or not) of the thermal diode depend on the
> exact CPU revision, or is it purely random?
>
> Alternatively, or additionally, we could use a heuristic to detect
> broken diodes. I don't much like this in general, as monitoring should
> normally not assume anything on what it is monitoring, but in this
> specific case I think it makes some sense because the thermal sensor in
> question is not a generic chip and brokenness is so widely spread.
>
> When looking at reports for broken K8 diodes, for example at:
> http://bugzilla.kernel.org/show_bug.cgi?id=8866
> it is immediately visible to a human that:
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:  +17 C
> Core0 Temp:   +3 C
> Core1 Temp:  +21 C
> Core1 Temp:   +5 C
>
> are NOT reasonable temperatures for a CPU.
>
> Other obviously wrong outputs (gathered on the lm-sensors mailing list):
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:   +4?C
> Core0 Temp:   -1?C
> Core1 Temp:   -2?C
> Core1 Temp:  -20?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> temp1:        -9?C
> temp2:       -20?C
> temp3:       -18?C
> temp4:       -14?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:    -6?C
> Core0 Temp:   -11?C
> Core1 Temp:    +5?C
> Core1 Temp:   -13?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:   -1?C
> Core0 Temp:   +1?C
> Core1 Temp:   +6?C
> Core1 Temp:   +6?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:   +9?C
> Core0 Temp:   +0?C
> Core1 Temp:   +3?C
> Core1 Temp:  +11?C
>
> The following outputs, OTOH, look pretty good (also gathered on the
> lm-sensors mailing list):
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:  +42.0?C
> Core0 Temp:  +39.0?C
> Core1 Temp:  +36.0?C
> Core1 Temp:  +39.0?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:  +43.0?C
> Core1 Temp:  +37.0?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:  +41.0?C
> Core1 Temp:  +46.0?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:  +38.0?C
> Core1 Temp:  +25.0?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:  +31.0?C
> Core1 Temp:  +32.0?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:  +32?C
> Core1 Temp:  +35?C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core1 Temp:   +43?C
> Core2 Temp:   +46?C
>
> There were also a few reports which I'm not sure if there are good or
> bad:
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:  +26 C
> Core1 Temp:  +24 C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp:  +17?C
> Core1 Temp:  +25?C
>
> So I am under the impression that it should not be too difficult to
> come up with a heuristic to spot obviously broken sensors:
> * Apparently all affected CPUs have 4 thermal sensors (I don't know if
>  this is because all revision F and later CPUs have 4 thermal sensors?)
> * Average of the 4 sensors is below 15 degrees C.
> * Minimum of the 4 sensors is below 5 degrees C
>
> So we could build a heuristic based on the number of sensors and either
> the average or the minimum temperature and reject by default all
> revision F and later CPUs which do not pass the test. We would of
> course provide a way (module parameter) for the user to change the
> temperature limit or maybe disable the heuristic entirely.
>
> Of course this isn't a perfect solution, as there are still CPUs in the
> gray zone for which no heuristic can say whether they are good or bad.
> Also, there is a small risk that a CPU which is at the limit will be
> rejected sometimes but not always, so hwmon device numbering may change
> from one boot to the next. However, I think that a good heuristic would
> solve way more problems than it would cause.
>
> Comments anyone?
>
> Thanks,
> --
> Jean Delvare

Sound like a good idea to me

For those values:

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:  +26 C
Core1 Temp:  +24 C

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:  +17?C
Core1 Temp:  +25?C

I think they are correct. In winter, I place my computer outside
(-10C/-15C). The idle core temp can be pretty low. So it's important
to give the user a way to force the sensor activation or adjust the
temperatures level for the heuristic.

So in this statement:
* Average of the 4 sensors is below 15 degrees C.
* Minimum of the 4 sensors is below 5 degrees C

The user need to be able to update 15 and 5.

Also, since in all examples the average temperature is lower than 15
degrees C (-4,75;  -15,25; -6,25; 3; 5,75) I will recommand to put it
closer to 10 degrees C.

Just my 2?.

--
JM