Re: Coretemp misreading?

Hi Ed,

Please don't top-post; it makes quoting you in replies difficult.

> On 26/10/2010 10:25, Ed W wrote:
> > Hi, I have a pair of (identical) new supermicro machines and had some 
> > significant issues getting accurate temps out of them.  Bios updates 
> > (betas) have fixed most of them, but I'm finally stuck with the CPU 
> > temps reading exceptionally high using the kernel coretemp module
> >
> > Board is the supermicro X8SIE-LN4F.  Processor is an intel L3426 (ie 
> > the 45W part).
> >
> > Sensors 3.1.2 reads (loaded):
> >
> > coretemp-isa-0000
> > Adapter: ISA adapter
> > Core 0:      +81.0C  (high = +94.0C, crit = +100.0C)
> >
> > coretemp-isa-0001
> > Adapter: ISA adapter
> > Core 1:      +79.0C  (high = +94.0C, crit = +100.0C)
> >
> > coretemp-isa-0002
> > Adapter: ISA adapter
> > Core 2:      +81.0C  (high = +94.0C, crit = +100.0C)
> >
> > coretemp-isa-0003
> > Adapter: ISA adapter
> > Core 3:      +77.0C  (high = +94.0C, crit = +100.0C)
> >
> > Idle the temps are around 55-60C

This is hot. My E5520 is around 45-50°C idle, and 62°C tops under heavy
load. This is all with the CPU fan under control to keep my machine
quiet. And this is an 80W TDP model.

> > There is no bios CPU temp to compare against, all I get is a "Low" 
> > reading normally, going up to a "Med" reading when the temps are as 
> > above (there is also a "high", but unknown when that kicks in)

Not sure how you can compare with what the BIOS reports, as you can't
affect the load when in the BIOS setup screen. Furthermore, without a
clue what temperature ranges correspond to "low", "med" and "high",
this isn't useful.

> > Additionally there are some temps shown through the sensors module 
> > (these were affected by the bios fixes)
> >
> > temp1:       +36.0C  (high = +60.0C, hyst = +55.0C)  sensor = thermistor
> > temp2:       +78.5C  (high = +95.0C, hyst = +92.0C)  sensor = diode
> > temp3:       +30.0C  (high = +80.0C, hyst = +75.0C)  sensor = thermistor
> >
> >
> > The bottom one coincides with the bios "System Temp" and is believed 
> > to be the thermistor at the back of the board.  Unknown what the other 
> > two are.  The Temp2 tends to look a little like the average of the CPU 
> > temps though?

temp1 seems reasonable as another thermistor on the board. temp2 is
likely to be a thermal diode in the CPU. Note how the temperature value
reported by temp2 correlates with the ones from coretemp.

> > Basic sanity check though.  I take the lid off the machine and the 
> > heatsink is completely cold at idle and slightly warm under load...  I 

This seems wrong. The heat generated by the CPU must go somewhere. Even
if you have a low power CPU model (for a Xeon, that is), the heatsink
should still be warm at idle and warmer under load. Unless you have an
8000 RPM fan mounted on it.

I suspect this is your actual problem: the heatsink doesn't do its job
properly. Either it's not properly mounted, or you need (better)
thermal paste, or the model is not suitable for the CPU/socket and you
have to try something else.

> > have reseated the heatsink on one of the machines as a sanity check, 
> > no difference.  Both machines are reporting the same range of 
> > temperatures under similar loads.  I checked temps using kernel 2.6.32 
> > and 2.6.36 with similar results
> >
> > My best guess is that the CPU temp sensor is mis-reading.  However, is 
> > this possible?

One thermal sensor mis-reading is always possible. But in your case,
you have 4 digital sensors + 1 analog sensor returning the same
temperature value. This rules out the possibility of a defective sensor.
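
As a rough illustration of that cross-check, the sketch below compares the
four coretemp readings with the analog temp2 reading, using the figures
quoted earlier in this thread. The function name and the 5 degree tolerance
are assumptions made for the example, not anything defined by the drivers.

```python
from statistics import mean

# Readings under load, copied from the report quoted in this thread.
CORE_TEMPS_C = [81.0, 79.0, 81.0, 77.0]  # coretemp digital sensors
DIODE_TEMP_C = 78.5                      # temp2, analog thermal diode


def sensors_agree(digital, analog, tolerance_c=5.0):
    """Return True when independent sensors tell the same story.

    tolerance_c is an assumed threshold chosen for illustration only.
    """
    # How far apart the per-core digital sensors are from each other.
    spread = max(digital) - min(digital)
    # How far the analog diode is from the average of the digital sensors.
    offset = abs(mean(digital) - analog)
    return spread <= tolerance_c and offset <= tolerance_c


if __name__ == "__main__":
    print(sensors_agree(CORE_TEMPS_C, DIODE_TEMP_C))
```

With these numbers the cores differ by at most 4 degrees and the diode sits
1 degree below their average, so a single faulty sensor is an unlikely
explanation.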

> > I understand the CPU temp is specified as an inversion 
> > from the max value and hence in any case if the observed temp is wrong 
> > then the point is still that my "margin" for safety is something like 
> > the 94C - 81C = 13C from max temp?

Correct. The coretemp driver doesn't guarantee that the absolute
values are correct, however the thermal margin should be, in particular
when it is small. So the limitations of the digital sensors aren't the
problem here. You really have a thermal margin which is too small.
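
To make the margin arithmetic concrete, here is a minimal sketch that
parses lines in the format of the "sensors" output quoted in this thread
and computes the distance to the "high" and "crit" limits. The regular
expression and the function name are assumptions for this example and only
handle that one line format.

```python
import re

# Matches lines such as:
#   "Core 0:      +81.0C  (high = +94.0C, crit = +100.0C)"
# The degree sign is optional, as it is missing from the quoted output.
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)\s*°?C\s+"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)\s*°?C, "
    r"crit = \+?(?P<crit>-?\d+(?:\.\d+)?)\s*°?C\)"
)

SAMPLE = """\
coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +81.0C  (high = +94.0C, crit = +100.0C)

coretemp-isa-0003
Adapter: ISA adapter
Core 3:      +77.0C  (high = +94.0C, crit = +100.0C)
"""


def thermal_margins(sensors_output):
    """Map each sensor label to (margin to high, margin to crit) in degrees C."""
    margins = {}
    for line in sensors_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # chip names, adapter lines, blank lines
        temp = float(match.group("temp"))
        margins[match.group("label")] = (
            float(match.group("high")) - temp,
            float(match.group("crit")) - temp,
        )
    return margins


if __name__ == "__main__":
    for label, (to_high, to_crit) in thermal_margins(SAMPLE).items():
        print(f"{label}: {to_high:.1f} C below high, {to_crit:.1f} C below crit")
```

For Core 0 this gives the 13 degree margin to the "high" limit mentioned
above.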

> > This machine needs to last long term, I'm currently a bit worried 
> > about the reported temps - can anyone please shed some light on 
> > whether this is just a bad reading and should be ignored?

I would be worried as well, and no, I don't think you can ignore the
issue.

On Thu, 28 Oct 2010 16:59:48 +0100, Ed W wrote:
> Not that many folks dived in, but the conclusion to this was that 
> swapping the L3426 processor with an X series fixed the mis-reading.  I 
> presume therefore that this looks more like a kernel issue than an 
> lm-sensors issue?  Who to file bug reports to about coretemp mis-reads?

How does the heatsink feel now? Hotter than before?

Did you compare the power consumption of the system with each CPU model?

Bugs against the coretemp driver should be reported at
bugzilla.kernel.org. However, in your case I don't see anything that can
be done: the analog sensor reports the same value as the coretemp
digital sensors, so if anything is really wrong, that would be the
hardware.

-- 
Jean Delvare
http://khali.linux-fr.org/wishlist.html

_______________________________________________
lm-sensors mailing list
lm-sensors@xxxxxxxxxxxxxx
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors