On Thu, Nov 20, 2008 at 10:27 AM, Rudi Ahlers <rudiahlers@xxxxxxxxx> wrote:
> On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch
> <niftycluster@xxxxxxxxxxxx> wrote:
>> On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote:
>>> On Sat, Nov 15, 2008 at 7:26 PM, Vandaman <vandaman2002-sk@xxxxxxxxxxx> wrote:
>>> > Rudi Ahlers wrote:
>>> >
>>> >> We have a server which locks up about once a week (for the
>>> >> past 3
>> ......
>>> >> How do I debug the server, which runs CentOS 5.2, to see why
>>> >> it locks up?
>>
>> Jumping in the middle of a long list of good ideas.
>> Other things to try --
>>    change the run level
>>       if 5 switch to 3
>>       if 3 switch to 5
>>
>> Reinstall the processor --
>>    remove the processor
>>    clean the heat sink and processor of thermal compound
>>    correctly apply the best thermal grease you can get (I like Arctic Silver)
>>    reinstall the heat sink
>>    consider upgrading the processor heat sink if the chassis permits (more Cu is good).
>>
>> Add thermal spreaders to your RAM. You want all the chips on a RAM stick at the same temp.
>>
>> Run chkconfig cpuspeed off if it is on (powersaved on some distros); if it is off, toggle it on.
>>
>> Turn off any special system monitoring software tools. Things like I2C serial buses
>> do not isolate simple read-only activity from things that might modify (shut
>> down) the system. I have seen sites install bluesmoke tools even though the kernel
>> already had EDAC installed. The two tools had overlapping, uncoordinated interactions
>> with the hardware and would randomly shut down the system. Very new boards are almost
>> never supported well, so consider going blind. Read the EDAC info on the CentOS and RH sites.
>>
>> Inspect, then tidy, all cables; they can mess up air flow and cause thermal issues.
>>
>> Reset the BIOS and check all the BIOS options. Check for a BIOS update from the vendor.
>> When updating the BIOS, do an NVRAM reset. The data structures of the old BIOS and the new
>> one may differ.
>> The keyboard sequence to reset a BIOS to all defaults may require
>> a call to tech support. Call the vendor.. you have a warranty on a new board.
>>
>> Since a hardware tty is not possible, log in (ssh) and run a "while /bin/true" script
>> that lets you see memory, processes and the exact time things fail, or just run "top".
>> It is possible to have syslog also log to the pty of an ssh session.
>> When you return to the cage, plug in a terminal. If there is no screen saver or
>> screen blanking, the GFX card may still display the last key bits of info,
>> so long as X is not running.
>>
>>
>> --
>> T o m M i t c h e l l
>> Found me a new hat, now what?
>>
>> _______________________________________________
>
>
> Thanx Tom,
>
> You gave some good ideas, and I've been through all of them. As a
> general rule of thumb, I only purchase RAM with factory-fitted
> heatsinks attached. The chassis is a 1U chassis, so space is
> limited, and only the necessary cables are installed, and they are
> already tidied up.
>
> After spending another 2 days in the datacentre trying to figure this
> one out, I thought I'd take the machine to the office instead. It's
> just so much nicer working in the office :)
>
> Top didn't help much, since I couldn't see what was wrong. But while
> sitting at my desk and running some tests, I noticed that the fan was
> running so loud at times that I couldn't even talk to someone on the
> phone. This is when I realized that the Q9300 CPU could be too big a
> processor for the fan that I have installed.
>
> The fan that I have is:
> http://www.dynatron-corp.com/products/cpucooler/cpucooler_model.asp?id=165
>
> So, it looks like it's not really made for a Q9300 CPU, although their
> specs say it is.
>
>
> --
>

As an interesting side note, with all the other servers & cabinets in
the datacentre, the dB level is so high that it's difficult to pick out
a fan that's blowing at full force the whole time. Only at the office
could I hear it.
My own PC is totally fan & noise free, so I could easily hear when the
fan was running normally and when it was running at full speed. That
only happened once I started the VPSs on the server, which is also when
I could no longer ping or SSH to it over the network. Top reported a
load of 12 - 15, which is normally still workable, but with the CPU
overheating I couldn't do a thing.

--

Kind Regards
Rudi Ahlers
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos
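
For reference, the logging loop Tom suggests in the quoted text (a "while /bin/true" script, run over ssh, that shows memory and the exact time things fail) could be sketched roughly as below. This is an illustration only, not something posted in the thread. It assumes a Python 3 interpreter is available on the box, which a stock CentOS 5 install does not provide. The function names, the log file name and the 5-second interval are all hypothetical. It records only load averages and free memory; a temperature reading would need hardware-specific tools and is left out.

```python
"""Sketch of a lock-up logger: append timestamped load and memory figures."""
import os
import time


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def parse_meminfo(text):
    """Return a dict of /proc/meminfo values (kB for most fields), keyed by name."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_record(timestamp, loadavg, meminfo):
    """Build one log line; fields missing from meminfo are reported as -1."""
    return "%s load=%.2f/%.2f/%.2f memfree_kb=%d swapfree_kb=%d" % (
        timestamp,
        loadavg[0], loadavg[1], loadavg[2],
        meminfo.get("MemFree", -1),
        meminfo.get("SwapFree", -1),
    )


def read_text(path):
    """Read a small text file such as a /proc entry."""
    with open(path) as handle:
        return handle.read()


def log_forever(log_path="lockup.log", interval=5):
    """Append a record every `interval` seconds and force it to disk.

    log_path and interval are hypothetical defaults, not values from the thread.
    """
    while True:
        record = format_record(
            time.strftime("%Y-%m-%d %H:%M:%S"),
            parse_loadavg(read_text("/proc/loadavg")),
            parse_meminfo(read_text("/proc/meminfo")),
        )
        print(record, flush=True)  # stays visible in the ssh session
        with open(log_path, "a") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())  # so the last line survives a hard lock-up
        time.sleep(interval)
```

To use it, add a final line calling log_forever() and start the script from the ssh session. After a lock-up, the last line in the log (or the last line left on the terminal) gives the time of failure and the load just before it. The fsync call is there because data still sitting in the page cache would be lost when the machine hangs.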