On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch <niftycluster@xxxxxxxxxxxx> wrote:
> On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote:
>> On Sat, Nov 15, 2008 at 7:26 PM, Vandaman <vandaman2002-sk@xxxxxxxxxxx> wrote:
>> > Rudi Ahlers wrote:
>> >
>> >> We have a server which locks up about once a week (for the
>> >> past 3
> ......
>> >> How do I debug the server, which runs CentOS 5.2 to see why
>> >> it locks
>> >> up?
>
> Jumping in the middle of a long list of good ideas.
> Other things to try --
>     change the run level
>         if 5 switch to 3
>         if 3 switch to 5
>
> Reinstall the processor--
>     remove the processor
>     clean the heat sink and processor of thermal compound
>     correctly apply the best thermal grease you can get (I like Arctic Silver)
>     reinstall the heat sink
>     consider upgrading the processor heat sink if the chassis permits
>     (more Cu is good).
>
> Add thermal spreaders to your RAM. You want all the chips on a RAM
> stick at the same temp.
>
> Chkconfig cpuspeed off if it is on (powersaved on some distros); if off,
> toggle to on.
>
> Turn off any special system monitoring software tools. Things like I2C
> serial buses do not isolate simple read-only activity from things that
> might modify (shut down) the system. I have seen sites install bluesmoke
> tools yet the kernel had EDAC installed. The two tools had overlapping,
> uncoordinated interactions with the hardware and would randomly shut down
> the system. Very new boards are almost never supported well, so consider
> going blind. Read the EDAC info on the CentOS and RH sites.
>
> Inspect, then tidy, all cables; they can mess up air flow and cause
> thermal issues.
>
> Reset the BIOS and check all the BIOS options. Check for a BIOS update
> from the vendor. When updating the BIOS, do an NVRAM reset. The data
> structures of the old BIOS and the new one may differ. The keyboard
> sequence to reset a BIOS to all defaults may require a call to tech
> support. Call the vendor.. you have a warranty on a new board.
>
> Since a hardware tty is not possible, log in (ssh) and run a
> "while /bin/true" script that lets you see memory, processes and the
> exact time things fail, or just run "top".
> It is possible to have syslog also log to the pty of an ssh session.
> When you return to the cage, plug in a terminal. If there is no screen
> saver or screen blanking, the GFX card may still display the last key
> bits of info so long as X is not running.
>
>
> --
>        T o m  M i t c h e l l
>        Found me a new hat, now what?
>

Thanx Tom,

You gave some good ideas, and I've been through all of them. As a
general rule of thumb, I only purchase RAM with factory-fitted
heatsinks. The chassis is a 1U chassis, so space is limited, and only
the necessary cables are installed and they are already tidied up.

After spending another 2 days in the datacentre trying to figure this
one out, I thought I'd take the machine to the office instead. It's just
so much nicer working in the office :)

Top didn't help much, since I couldn't see what was wrong. But while
sitting at my desk and running some tests, I noticed that the fan was
running so loud at times that I couldn't even talk to someone on the
phone. This is when I realized that the Q9300 CPU could be too big a
processor for the fan that I have installed.

The fan that I have is:
http://www.dynatron-corp.com/products/cpucooler/cpucooler_model.asp?id=165

So, it looks like it's not really made for a Q9300 CPU, although their
specs say it is.

--

Kind Regards
Rudi Ahlers
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos
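
For anyone who wants to try the "while /bin/true" logging loop suggested
in the thread, here is a minimal sketch of one way it could look. It is
not the script Tom had in mind. The function name log_snapshots, its
arguments and the log line layout are assumptions made for illustration.
It relies only on /proc/loadavg, /proc/meminfo, ps, date and sync, and
prints "n/a" for anything it cannot read.

```shell
#!/bin/bash
# Hypothetical helper (name and arguments are illustrative, not from the thread).
# Appends one timestamped status line to LOGFILE every INTERVAL seconds.
# Usage: log_snapshots COUNT INTERVAL_SECONDS LOGFILE
#        COUNT=0 means "loop until killed or until the box locks up".
log_snapshots() {
    local count=$1 interval=$2 logfile=$3
    local i=0 load memfree busiest

    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # 1, 5 and 15 minute load averages.
        load=$(cut -d' ' -f1-3 /proc/loadavg 2>/dev/null || echo "n/a")

        # Free memory in kB as reported by the kernel.
        memfree=$(awk '/^MemFree:/ {print $2}' /proc/meminfo 2>/dev/null)

        # Process currently using the most CPU (line 2 skips the ps header).
        busiest=$(ps -eo pcpu,comm --sort=-pcpu 2>/dev/null |
                  awk 'NR==2 {print $2 "(" $1 "%)"}')

        printf '%s load=%s memfree_kb=%s busiest=%s\n' \
            "$(date '+%Y-%m-%d %H:%M:%S')" \
            "$load" "${memfree:-n/a}" "${busiest:-n/a}" >> "$logfile"

        # Flush to disk so the last line has a chance to survive a hard lockup.
        sync

        i=$((i + 1))
        sleep "$interval"
    done
}
```

Run from an ssh session, something like "log_snapshots 0 5 /var/log/lockup-trace.log"
(the path is only an example) would write a line every five seconds
until the machine stops responding. The timestamp on the last line then
gives a rough time of failure to compare against syslog.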