On Sat, Nov 15, 2008 at 4:47 PM, Richard Karhuse <rkarhuse@xxxxxxxxx> wrote:
>
> On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers <rudiahlers@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> We have a server which locks up about once a week (for the past 3
>> weeks now), without any warning, and the only way to recover it, is to
>> reset the server. This causes unwanted downtime, and often software
>> loss as well.
>>
>> How do I debug the server, which runs CentOS 5.2 to see why it locks
>> up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
>> Motherboard
>
> Attach a local console to the video port and let us know what it says -->
> that will (probably) be very insightful. E.g., kernel panic, MCE, ....
>
> Next, run memtest86+ -- at least overnight. [Note: I've had less than
> stellar results with memtest86 recently, but if it shows errors, you've got
> a problem big time; if it doesn't show errors, you're still not 100% sure
> that memory is good :-) :-).] Is it ECC memory?? If not, why not --
> particularly given it is a critical server ....
>
> Are all the fans spinning -- particularly the CPU?? Do you have lm-sensors
> enabled?? Either create a script or use something like munin to track
> things and see if fans, temperature, voltages are all stable & within
> range up to death.
>
> Can you easily swap power supplies?? (Is the unit dual powered or just
> one unit?)
>
> Clearly, just a start, but you get the idea of elementary, 101 problem
> solving ....
>
> -rak-

Unfortunately, I can't leave a monitor attached to the server all the
time. The server is in a shared cabinet at a 3rd-party ISP, and they lock
the cabinets once we're done working in them.

The last lockup was about 6 days ago, and the previous one about 8 days
before that. There's no consistency.

How can I redirect all console output to a file instead?

I have got lm-sensors installed, but it doesn't pick up the motherboard's
sensors.
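
For reference, the tracking script suggested in the quoted reply could look roughly like the sketch below. It is only an illustration under a few assumptions: a Python 3 interpreter is available (not what CentOS 5 ships by default), the `sensors` command from lm-sensors prints something useful on this board (which may not be the case here), and the log path is a made-up example. Each sample is flushed and synced to disk, so the last reading before a lockup should survive the reset.

```python
import os
import subprocess
import time


def log_sample(path, command):
    """Run `command` once and append its timestamped output to `path`."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
        output = result.stdout
    except (OSError, subprocess.SubprocessError) as exc:
        # Record the failure instead of exiting, so logging keeps going.
        output = "command failed: %s\n" % exc

    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as log:
        log.write("=== %s ===\n%s\n" % (stamp, output))
        log.flush()
        # Force the data to disk: after a hard lockup, anything still
        # sitting in the page cache is lost.
        os.fsync(log.fileno())


def run(path, command, interval_seconds=60, samples=None):
    """Log one sample per interval; loop forever when samples is None."""
    taken = 0
    while samples is None or taken < samples:
        log_sample(path, command)
        taken += 1
        if samples is None or taken < samples:
            time.sleep(interval_seconds)
```

Calling `run("/var/log/sensors-trace.log", ["sensors"])` would then take a reading every minute until the machine dies; the last few entries in the file show what the readings looked like just before the lockup. Any other command that prints temperatures or fan speeds can be passed in place of `["sensors"]`.
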

All fans were working when I last checked, but it's a 1U chassis, so it
has limited airflow. I don't know whether it gets too hot or not. When I
rebooted it, the temperature was about 45 degrees Celsius, but the lockup
only happened about 6 days later. So, I can't even sit there 24/7 to see
what happens.

-- 
Kind Regards
Rudi Ahlers
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos