Re: how to debug hardware lockups?

On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers <rudiahlers@xxxxxxxxx> wrote:
Hi,

We have a server which locks up about once a week (for the past 3
weeks now), without any warning, and the only way to recover it is to
reset the server. This causes unwanted downtime, and often software
loss as well.

How do I debug the server, which runs CentOS 5.2, to see why it locks
up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
motherboard.

Attach a local console to the video port and let us know what it says after
the next lockup; that will (probably) be very insightful, e.g., a kernel
panic, an MCE (machine check exception), ....

Next, run memtest86+, at least overnight.  [Note: I've had less than
stellar results with memtest86 recently: if it shows errors, you've got
a big problem; if it doesn't show errors, you still can't be 100% sure that
the memory is good. :-)]  Is it ECC memory?  If not, why not, particularly
given that it is a critical server?

Are all the fans spinning, particularly the CPU fan?  Do you have lm-sensors
enabled?  Either write a script or use something like munin to track things
and see whether fans, temperatures, and voltages are all stable and within
range right up to the lockup.
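
If you go the script route, here is a rough sketch of what I mean, not a
finished tool. It assumes lm-sensors is installed and configured so that the
`sensors` command prints readings in its usual "label: value unit" form, and
that a Python 3 interpreter is available (an older box may need the syntax
adapted). The names parse_sensors, log_forever, and sensors.log are made up
for illustration. Each write is flushed and fsync'ed so the last readings
before a lockup have a chance of reaching the disk.

```python
import os
import re
import subprocess
import time

# Matches `sensors` text lines such as "fan1:  2500 RPM" or "Core 0:  +45.0°C".
# The exact output format depends on the chip driver; this pattern is an assumption.
READING = re.compile(r"^([^:]+):\s+([+-]?\d+(?:\.\d+)?)\s*(°C|C|V|RPM)\b")


def parse_sensors(text):
    """Return {label: (value, unit)} for every line that looks like a reading."""
    readings = {}
    for line in text.splitlines():
        match = READING.match(line.strip())
        if match:
            label = match.group(1).strip()
            readings[label] = (float(match.group(2)), match.group(3))
    return readings


def log_forever(path="sensors.log", interval=60):
    """Append timestamped readings to `path` every `interval` seconds."""
    while True:
        result = subprocess.run(["sensors"], capture_output=True, text=True)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(path, "a") as log:
            for label, (value, unit) in sorted(parse_sensors(result.stdout).items()):
                log.write(f"{stamp} {label}={value}{unit}\n")
            # Force the data to disk so it survives a hard lockup.
            log.flush()
            os.fsync(log.fileno())
        time.sleep(interval)
```

Call log_forever() from a small wrapper (or start it at boot), then after the
next lockup look at the tail of the log for a fan that stopped, a temperature
that climbed, or a voltage that drifted.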

Can you easily swap power supplies?  (Does the unit have dual power
supplies or just one?)

Clearly this is just a start, but you get the idea: elementary, 101-level problem solving ....

   -rak-
