Re: Machine hang - how to know what happens?

Shreyansh Jain wrote:
Dear All,

I have an Intel desktop machine with a P4 processor and 512MB RAM. It
runs a custom 2.6.25 kernel (custom because the config was changed to
de-select unnecessary options and select required ones before
compilation) on top of the SLES10 distro.

I have noticed that this machine tends to hang after running
uninterrupted for a certain number of days. There is no fixed pattern
(no fixed number of days): hangs may occur as often as every 2-3 days
or as late as 7 days apart.

I have noticed this happening for no apparent reason. This machine is
being used as an ssh box containing a repository of kernel sources -
that's it. There is no configured web server or background application
running on it.

Problem:
1. The hang is such that there is nothing on the display, and hence I
cannot see what state the machine is in (not that I expect that
would help in such a case).
2. There is nothing unusual in /var/log/messages, /var/log/warn,
/var/log/mcelog ... and many other log files.
3. There is no crash dump either, even though I have configured
kexec/kdump on this machine. It works, because I tested it by
triggering a crash using sysrq.
4. There are no kernel messages about any failed device or similar
things in past logs (once I have rebooted).

Output of /var/log/messages from one of the most recent stalls is:

---8<----
Jun  9 04:25:35 DogMatix syslog-ng[3516]: STATS: dropped 0
Jun  9 05:25:36 DogMatix syslog-ng[3516]: STATS: dropped 0
Jun  9 06:25:36 DogMatix syslog-ng[3516]: STATS: dropped 0
Jun  9 07:25:36 DogMatix syslog-ng[3516]: STATS: dropped 0
Jun  9 08:25:36 DogMatix syslog-ng[3516]: STATS: dropped 0
Jun  9 09:25:36 DogMatix syslog-ng[3516]: STATS: dropped 0
Jun  9 12:44:12 DogMatix syslog-ng[2732]: syslog-ng version 1.6.8 starting
----8<----

Notice that syslog is printing something each hour, and then there is
a stall after 09:25. The last line is the bootup message after
hard-rebooting the machine.
Dogmatix is the name of the machine.
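
The hourly syslog-ng STATS line acts as a heartbeat, so the approximate
time of each stall can be recovered by looking for unusually long
silences in the log. A minimal Python sketch of that idea follows. The
function name find_gaps, the 90-minute threshold, and the year
parameter are assumptions made for illustration (classic syslog
timestamps carry no year, and the default of 2000 is an arbitrary
placeholder). Month names are parsed with %b, so this assumes an
English or C locale.

```python
from datetime import datetime, timedelta


def find_gaps(lines, max_gap=timedelta(minutes=90), year=2000):
    """Return (previous_time, next_time, gap) for each silence longer than max_gap.

    `year` is a caller-supplied assumption because the log lines omit it.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            # The first 15 characters hold the timestamp, e.g. "Jun  9 04:25:35".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a syslog timestamp
        if previous is not None and stamp - previous > max_gap:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps


# Three lines taken from the excerpt above.
sample = [
    "Jun  9 08:25:36 DogMatix syslog-ng[3516]: STATS: dropped 0",
    "Jun  9 09:25:36 DogMatix syslog-ng[3516]: STATS: dropped 0",
    "Jun  9 12:44:12 DogMatix syslog-ng[2732]: syslog-ng version 1.6.8 starting",
]
for before, after, gap in find_gaps(sample):
    print(f"no log entries for {gap} after {before:%b %d %H:%M:%S}")
```

Run over the whole of /var/log/messages, this lists every stall window,
which can then be compared against other time-stamped records. It only
narrows down when the hang happened, not why.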

Question:
What should be done in such situations? What is a reliable method
to find the real reason behind such stalls?
Any ideas/hints/suggestions are most welcome. I would like to solve
this mystery rather than live with it.

--
Shreyansh

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ


You should check the most common causes first:

a) run memtest to rule out bad RAM;
b) check the temperature of the processor, the disks and any other sensor available;
c) check for bad PCI devices such as network cards. This is the hard part IMO, because nothing gets logged.
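
For the temperature check in (b), one option is to record readings
periodically and force each one to disk, so that the last value written
before a hang survives the hard reboot. The Python sketch below assumes
the kernel exposes sensors under /sys/class/thermal/thermal_zone*/temp
in millidegrees Celsius; if it does not, the glob pattern would need to
point at whatever sensor files the machine provides. The function names
and the log path in the usage note are hypothetical.

```python
import glob
import os
import time


def read_temperatures(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {sensor_file: degrees Celsius} for every readable sensor file."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as sensor:
                # Assumption: the file holds an integer in millidegrees Celsius.
                readings[path] = int(sensor.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor vanished or returned something unparsable
    return readings


def append_reading(log_path, readings):
    """Append one timestamped line and force it to disk before returning."""
    fields = " ".join(
        f"{os.path.basename(os.path.dirname(path))}={temp:.1f}C"
        for path, temp in readings.items()
    )
    with open(log_path, "a") as log:
        log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {fields}\n")
        log.flush()
        # A hard hang loses anything still sitting in the page cache.
        os.fsync(log.fileno())
```

Calling append_reading("/var/log/temps.log", read_temperatures()) from
cron once a minute would leave a trail that shows whether the
temperature was climbing just before each stall.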

In a nutshell, you have to rule out each piece of hardware, one by one, until you find the culprit.

--
Best Regards

Alan Menegotto

