PS = Pete Stieber
PS>> I have a dual opteron system that has been acting as
PS>> the worldly node for a small cluster of computers
PS>> since September, 2004. The machine is running the
PS>> latest x86_64 Fedora 10 kernel that I recently loaded
PS>> (April 2). The machine reboots without warning. I
PS>> can't find the cause in log files (maybe I'm not
PS>> looking in the correct log).
PS>>
PS>> I'm currently running memtest. If all of the tests
PS>> pass, could the community suggest other diagnostic
PS>> tasks or information I could post to help diagnose the
PS>> problem?
m> Have you tried going back to the previous kernel?
The machine is still running memtest (no errors so far). I can't go back
because I already removed the prior kernel, but I did see reboots with the
prior kernel as well.
BTW my current kernel is 2.6.27.21-170.2.56.fc10.x86_64.
Reboots indicated by information in /var/log/messages...
Sunday March 29 4:08
Tuesday March 31 7:02
Thursday April 2 18:27 Intentional reboot due to new kernel
Friday April 3 1:36
Sunday April 5 1:37
Sunday April 5 2:48
Sunday April 5 9:43
Sunday April 5 13:20 as I was typing this email
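For anyone who wants to look at the spacing between these reboots, here is a
minimal Python sketch that computes the gaps from the times listed above. The
year (2009) is an assumption, since the list only gives month, day, and time,
and the names REBOOTS, parse, intervals, and fmt are made up for illustration.

```python
from datetime import datetime

# Reboot times copied from the list above.
# ASSUMPTION: the year is 2009; the list itself does not state a year.
YEAR = 2009
REBOOTS = [
    ("March 29 4:08", ""),
    ("March 31 7:02", ""),
    ("April 2 18:27", "intentional, new kernel"),
    ("April 3 1:36", ""),
    ("April 5 1:37", ""),
    ("April 5 2:48", ""),
    ("April 5 9:43", ""),
    ("April 5 13:20", ""),
]


def parse(stamp, year=YEAR):
    """Turn 'April 5 13:20' into a naive datetime in the assumed year."""
    return datetime.strptime(f"{year} {stamp}", "%Y %B %d %H:%M")


def intervals(reboots):
    """Return the time elapsed between each pair of consecutive reboots."""
    times = [parse(stamp) for stamp, _ in reboots]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def fmt(delta):
    """Format a timedelta as hours and minutes, e.g. '3h37m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


if __name__ == "__main__":
    # Pair each gap with the reboot that ended it.
    for (stamp, note), gap in zip(REBOOTS[1:], intervals(REBOOTS)):
        suffix = f"  ({note})" if note else ""
        print(f"{fmt(gap):>7} of uptime before reboot at {stamp}{suffix}")
```

The gap that ends at the April 2 entry is the intentional reboot, so it should
be ignored when looking for a pattern.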
m> Did you check dmesg and /var/log/messages?
Yes. I can see reboots, but not the cause.
m> Does it boot normally and then just fail at some random
m> interval or is it consistently failing at the same point?
I have had top running during a few of the reboots. I have forced a
couple of them by starting my nightly build process. The linker/loader
has been running during some of the reboots...
top - 13:19:53 up 3:36, 6 users, load average: 1.27, 2.70, 2.32
Tasks: 138 total, 6 running, 132 sleeping, 0 stopped, 0 zombie
Cpu(s): 40.8%us, 13.8%sy, 0.0%ni, 42.5%id, 2.7%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 2060232k total, 1683996k used, 376236k free, 164484k buffers
Swap: 2031608k total, 56k used, 2031552k free, 1230796k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8878 pstieber  20   0 34552  25m 1096 R  7.6  1.3   0:00.23 ld
 8884 pstieber  20   0 48284  27m 1080 R  5.0  1.4   0:00.15 ld
    7 root      15  -5     0    0    0 S  0.3  0.0   0:00.17 ksoftirqd/1
22427 pstieber  20   0 14880 1208  872 R  0.3  0.1   0:03.49 top
    1 root      20   0  4096  876  616 S  0.0  0.0   0:00.71 init
Another instance
top - 06:55:13 up 17:34, 2 users, load average: 2.83, 2.59, 1.86
Tasks: 127 total, 2 running, 125 sleeping, 0 stopped, 0 zombie
Cpu(s): 45.1%us, 4.7%sy, 0.0%ni, 49.8%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2060232k total, 1763404k used, 296828k free, 177052k buffers
Swap: 2031608k total, 56k used, 2031552k free, 1271964k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5757 pstieber  20   0 79788  69m 1080 R 12.3  3.5   0:00.37 ld
    1 root      20   0  4096  876  616 S  0.0  0.0   0:00.68 init
    2 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd
I'm not sure ld is always running when the machine reboots.
m> Other things you may consider:
m> CPU type?
Motherboard: Tyan Thunder K8W (S2885ANRF)
CPUs: Dual Opteron 244 (1.8 GHz) processors
Memory: 2 GB (4 x 512 MB CT6472Y40B DDR PC3200 from Crucial)
m> temperature?
Is there a command to monitor this while running the OS?
m> potential hard drive issue?
I have 3 SATA drives running. It's been a long time since I have done this;
how does one manually run a disk check?
m> any new hardware attached or installed recently?
No
m> Notice any power surges or brownouts?
The machine is on a UPS that deals with this.
m> any other nodes having issues?
No, and they are not on UPSs. They also do not carry as large a workload.
The machine in question is used for nightly builds and regression tests.
I use distcc with the compute nodes to perform the builds.
The machine also runs samba to provide a network share to Windows users
and provides authentication using Windows domain accounts.
m> Recent power surge zapped a board, DSL modem,
m> and the surge protector.
I doubt this is the problem.
Memtest made it through the first pass of all tests successfully.
Thanks for the suggestions, especially considering my vague information.
Pete