Re: Bizarro system lockup/hang, possible kernel issue.

On Fri, 2006-06-09 at 14:40 +0900, Naoki wrote:
> Hi all,
> 
> We have a range of FC5 boxes (~4 with the problem) and from roughly
> three/four weeks ago we've been seeing intermittent lockups/hangs which
> often result in having to power-cycle. The boxes are currently running
> 2.6.16-1.2122_FC5 or 2.6.16-1.2122_FC5smp but despite slightly different
> hardware all display the same issue. The filesystem is ReiserFS.
> 
> Disk operations will stall, commands like 'free' will (might) return
> quickly, but 'uptime' could take 5 minutes to complete. The system is
> essentially unusable.
> 
> Fork rate and number of processes seem to skyrocket, although I'm not
> sure whether that is just the monitoring going strange because of the
> underlying problem.
> 
> I've finally managed to capture the issue in more detail and was hoping
> somebody had a clue. Here I run "strace -tt ls /var":
> 
> ...
> 12:01:28.034014 read(4, "root:x:0:root\nbin:x:1:root,bin,d"..., 131072)
> = 679
> 12:01:28.034163 close(4)                = 0
> 12:01:28.034267 munmap(0xb7cda000, 131072) = 0
> 12:01:28.034383 lstat64("/var/spool", {st_mode=S_IFDIR|0755,
> st_size=328, ...}) = 0
> 12:01:28.034543 getxattr("/var/spool", "system.posix_acl_access", 0x0,
> 0) = -1 EOPNOTSUPP (Operation not supported)
> 12:01:28.034679 lstat64("/var/tomcat4", {st_mode=S_IFDIR|0755,
> st_size=72, ...}) = 0
> 12:01:27.542319 getxattr("/var/tomcat4", "system.posix_acl_access", 0x0,
> 0) = -1 EOPNOTSUPP (Operation not supported)
> 12:01:27.542577 lstat64("/var/net-snmp", {st_mode=S_IFDIR|0700,
> st_size=80, ...}) = 0
> 12:01:27.542847 getxattr("/var/net-snmp", "system.posix_acl_access", 
> ...
> 
> You can see that on the lstat64 of /var/tomcat4 the timestamp jumps
> back; in actuality that syscall took about 40 seconds to complete.
> 
> I have done this a couple of times on different areas of the disk and
> the results are the same, lstat64 is hanging for extremely long periods.
> 
> Another side effect is that the system time becomes skewed during the
> hang on an lstat (other calls probably do this too, but I've not been
> able to trace enough).
> 
> On one box I've installed 2.6.16-1.2129_FC5 from FC5 testing to see if
> that helps, on another I've reverted all the way back to
> kernel-2.6.16-1.2108_FC4.i686.rpm.
> 
> I've run the smartctl utility to check the disk is ok and that has
> passed on all servers.  I'll wait and see the results of my kernel
> updates/regressions.
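
For anyone who wants to check a longer trace for the same symptom, here is a minimal Python sketch (not part of the original report) that scans "strace -tt" output and flags every line whose timestamp is earlier than the previous one. The names TS_RE and find_backward_jumps are hypothetical, the sample lines are copied from the trace quoted above, and the sketch assumes the trace does not cross midnight.

```python
import re
from datetime import datetime

# Leading "HH:MM:SS.microseconds" stamp as printed by `strace -tt`.
TS_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def find_backward_jumps(lines):
    """Return (prev_time, cur_time, delta_seconds, text) for every line
    whose timestamp is earlier than the preceding timestamped line."""
    jumps = []
    prev = None
    for line in lines:
        # Drop mail quoting ("> ") so quoted traces can be pasted directly.
        m = TS_RE.match(line.lstrip("> ").rstrip())
        if not m:
            continue  # wrapped continuation line or an elided "..." line
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        if prev is not None and ts < prev:
            delta = (ts - prev).total_seconds()
            jumps.append((prev.time(), ts.time(), delta, m.group(2)))
        prev = ts
    return jumps


# Lines taken from the trace in this thread.
sample = [
    '12:01:28.034543 getxattr("/var/spool", "system.posix_acl_access", 0x0,',
    '0) = -1 EOPNOTSUPP (Operation not supported)',
    '12:01:28.034679 lstat64("/var/tomcat4", {st_mode=S_IFDIR|0755,',
    'st_size=72, ...}) = 0',
    '12:01:27.542319 getxattr("/var/tomcat4", "system.posix_acl_access", 0x0,',
    '0) = -1 EOPNOTSUPP (Operation not supported)',
]

if __name__ == "__main__":
    for prev_t, cur_t, delta, text in find_backward_jumps(sample):
        print(f"{prev_t} -> {cur_t} ({delta:+.6f}s) at: {text}")
```

The flagged line is the first one stamped after the clock moved back, so the call worth looking at is the one immediately before it (here the lstat64 of /var/tomcat4).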

This has now happened on another server: not one of those mentioned
above with replaced kernels, but one also running 2.6.16-1.2122_FC5.

# date; ls -l /var ; date
Fri Jun  9 18:27:36 JST 2006
total 3
drwxr-xr-x 10 root root   264 May 17 11:08 cache
drwxr-xr-x  3 root root    72 Feb 12 02:16 db
drwxr-xr-x  3 root root    72 Feb 12 02:16 empty
drwxr-xr-x  7 vcp  vcp    200 Dec 15 12:12 jsp
drwxr-xr-x 17 root root   480 May 17 11:20 lib
drwxr-xr-x  2 root root    48 Feb 12 02:16 local
drwxrwxr-x  6 root lock   144 Jun  9 05:12 lock
drwxr-xr-x  9 root root  1896 Jun  4 05:24 log
lrwxrwxrwx  1 root root    10 May 17 10:55 mail -> spool/mail
drwxr-x---  4 root named   96 Apr 19 23:12 named
drwx------  2 root root    80 May 27 10:10 net-snmp
drwxr-xr-x  2 root root    48 Feb 12 02:16 nis
drwxr-xr-x  2 root root    48 Feb 12 02:16 opt
drwxr-xr-x  2 root root    48 Feb 12 02:16 preserve
drwxr-xr-x 15 root root   696 Jun  9 05:12 run
drwxr-xr-x 13 root root   328 Feb 12 02:16 spool
drwxrwxrwt  2 root root    48 Jun  9 05:12 tmp
drwxr-xr-x  3 root root    72 Nov 18  2003 tomcat4
drwxr-xr-x  6 root root   144 Feb 12 08:12 www
drwxr-xr-x  3 root root   128 May 17 11:11 yp
Fri Jun  9 18:27:36 JST 2006

Notice that the time didn't change, even though the system hung for
about 30 seconds between printing the first date/time and printing the
'ls' output.
I then kept running the 'date' command, and you can see what's
happening:

[root@banner8 ~]# date
Fri Jun  9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:38 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:39 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:39 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:38 JST 2006

Anybody seen anything like _that_ before? It is running ntpd.
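
To make the pattern easier to spot than by running 'date' by hand, here is a rough Python sketch that reports each point where a series of wall-clock readings steps backwards. The function names are hypothetical; the example readings are the seconds fields from the interactive 'date' output above. The live sampler also records time.monotonic() next to time.time(), on the assumption (not verified on these boxes) that the monotonic clock is unaffected by whatever is stepping the wall clock.

```python
import time


def find_regressions(readings):
    """Return (index, step) for each reading that is lower than the one
    before it, i.e. each point where the clock appears to run backwards."""
    regressions = []
    for i in range(1, len(readings)):
        step = readings[i] - readings[i - 1]
        if step < 0:
            regressions.append((i, step))
    return regressions


def sample_clocks(count=10, interval=0.5):
    """Take `count` paired (monotonic, wall) readings, `interval` seconds apart."""
    samples = []
    for _ in range(count):
        samples.append((time.monotonic(), time.time()))
        time.sleep(interval)
    return samples


# Seconds fields from the repeated 'date' runs in this post.
date_seconds = [37, 38, 39, 36, 36, 37, 36, 37, 39, 36, 37, 38]

if __name__ == "__main__":
    print("from the post:", find_regressions(date_seconds))
    live = sample_clocks()
    wall = [w for _mono, w in live]
    print("live wall-clock regressions:", find_regressions(wall))
```

On the readings from this post it reports three backward steps (39 to 36, 37 to 36, and 39 to 36 again).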
