Re: Bizzaro system lockup/hang, possible kernel issue.

"Bart Couvreur" <dencouf@xxxxxxxxx> · Fri, 9 Jun 2006 12:26:04 +0200

2006/6/9, Naoki <naoki@xxxxxxxxxxxxxxxxx>:
On Fri, 2006-06-09 at 14:40 +0900, Naoki wrote:
> Hi all,
>
> We have a range of FC5 boxes (~4 with the problem) and from roughly
> three/four weeks ago we've been seeing intermittent lockups/hangs which
> often result in having to power-cycle. The boxes are currently running
> 2.6.16-1.2122_FC5 or 2.6.16-1.2122_FC5smp but despite slightly different
> hardware all display the same issue. The filesystem is ReiserFS.
>
> Disk operations will stall, commands like 'free' will (might) return
> quickly, but 'uptime' could take 5 minutes to complete. The system is
> essentially unusable.
>
> Fork rate and # of procs seem to skyrocket; however, I'm not sure if this
> is just monitoring going strange because of the underlying problem.
>
> I've finally managed to capture the issue in more detail and was hoping
> somebody had a clue. Here I perform an "strace -tt ls /var" :
>
> ...
> 12:01:28.034014 read(4, "root:x:0:root\nbin:x:1:root,bin,d"..., 131072)
> = 679
> 12:01:28.034163 close(4)                = 0
> 12:01:28.034267 munmap(0xb7cda000, 131072) = 0
> 12:01:28.034383 lstat64("/var/spool", {st_mode=S_IFDIR|0755,
> st_size=328, ...}) = 0
> 12:01:28.034543 getxattr("/var/spool", "system.posix_acl_access", 0x0,
> 0) = -1 EOPNOTSUPP (Operation not supported)
> 12:01:28.034679 lstat64("/var/tomcat4", {st_mode=S_IFDIR|0755,
> st_size=72, ...}) = 0
> 12:01:27.542319 getxattr("/var/tomcat4", "system.posix_acl_access", 0x0,
> 0) = -1 EOPNOTSUPP (Operation not supported)
> 12:01:27.542577 lstat64("/var/net-snmp", {st_mode=S_IFDIR|0700,
> st_size=80, ...}) = 0
> 12:01:27.542847 getxattr("/var/net-snmp", "system.posix_acl_access",
> ...
>
> You can see that on the lstat64 to /var/tomcat4 the timestamp jumps
> back; in actuality that syscall took about 40 seconds to complete.
>
> I have done this a couple of times on different areas of the disk and
> the results are the same, lstat64 is hanging for extremely long periods.
>
> Another side effect of this is that the system time becomes skewed during
> the hang on an lstat (other calls probably do this too, but I've not been
> able to trace enough).
>
> On one box I've installed 2.6.16-1.2129_FC5 from FC5 testing to see if
> that helps, on another I've reverted all the way back to
> kernel-2.6.16-1.2108_FC4.i686.rpm.
>
> I've run the smartctl utility to check the disk is ok and that has
> passed on all servers.  I'll wait and see the results of my kernel
> updates/regressions.
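
The backwards timestamp step in the strace output quoted above can be spotted mechanically rather than by eye. The following is only a sketch, assuming the trace was taken with "strace -tt" (HH:MM:SS.microseconds at the start of each line) and does not cross midnight; the function name find_backward_jumps and the SAMPLE list are made up for illustration, with SAMPLE holding the first physical line of a few calls from the trace.

```python
import re
from datetime import datetime

# "strace -tt" starts each syscall line with HH:MM:SS.microseconds.
STAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def find_backward_jumps(lines):
    """Return (previous_call, current_call, seconds_back) for every line
    whose timestamp is earlier than the timestamp of the line before it.

    Assumption: the trace does not cross midnight, since -tt prints no date.
    """
    jumps = []
    prev_time = None
    prev_call = None
    for line in lines:
        # Tolerate mail quoting ("> ") in front of the trace lines.
        match = STAMP.match(line.lstrip("> "))
        if not match:
            continue  # wrapped continuation line or "..." marker
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        call = match.group(2)
        if prev_time is not None and stamp < prev_time:
            jumps.append((prev_call, call, (prev_time - stamp).total_seconds()))
        prev_time, prev_call = stamp, call
    return jumps


# First physical line of four calls from the quoted trace.
SAMPLE = [
    '12:01:28.034543 getxattr("/var/spool", "system.posix_acl_access", 0x0,',
    '12:01:28.034679 lstat64("/var/tomcat4", {st_mode=S_IFDIR|0755,',
    '12:01:27.542319 getxattr("/var/tomcat4", "system.posix_acl_access", 0x0,',
    '12:01:27.542577 lstat64("/var/net-snmp", {st_mode=S_IFDIR|0700,',
]

if __name__ == "__main__":
    for before, after, seconds in find_backward_jumps(SAMPLE):
        print(f"clock stepped back {seconds:.6f}s between:\n  {before}\n  {after}")
```

On the sample lines this flags a single step, between the lstat64 on /var/tomcat4 and the getxattr that follows it. It only shows that the recorded time went backwards; it says nothing about how long the call really took.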

It happened to another server. Not one of the above-mentioned boxes with
replaced kernels, but this one is also running 2.6.16-1.2122_FC5.

# date; ls -l /var ; date
Fri Jun  9 18:27:36 JST 2006
total 3
drwxr-xr-x 10 root root   264 May 17 11:08 cache
drwxr-xr-x  3 root root    72 Feb 12 02:16 db
drwxr-xr-x  3 root root    72 Feb 12 02:16 empty
drwxr-xr-x  7 vcp  vcp    200 Dec 15 12:12 jsp
drwxr-xr-x 17 root root   480 May 17 11:20 lib
drwxr-xr-x  2 root root    48 Feb 12 02:16 local
drwxrwxr-x  6 root lock   144 Jun  9 05:12 lock
drwxr-xr-x  9 root root  1896 Jun  4 05:24 log
lrwxrwxrwx  1 root root    10 May 17 10:55 mail -> spool/mail
drwxr-x---  4 root named   96 Apr 19 23:12 named
drwx------  2 root root    80 May 27 10:10 net-snmp
drwxr-xr-x  2 root root    48 Feb 12 02:16 nis
drwxr-xr-x  2 root root    48 Feb 12 02:16 opt
drwxr-xr-x  2 root root    48 Feb 12 02:16 preserve
drwxr-xr-x 15 root root   696 Jun  9 05:12 run
drwxr-xr-x 13 root root   328 Feb 12 02:16 spool
drwxrwxrwt  2 root root    48 Jun  9 05:12 tmp
drwxr-xr-x  3 root root    72 Nov 18  2003 tomcat4
drwxr-xr-x  6 root root   144 Feb 12 08:12 www
drwxr-xr-x  3 root root   128 May 17 11:11 yp
Fri Jun  9 18:27:36 JST 2006

Notice the time didn't change, but immediately after it printed the
first date/time it then hung for 30 seconds before the 'ls' output was
printed.
Then I kept running the 'date' command and you can see what's
happening :

[snip]

Yep, I had this last night while running yum: I started it at 11:45 pm
and when I looked at my box this morning yum had stalled and the time
was 0:45 am (the real time was 8:30 am). And ntpd is also running.
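
To put a number on that drift, here is a small worked calculation. It is only a sketch: the calendar dates are an assumption (the night of 8 to 9 June 2006, going by the date of this mail), and the function name clock_lag is made up for illustration.

```python
from datetime import datetime


def clock_lag(system_reading, real_time):
    """How far the system clock is behind the real time."""
    return real_time - system_reading


# Times from the report above; the dates are assumed, not stated.
started = datetime(2006, 6, 8, 23, 45)         # yum started, 11:45 pm
system_reading = datetime(2006, 6, 9, 0, 45)   # what the box showed
real_time = datetime(2006, 6, 9, 8, 30)        # actual time in the morning

if __name__ == "__main__":
    print("system clock advanced:", system_reading - started)
    print("real time elapsed:    ", real_time - started)
    print("clock is behind by:   ", clock_lag(system_reading, real_time))
```

So the box's clock moved forward one hour while 8 hours 45 minutes actually passed, leaving it 7 hours 45 minutes behind.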

I suspect this has something to do with the kernel (I'm running
2.6.16-1.2122_FC5), but I'm not sure. I've looked through all the logs,
but they don't mention a thing.

I hope this gets solved soon; it has been quite irritating.
Bart

--
fedora-list mailing list
fedora-list@xxxxxxxxxx
To unsubscribe: https://www.redhat.com/mailman/listinfo/fedora-list