At first I thought this was a GFS problem, but I am not really sure, so I
figured I would post here to see if anybody has any ideas.

Yesterday I upgraded a six node GFS cluster from RHEL 3 update 6 to update 8.
Today the load averages were up, and I found 4 nodes with processes stuck in
uninterruptible sleep states; each stuck process raises that node's load
average by 1. The processes are pine on one node and imapd on the others. I
have not been able to talk to the user who owns the processes yet; my guess
is that one session hung, so the user tried others.

I tried an strace -p on several of the processes, but it showed nothing: the
processes are not making any new system calls, and strace does not report the
system call a process is already blocked in.

According to lsof, all of the processes have this file in common:
/u/mx/ci/XXXXXXX/.mailbox. This is the lsof output for .mailbox with the host
name prepended:

HOSTNAME    COMMAND  PID   USER    FD TYPE DEVICE SIZE     NODE    NAME
imagine     pine     6662  XXXXXXX 5u REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
imagine     pine     27846 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd    7658  XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd    20623 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd    21383 XXXXXXX 5u REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd    30505 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd    30530 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd    32023 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd    9765  XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd    13182 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd    13451 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd    13851 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd    14177 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd    23067 XXXXXXX 5u REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd    23791 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd    11913 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd    12377 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd    12776 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd    13137 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd    21383 XXXXXXX 5u REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd    21385 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd    25077 XXXXXXX 4r REG  254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox

Other files were removed to shorten the listing.

Locking is handled with .lock files. In my haste to get rid of some of the
processes, I tried to kill them and delete the lock file; I should have run
lsof first. There is no evidence of .lock access in any of the lsof output.
I am not sure what lsof reports for a process holding a file open after it
has been unlinked; I would guess an open file without a name.

Looking at the GFS mailing list archives, there was an issue with PHP leaving
processes in an uninterruptible sleep state, but those were using flock. From
the looks of the FD column, none of these processes have called flock on the
.mailbox.

ps output with host name prepended:

HOSTNAME    USER    PID   %CPU %MEM VSZ  RSS  TTY STAT START TIME COMMAND
imagine     XXXXXXX 27846 0.0  0.0  8552 1704 ?   D    12:36 0:00 pine
imagine     XXXXXXX 6662  0.0  0.0  8548 1712 ?   D    Aug05 0:00 pine
incantation XXXXXXX 21383 0.0  0.0  7032 3060 ?   D    Aug05 0:00 imapd
incantation XXXXXXX 30505 0.0  0.0  7048 3044 ?   D    Aug05 0:00 imapd
incantation XXXXXXX 30530 0.0  0.0  7052 3028 ?   D    Aug05 0:00 imapd
incantation XXXXXXX 32023 0.0  0.0  7028 3052 ?   D    Aug05 0:00 imapd
incantation XXXXXXX 7658  0.0  0.0  7028 3052 ?   D    01:47 0:00 imapd
incantation XXXXXXX 20623 0.0  0.0  7028 3052 ?   D    16:15 0:00 imapd
inception   XXXXXXX 23791 0.0  0.0  7032 3056 ?   D    Aug05 0:00 imapd
inception   XXXXXXX 9765  0.0  0.0  7028 3052 ?   D    12:36 0:00 imapd
inception   XXXXXXX 13182 0.0  0.0  7040 3028 ?   D    14:02 0:00 imapd
inception   XXXXXXX 13451 0.0  0.0  7036 3028 ?   D    14:08 0:00 imapd
inception   XXXXXXX 13851 0.0  0.0  7028 3028 ?   D    14:18 0:00 imapd
inception   XXXXXXX 14177 0.0  0.0  7040 3028 ?   D    14:28 0:00 imapd
inception   XXXXXXX 23067 0.0  0.0  6936 2540 ?   D    Aug05 0:00 imapd
inspire     XXXXXXX 21383 0.0  0.0  7028 3144 ?   D    Aug05 0:00 imapd
inspire     XXXXXXX 21385 0.0  0.0  6948 3128 ?   D    Aug05 0:00 imapd
inspire     XXXXXXX 25077 0.0  0.0  7036 3028 ?   D    Aug05 0:00 imapd
inspire     XXXXXXX 11913 0.0  0.0  7028 3024 ?   D    14:03 0:00 imapd
inspire     XXXXXXX 12377 0.0  0.0  7032 3028 ?   D    14:13 0:00 imapd
inspire     XXXXXXX 12776 0.0  0.0  7052 3032 ?   D    14:23 0:00 imapd
inspire     XXXXXXX 13137 0.0  0.0  7036 3032 ?   D    14:33 0:00 imapd

One more annoying detail: WCHAN from ps is set to "end" for all of these
processes. Output from "ps -o pid,tt,user,fname,wchan,lstart --sort
start_time -U XXXXXXX" on each host:

::::::::::::::
imagine
::::::::::::::
  PID TT USER    COMMAND WCHAN STARTED
 6662 ?  XXXXXXX pine    end   Sat Aug  5 16:14:53 2006
27846 ?  XXXXXXX pine    end   Sun Aug  6 12:36:31 2006
::::::::::::::
incantation
::::::::::::::
  PID TT USER    COMMAND WCHAN STARTED
21383 ?  XXXXXXX imapd   end   Sat Aug  5 16:12:44 2006
30505 ?  XXXXXXX imapd   end   Sat Aug  5 20:46:59 2006
30530 ?  XXXXXXX imapd   end   Sat Aug  5 20:48:14 2006
32023 ?  XXXXXXX imapd   end   Sat Aug  5 21:30:33 2006
 7658 ?  XXXXXXX imapd   end   Sun Aug  6 01:47:25 2006
20623 ?  XXXXXXX imapd   end   Sun Aug  6 16:15:14 2006
::::::::::::::
inception
::::::::::::::
  PID TT USER    COMMAND WCHAN STARTED
23067 ?  XXXXXXX imapd   end   Sat Aug  5 16:13:24 2006
23791 ?  XXXXXXX imapd   end   Sat Aug  5 16:31:40 2006
 9765 ?  XXXXXXX imapd   end   Sun Aug  6 12:36:10 2006
13182 ?  XXXXXXX imapd   end   Sun Aug  6 14:02:15 2006
13451 ?  XXXXXXX imapd   end   Sun Aug  6 14:08:36 2006
13851 ?  XXXXXXX imapd   end   Sun Aug  6 14:18:39 2006
14177 ?  XXXXXXX imapd   end   Sun Aug  6 14:28:39 2006
::::::::::::::
inspire
::::::::::::::
  PID TT USER    COMMAND WCHAN STARTED
21383 ?  XXXXXXX imapd   end   Sat Aug  5 16:00:35 2006
21385 ?  XXXXXXX imapd   end   Sat Aug  5 16:00:38 2006
25077 ?  XXXXXXX imapd   end   Sat Aug  5 17:47:25 2006
11913 ?  XXXXXXX imapd   end   Sun Aug  6 14:03:33 2006
12377 ?  XXXXXXX imapd   end   Sun Aug  6 14:13:36 2006
12776 ?  XXXXXXX imapd   end   Sun Aug  6 14:23:38 2006
13137 ?  XXXXXXX imapd   end   Sun Aug  6 14:33:39 2006

I am wondering whether this is even a GFS problem. What does "end" mean in
the WCHAN column? This set of servers had been running for almost 8 months on
Enterprise 3 update 6 without any problems. I also find it a bit odd that the
user managed to start 2 imapd processes on inspire roughly 3 seconds apart;
that would be difficult with most email user agents. In any case, the
processes should not be stuck in an uninterruptible sleep state.

Any help would be appreciated.

Thank you

Matt Brookover
mbrookov@xxxxxxxxx

--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list
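
P.S. On the question of what lsof shows for a deleted-but-still-open file: on
Linux the kernel keeps the old name in /proc/<pid>/fd and tags it "(deleted)",
and lsof (which reads /proc) reports the same, so a removed .lock file still
held open by a process would have shown up in the lsof output by name rather
than as a nameless open file. A quick sketch to see this (the path is made up
for the demo):

```shell
# Open a file on fd 3, unlink it while it is still open, then look at
# what the kernel reports for that descriptor.
exec 3> /tmp/demo.lock        # open the file on fd 3
rm /tmp/demo.lock             # unlink it while fd 3 is still open
readlink /proc/$$/fd/3        # shows something like "/tmp/demo.lock (deleted)"
exec 3>&-                     # close the descriptor
```

So the absence of any .lock entries, deleted or not, in the lsof output above
suggests none of the stuck processes were holding the lock file open.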