processes in uninterruptible state, and high load averages

"Matthew B. Brookover" <mbrookov@xxxxxxxxx> · Sun, 06 Aug 2006 23:22:15 -0600

At first I thought this was a GFS problem, but I am not really sure.  I
figured I would try posting here to see if anybody has any ideas.

Yesterday, I upgraded a six node GFS cluster from RHEL 3 update 6 to
update 8.  Today, the load averages were up and I found 4 nodes with
processes stuck in uninterruptible sleep states.  Each process is
running the load average up by 1 on that node.  The processes are pine
on one node and imapd on the others.  I have not been able to talk to
the user that owns the processes yet.  I am guessing that the user had
one session hang so they tried others.  I tried an strace -p on several
of the processes, but it showed nothing because the processes were not
doing anything and strace does not report the current system call.
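
Since strace shows nothing for these processes, one alternative is to
read the process state straight from /proc.  The following is only a
minimal sketch, assuming a Linux /proc filesystem and a Python 3
interpreter on the node (neither is part of the setup described above);
the function names parse_stat and uninterruptible_tasks are hypothetical.

```python
import os

def parse_stat(line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state

def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line could not be parsed
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Run on each node, this lists the same processes that ps shows with a
STAT of D, without having to attach to them.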

According to lsof, all of the processes have this file in
common: /u/mx/ci/XXXXXXX/.mailbox.

This is the output of lsof for .mailbox with the host name prepended:

HOSTNAME    COMMAND   PID    USER   FD   TYPE     DEVICE     SIZE    NODE NAME
imagine     pine     6662 XXXXXXX    5u   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
imagine     pine    27846 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd    7658 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd   20623 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd   21383 XXXXXXX    5u   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd   30505 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd   30530 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
incantation imapd   32023 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd    9765 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd   13182 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd   13451 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd   13851 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd   14177 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd   23067 XXXXXXX    5u   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inception   imapd   23791 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd   11913 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd   12377 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd   12776 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd   13137 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd   21383 XXXXXXX    5u   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd   21385 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox
inspire     imapd   25077 XXXXXXX    4r   REG     254,77 14906629 5669290 /u/mx/ci/XXXXXXX/.mailbox

Other files were removed to shorten the listing.

Locking is handled with .lock files.  In my haste to get rid of some of
the processes, I tried to kill the processes and delete the lock file.
I should have done the lsof first.  There is no evidence of .lock access
in any of the lsof output.  I am not sure what lsof reports for processes
that hold open files that have been unlinked; I would expect to see an
open file without a name.
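
On the question of unlinked files: on Linux, a descriptor whose file has
been unlinked still appears under /proc/<pid>/fd, and the symlink target
carries a " (deleted)" suffix, which lsof normally shows in the NAME
column as well.  Whether that holds on this kernel and on GFS is an
assumption.  A small sketch of checking it directly (Python 3 assumed;
deleted_open_files is a hypothetical helper name):

```python
import os

def deleted_open_files(pid="self", proc_root="/proc"):
    """Return {fd: link target} for descriptors whose file was unlinked."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        # The kernel appends " (deleted)" to the path of an unlinked file.
        if target.endswith(" (deleted)"):
            result[int(name)] = target
    return result

if __name__ == "__main__":
    for fd, target in sorted(deleted_open_files().items()):
        print(fd, target)
```

Passing a PID instead of the default "self" inspects another process,
subject to the usual /proc permissions.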

Looking at the GFS mailing list archives, there was an issue with PHP
causing processes to be left in an uninterruptible sleep state, but they
were using flock.  From the looks of the FD column, none of these
processes have called flock on the .mailbox.
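
The reading of the FD column above relies on the lsof convention that
the descriptor number is followed by a mode character (r, w, u) and then,
if a lock is held, a lock character such as R or W.  A minimal sketch of
that check, assuming Python 3 (parse_fd is a hypothetical helper, and
locks held through a cluster lock manager may not be visible to lsof at
all):

```python
import re

# Lock characters lsof may append after the mode character.
LOCK_MEANINGS = {
    "r": "read lock on part of the file",
    "R": "read lock on the entire file",
    "w": "write lock on part of the file",
    "W": "write lock on the entire file",
    "u": "read and write lock of any length",
    "U": "lock of unknown type",
}

FD_PATTERN = re.compile(r"^(\d+)([rwu\- ]?)([NrRwWuUxX ]?)$")

def parse_fd(field):
    """Split an lsof FD field into (number, mode, lock).

    Returns None for non-numeric entries such as cwd, txt or mem.
    mode and lock are None when lsof printed nothing for them.
    """
    match = FD_PATTERN.match(field)
    if match is None:
        return None
    number, mode, lock = match.groups()
    return int(number), mode.strip() or None, lock.strip() or None

if __name__ == "__main__":
    for field in ("4r", "5u", "5uW"):
        number, mode, lock = parse_fd(field)
        print(field, "->", LOCK_MEANINGS.get(lock, "no lock reported"))
```

Every FD value in the listing is either 4r or 5u, so the lock slot is
empty in all of them.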

ps output with host name prepended:

Host name    USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
imagine      XXXXXXX  27846  0.0  0.0  8552 1704 ?        D    12:36   0:00 pine
imagine      XXXXXXX   6662  0.0  0.0  8548 1712 ?        D    Aug05   0:00 pine
incantation  XXXXXXX  21383  0.0  0.0  7032 3060 ?        D    Aug05   0:00 imapd
incantation  XXXXXXX  30505  0.0  0.0  7048 3044 ?        D    Aug05   0:00 imapd
incantation  XXXXXXX  30530  0.0  0.0  7052 3028 ?        D    Aug05   0:00 imapd
incantation  XXXXXXX  32023  0.0  0.0  7028 3052 ?        D    Aug05   0:00 imapd
incantation  XXXXXXX   7658  0.0  0.0  7028 3052 ?        D    01:47   0:00 imapd
incantation  XXXXXXX  20623  0.0  0.0  7028 3052 ?        D    16:15   0:00 imapd
inception    XXXXXXX  23791  0.0  0.0  7032 3056 ?        D    Aug05   0:00 imapd
inception    XXXXXXX   9765  0.0  0.0  7028 3052 ?        D    12:36   0:00 imapd
inception    XXXXXXX  13182  0.0  0.0  7040 3028 ?        D    14:02   0:00 imapd
inception    XXXXXXX  13451  0.0  0.0  7036 3028 ?        D    14:08   0:00 imapd
inception    XXXXXXX  13851  0.0  0.0  7028 3028 ?        D    14:18   0:00 imapd
inception    XXXXXXX  14177  0.0  0.0  7040 3028 ?        D    14:28   0:00 imapd
inception    XXXXXXX  23067  0.0  0.0  6936 2540 ?        D    Aug05   0:00 imapd
inspire      XXXXXXX  21383  0.0  0.0  7028 3144 ?        D    Aug05   0:00 imapd
inspire      XXXXXXX  21385  0.0  0.0  6948 3128 ?        D    Aug05   0:00 imapd
inspire      XXXXXXX  25077  0.0  0.0  7036 3028 ?        D    Aug05   0:00 imapd
inspire      XXXXXXX  11913  0.0  0.0  7028 3024 ?        D    14:03   0:00 imapd
inspire      XXXXXXX  12377  0.0  0.0  7032 3028 ?        D    14:13   0:00 imapd
inspire      XXXXXXX  12776  0.0  0.0  7052 3032 ?        D    14:23   0:00 imapd
inspire      XXXXXXX  13137  0.0  0.0  7036 3032 ?        D    14:33   0:00 imapd

One more annoying detail:

WCHAN from ps is set to "end" for all processes.

Output from "ps -o pid,tt,user,fname,wchan,lstart --sort start_time -U
XXXXXXX" on each host:

::::::::::::::
imagine
::::::::::::::
  PID TT       USER     COMMAND  WCHAN                   STARTED
 6662 ?        XXXXXXX  pine     end    Sat Aug  5 16:14:53 2006
27846 ?        XXXXXXX  pine     end    Sun Aug  6 12:36:31 2006
::::::::::::::
incantation
::::::::::::::
  PID TT       USER     COMMAND  WCHAN                   STARTED
21383 ?        XXXXXXX  imapd    end    Sat Aug  5 16:12:44 2006
30505 ?        XXXXXXX  imapd    end    Sat Aug  5 20:46:59 2006
30530 ?        XXXXXXX  imapd    end    Sat Aug  5 20:48:14 2006
32023 ?        XXXXXXX  imapd    end    Sat Aug  5 21:30:33 2006
 7658 ?        XXXXXXX  imapd    end    Sun Aug  6 01:47:25 2006
20623 ?        XXXXXXX  imapd    end    Sun Aug  6 16:15:14 2006
::::::::::::::
inception
::::::::::::::
  PID TT       USER     COMMAND  WCHAN                   STARTED
23067 ?        XXXXXXX  imapd    end    Sat Aug  5 16:13:24 2006
23791 ?        XXXXXXX  imapd    end    Sat Aug  5 16:31:40 2006
 9765 ?        XXXXXXX  imapd    end    Sun Aug  6 12:36:10 2006
13182 ?        XXXXXXX  imapd    end    Sun Aug  6 14:02:15 2006
13451 ?        XXXXXXX  imapd    end    Sun Aug  6 14:08:36 2006
13851 ?        XXXXXXX  imapd    end    Sun Aug  6 14:18:39 2006
14177 ?        XXXXXXX  imapd    end    Sun Aug  6 14:28:39 2006
::::::::::::::
inspire
::::::::::::::
  PID TT       USER     COMMAND  WCHAN                   STARTED
21383 ?        XXXXXXX  imapd    end    Sat Aug  5 16:00:35 2006
21385 ?        XXXXXXX  imapd    end    Sat Aug  5 16:00:38 2006
25077 ?        XXXXXXX  imapd    end    Sat Aug  5 17:47:25 2006
11913 ?        XXXXXXX  imapd    end    Sun Aug  6 14:03:33 2006
12377 ?        XXXXXXX  imapd    end    Sun Aug  6 14:13:36 2006
12776 ?        XXXXXXX  imapd    end    Sun Aug  6 14:23:38 2006
13137 ?        XXXXXXX  imapd    end    Sun Aug  6 14:33:39 2006

I am wondering whether this is even a GFS problem.  What does "end" mean
in the WCHAN column?

This set of servers has been running for almost 8 months on Enterprise 3
update 6 without any problems.

I find it a bit odd that the user managed to start 2 imapd processes on
inspire roughly 3 seconds apart.  That would be difficult with most
email user agents.  In any case, the processes should not be stuck in an
uninterruptible sleep state.
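
The gap between start times can be computed from the lstart column
rather than read off by eye.  This is only a sketch, assuming Python 3
and the default C/English locale for day and month names; parse_lstart
and start_gaps are hypothetical helper names, and the sample values are
the inspire entries from the ps output above.

```python
from datetime import datetime

LSTART_FORMAT = "%a %b %d %H:%M:%S %Y"

def parse_lstart(text):
    """Parse a ps lstart value such as 'Sat Aug  5 16:00:35 2006'."""
    # Collapse the padding ps inserts before single-digit days.
    return datetime.strptime(" ".join(text.split()), LSTART_FORMAT)

def start_gaps(starts):
    """Return the gaps in seconds between consecutive start times."""
    times = sorted(parse_lstart(s) for s in starts)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]

if __name__ == "__main__":
    inspire = [
        "Sat Aug  5 16:00:35 2006",
        "Sat Aug  5 16:00:38 2006",
        "Sat Aug  5 17:47:25 2006",
        "Sun Aug  6 14:03:33 2006",
        "Sun Aug  6 14:13:36 2006",
        "Sun Aug  6 14:23:38 2006",
        "Sun Aug  6 14:33:39 2006",
    ]
    for gap in start_gaps(inspire):
        print(gap)
```

The first gap printed is 3.0 seconds, matching the two imapd processes
mentioned above.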

Any help would be appreciated.

Thank you,

Matt Brookover
mbrookov@xxxxxxxxx
