Hi Dave,
On 11/02/2009 12:11 PM, David Teigland wrote:
On Fri, Oct 30, 2009 at 07:27:23PM -0400, Allen Belletti wrote:
I'll notice the problem when the load average starts rising. It's
always tied to "stuck" processes, and I believe always tied to IMAP
clients (I'm running Dovecot.) It seems like a file belonging to user
"x" (in this case, "jforrest") will become locked in some way, such that
every IMAP process tied to that user will get stuck on the same thing.
Over time, as the user keeps trying to read that file, more and more
processes accumulate. They're always in state "D" (uninterruptible
sleep), and always on "dlm_posix_lock" according to WCHAN. The only way
I'm able to get out of this state is to reboot. If I let it persist for
too long, I/O generally stops entirely.
Next time, try to collect all the following information as soon as you can
after the first process gets stuck:
- ps showing pid of stuck/"D" process(es) and WCHAN
- which file they are stuck trying to lock
(and the inode number of it, you may need to wait until after the
reboot to use ls -li on the file to get the inode number)
- group_tool dump plocks <fsname> from all the nodes
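
As a rough illustration of the first two items, here is a minimal Python sketch that filters ps output for processes in state "D" whose WCHAN starts with "dlm_posix", and reads a file's inode number the way ls -li would. The function names (parse_stuck, collect_stuck, inode_of) and the exact ps column selection are assumptions made for this example, not something from the thread. It does not run group_tool; that dump still has to be taken by hand on every node.

```python
import os
import subprocess


def parse_stuck(ps_output, wchan_prefix="dlm_posix"):
    """Return (pid, stat, wchan, command) for D-state processes.

    Expects the output of: ps -eo pid,stat,wchan:32,args
    Only rows whose WCHAN starts with wchan_prefix are kept.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.strip().split(None, 3)
        if len(fields) < 4:
            continue
        pid, stat, wchan, cmd = fields
        if stat.startswith("D") and wchan.startswith(wchan_prefix):
            stuck.append((int(pid), stat, wchan, cmd))
    return stuck


def collect_stuck():
    """Run ps once and return the currently stuck processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,wchan:32,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_stuck(out)


def inode_of(path):
    """Inode number of path (same number that ls -li prints)."""
    return os.stat(path).st_ino


if __name__ == "__main__":
    for pid, stat, wchan, cmd in collect_stuck():
        print(pid, stat, wchan, cmd)
```

Running this from cron every minute or so and saving the output with a timestamp would make it easier to catch the first process that gets stuck, before others pile up behind it. Note that calling inode_of on a file on the affected filesystem may itself block while the problem is active, which is why the inode lookup may have to wait until after the reboot.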
I'm guessing that dovecot does some "unusual" combinations of locking,
closing, renaming, unlinking files. Those combinations are especially
prone to races and bugs that cause posix lock state to get off.
I'll collect all of this as soon as I catch the problem in action
again. Do you know how I might go about determining which file is
involved? I can find the user because it's associated with the
particular "imap" process, but haven't been able to figure out what's
being locked.
Thanks,
Allen
--
Allen Belletti
allen@xxxxxxxxxxxxxxx 404-894-6221 Phone
Industrial and Systems Engineering 404-385-2988 Fax
Georgia Institute of Technology
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster