Re: GFS2: processes stuck in "just schedule"

On 12/04/2009 04:39 AM, Steven Whitehouse wrote:
> Hi,
>
> On Thu, 2009-12-03 at 17:30 -0500, Allen Belletti wrote:
>> Hi All,
>>
>> After Steve and the RedHat guys dug into my nasty crashdump (thanks
>> all!), I believe I'm down to the last GFS2 problem on our mail cluster,
>> but it's a common one.
>>
>> I've always had trouble with processes getting stuck on GFS2 access and
>> queuing up.  Since the 5.4 upgrade and the move to the proper GFS2
>> kernel module, it's changed but not gone away.  Every few days now, I'm
>> seeing processes getting stuck with WCHAN=just_schedule.  Once this
>> starts happening, both cluster nodes accumulate them rapidly, which
>> eventually brings IO to a halt.  The only way I've found to escape is
>> via a reboot, sometimes of one node and sometimes of both.
>>
>> Since there's no crash, I don't get any useful debug information.
>> Outside of this one repeating glitch, performance is great and all is
>> well.  If anyone can suggest ways of gathering more data about the
>> problem, or possible solutions, I would be grateful.
>>
>> Thanks,
>> Allen
>
> This would be typical of what happens when there is contention on a
> glock between two (or more) nodes. There is a mechanism which is
> supposed to mitigate the issue (by allowing each node to hold on to a
> glock for a minimum period of time, designed to ensure that some work
> is done each time a node acquires the glock), but if your storage is
> particularly slow, and/or depending upon the exact I/O pattern, it may
> not always be 100% effective.
>
> In the first instance though, see if you can find an inode which is
> being contended for from both nodes, as that will most likely be the
> culprit.
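To make the hold-time mechanism concrete, here is an illustrative back-of-the-envelope model (not GFS2 code; all names are made up): every time the glock moves between nodes costs a DLM round trip, so a minimum hold time amortizes that cost over a batch of operations instead of paying it per operation.

```python
# Illustrative model only -- not GFS2 code. Two nodes each perform
# ops_per_node operations on a shared resource; each cross-node lock
# transfer costs one DLM round trip. A minimum hold time lets a node
# batch ops_per_hold operations per acquisition.

def transfers_needed(ops_per_node, ops_per_hold):
    """Total lock transfers for two nodes interleaving their work."""
    acquisitions_per_node = -(-ops_per_node // ops_per_hold)  # ceiling division
    return 2 * acquisitions_per_node

# Bouncing the glock on every single operation:
assert transfers_needed(1000, 1) == 2000
# Holding it long enough for 50 operations per acquisition:
assert transfers_needed(1000, 50) == 40
```

If the storage is slow, each of those transfers also carries the latency of flushing dirty data before the glock can be demoted, which is why a slow array can make the ping-pong so much more visible.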
We've got a 3-4 year old Sun 3510 FC array shared between the two nodes. The utilization on it is generally quite reasonable, so I doubt it would qualify as "particularly slow". Also, the very busiest times for the mail system are usually during the nightly rsync backups, and it rarely if ever gets wedged then.

Can you give me some hints as to how I might go about finding an inode that's being contended for by both nodes? I assume that would at least be useful to confirm what the problem is.
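Not an answer from this thread, but one way to approach the question: on kernels that expose glock state through debugfs (mount debugfs and look under /sys/kernel/debug/gfs2/<fsname>/glocks), "n:2/<number>" entries are inode glocks, and holder ("H:") lines whose flags include "W" are waiters. A rough parser, shown here against a made-up sample in the shape of such a dump (the exact field layout varies by kernel version), might look like:

```python
import re

def contended_inodes(glock_dump):
    """Return hex numbers of type-2 (inode) glocks that have at
    least one waiting holder ('W' flag on an 'H:' line)."""
    contended = []
    current = None
    for line in glock_dump.splitlines():
        line = line.strip()
        if line.startswith("G:"):
            # Only remember glocks of type 2 (inodes)
            m = re.search(r"n:2/([0-9a-f]+)", line)
            current = m.group(1) if m else None
        elif line.startswith("H:") and current:
            m = re.search(r"f:(\S+)", line)
            if m and "W" in m.group(1):
                contended.append(current)
                current = None  # report each inode once
    return contended

# Made-up sample, roughly in the shape of a glocks dump:
sample = """
G:  s:SH n:2/75f f:lq t:EX d:EX/0 a:0 r:4
 H: s:SH f:H e:0 p:3241 [imapd]
 H: s:EX f:W e:0 p:3250 [deliver]
G:  s:SH n:1/42 f: t:SH d:EX/0 a:0 r:2
 H: s:SH f:H e:0 p:3241 [imapd]
"""
print(contended_inodes(sample))
```

Capturing the dump on both nodes while the stall is in progress and comparing which inode glocks show waiters on each side would point at the contended inode; mapping the glock number back to a path is version-specific, so that last step is left open here.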

Thanks,
Allen

--
Allen Belletti
allen@xxxxxxxxxxxxxxx                             404-894-6221 Phone
Industrial and Systems Engineering                404-385-2988 Fax
Georgia Institute of Technology

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
