Hi,

On Thu, 2009-12-03 at 17:30 -0500, Allen Belletti wrote:
> Hi All,
>
> After Steve and the RedHat guys dug into my nasty crashdump (thanks
> all!), I believe I'm down to the last GFS2 problem on our mail cluster,
> but it's a common one.
>
> I've always had trouble with processes getting stuck on GFS2 access and
> queuing up. Since the 5.4 upgrade and moving to the proper GFS2 kernel
> module, it's changed but not gone away. Every few days now, I'm seeing
> processes getting stuck with WCHAN=just_schedule. Once this starts
> happening, both cluster nodes will accumulate them rapidly which
> eventually brings IO to a halt. The only way I've found to escape is
> via a reboot, sometimes of one, sometimes of both nodes.
>
> Since there's no crash, I don't get any useful debug information.
> Outside of this one repeating glitch, performance is great and all is
> well. If anyone can suggest ways of gathering more data about the
> problem, or possible solutions, I would be grateful.
>
> Thanks,
> Allen
>

This is typical of what happens when there is contention on a glock
between two (or more) nodes. There is a mechanism which is supposed to
mitigate the issue: each node is allowed to hold on to a glock for a
minimum period of time, which is designed to ensure that some work is
done each time a node acquires a glock. If your storage is particularly
slow, and/or possibly depending upon the exact I/O pattern, it may not
always be 100% effective.

In the first instance though, see if you can find an inode which is
being contended from both nodes, as that will most likely be the
culprit.

Steve.
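One possible way to look for such an inode is to capture the glock dump on each node at the moment processes are stuck and compare the two. The sketch below is only an illustration and rests on several assumptions: that debugfs is mounted and each node's dump has been copied from something like /sys/kernel/debug/gfs2/<cluster>:<fsname>/glocks, that glock lines look like "G:  s:EX n:2/805b ..." with type 2 meaning an inode glock and the number in hex, and that a "W" in a holder line's f: field marks a waiting request. The function names are hypothetical and not part of any GFS2 tool.

```python
import re
from collections import defaultdict

# Assumed glock header line, e.g. "G:  s:EX n:2/805b f:lq t:EX d:SH/0 ..."
GLOCK_RE = re.compile(r"^\s*G:\s+s:(\S+)\s+n:(\d+)/([0-9a-fA-F]+)")
# Assumed holder line, e.g. " H: s:SH f:W e:0 p:4466 [imap] ..."
HOLDER_RE = re.compile(r"^\s*H:\s+s:(\S+)\s+f:(\S*)")

INODE_GLOCK_TYPE = 2  # assumption: type 2 in "n:<type>/<number>" is an inode glock


def waiting_inode_glocks(dump_text):
    """Return {inode_number: waiter_count} for inode glocks with waiting holders."""
    waiters = defaultdict(int)
    current = None  # inode number of the glock whose holders we are reading
    for line in dump_text.splitlines():
        g = GLOCK_RE.match(line)
        if g:
            gtype = int(g.group(2))
            number = int(g.group(3), 16)
            current = number if gtype == INODE_GLOCK_TYPE else None
            continue
        h = HOLDER_RE.match(line)
        # Assumption: a "W" among the holder flags means the request is waiting.
        if h and current is not None and "W" in h.group(2):
            waiters[current] += 1
    return dict(waiters)


def contended_on_both(dump_a, dump_b):
    """Return sorted inode numbers that have waiters in both nodes' dumps."""
    a = waiting_inode_glocks(dump_a)
    b = waiting_inode_glocks(dump_b)
    return sorted(set(a) & set(b))
```

An inode number returned by contended_on_both() is decimal, so it could then be mapped back to a path with something like "find <mountpoint> -inum <number>", which may be slow on a large filesystem.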