Hi,

On Tue, 2010-07-27 at 05:57 -0700, Scooter Morris wrote:
> On 7/27/10 5:15 AM, Steven Whitehouse wrote:
> > Hi,
> >
> > If you translate a5b67f into decimal, then that is the inode number of
> > the inode which is causing a problem. It looks to me as if you have too
> > many processes trying to access this one inode from multiple nodes.
> >
> > It's not obvious from the traces that anything is actually stuck, but if
> > you take two traces, a few seconds or minutes apart, then it should
> > become more obvious whether the cluster is making progress or whether it
> > really is stuck,
> >
> > Steve.
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster@xxxxxxxxxx
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> Hi Steve,
> As always, thanks for the reply. The cluster was, indeed, truly
> stuck. I rebooted it last night to clear everything out. I never did
> figure out which file was the problem. I did a find -inum, but the find
> hung too. By that point the load average was up to 80 and climbing.
> Any ideas on how to avoid this? Are there tunable values I need to
> increase to allow more processes to access any individual inode?
>

The load average (LA) includes processes waiting for glocks, since that
is an uninterruptible wait, so that's where most of the LA came from.
The find is unlikely to work while the cluster is stuck: if it does
reach the culprit inode, that inode is, by definition, already stuck, so
the find process would just join the queue. If a find fails to discover
the inode once the cluster has been rebooted and is working again, then
it was probably a temporary file of some kind.

There are no tunable values, since the limit on access to the inode is
the speed of the hardware: how many times a given inode can be synced,
invalidated and the glock passed on to another node in a given time
period. It is a limitation of the hardware and the architecture of the
filesystem.
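
For reference, the hex-to-decimal step mentioned earlier in the thread can
be sketched as below. This is only an illustration: the helper name
glock_number_to_inum and the mount point /mnt/gfs2 are made up for the
example, and the find command is printed rather than run, since it would
only be useful once the cluster is healthy again.

```python
def glock_number_to_inum(hex_number: str) -> int:
    """Convert the hexadecimal number seen in a trace to a decimal inode number."""
    # int() with base 16 accepts both "a5b67f" and "0xa5b67f".
    return int(hex_number, 16)


inum = glock_number_to_inum("a5b67f")
print(inum)  # 10860159

# /mnt/gfs2 is a hypothetical mount point; substitute the real one.
# -xdev keeps find on that one filesystem, -inum matches the inode number.
print(f"find /mnt/gfs2 -xdev -inum {inum}")
```
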

There are a few things which can probably be improved in due course, but
in the main the best way to avoid congestion on inodes is just to be
careful about the access pattern across nodes. That said, if it really
was completely stuck, that is a real bug and not the result of the
access pattern, since the code is designed such that progress should
always be made, even if it's painfully slow.

Steve.
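
P.S. A rough sketch of the "two traces, a few seconds or minutes apart"
comparison suggested earlier in the thread, assuming each trace has been
saved as plain text. The helper name unchanged_entries and the sample
lines are invented for illustration and are not the real trace format.
The idea is only that entries which persist unchanged across several
snapshots point at a queue that is not draining, while entries that turn
over suggest slow progress rather than a true hang.

```python
def unchanged_entries(first: str, second: str) -> list:
    """Return the non-blank lines present in both snapshots, in first-snapshot order."""
    later = {line.strip() for line in second.splitlines() if line.strip()}
    return [
        line.strip()
        for line in first.splitlines()
        if line.strip() and line.strip() in later
    ]


# Hypothetical snapshot contents, NOT the real trace format.
snapshot_1 = "waiter pid=101 inode=a5b67f\nwaiter pid=102 inode=a5b67f\n"
snapshot_2 = "waiter pid=102 inode=a5b67f\nwaiter pid=103 inode=a5b67f\n"

# pid 101 has gone and pid 103 is new, so the queue is moving.
print(unchanged_entries(snapshot_1, snapshot_2))
```
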