Hi,

On Sat, 28 Mar 2009, Wendy Cheng wrote:

> Kadlecsik Jozsef wrote:
> > > I don't see a strong evidence of deadlock (but it could) from the
> > > thread backtraces However, assuming the cluster worked before, you
> > > could have overloaded the e1000 driver in this case. There are
> > > suspicious page faults but memory is very "ok". So one possibility
> > > is that GFS had generated too many sync requests that flooded the
> > > e1000. As the result, the cluster heart beat missed its interval.
> >
> > It's a possibility. But it assumes also that the node freezes >because< it
> > was fenced off. So far nothing indicates that.
>
> Re-read your console log. There are many foot-prints of spin_lock - that's
> worrisome. Hit a couple of "sysrq-w" next time when you have hangs, other
> than sysrq-t. This should give traces of the threads that are actively on CPUs
> at that time. Also check your kernel change log (to see whether GFS has any
> new patch that touches spin lock that doesn't in previous release).

I went through the git changelogs yesterday but could not spot anything
suspicious. However, I'm not a filesystem expert at all. The patch titled

    gfs-kernel: Bug 450209: Create gfs1-specific lock modules + minor
    fixes to build with 2.6.27

hit me hard: according to its description, it was *not* tested in a
cluster environment when it replaced dlm behind gfs.

I made the decision and we downgraded, as we could not delay any longer:

    cluster-2.03.11 -> cluster-2.01.00
    linux-2.6.27.21 -> linux-2.6.23.17

The e1000 and e1000e drivers are the newest ones. The aoe driver is from
aoe6-59 because aoe6-69 does not support 2.6.23.17. We did not downgrade
openais and LVM2. Tomorrow we'll move mailman back to GFS.

There are three different netconsole log recordings at
http://www.kfki.hu/~kadlec/gfs/, that's all I could do.
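
For the next hang, the sysrq dumps suggested above could also be requested
from a script instead of the keyboard, as long as userspace still responds
on the node. What follows is only a minimal sketch, not something we have
run: it assumes root access and the standard /proc/sysrq-trigger interface,
and the helper names, repeat count and delay are made up for illustration.
Which dump each letter produces depends on the kernel version, so the
kernel's sysrq documentation should be checked first.

```python
import time

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # standard Linux procfs interface


def trigger_sysrq(key, trigger_path=SYSRQ_TRIGGER):
    """Write one SysRq command letter to the trigger file (needs root)."""
    if len(key) != 1:
        raise ValueError("SysRq command must be a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(key)


def dump_traces(keys=("w", "t"), repeats=2, delay=5.0,
                trigger_path=SYSRQ_TRIGGER):
    """Request each dump `repeats` times, sleeping `delay` seconds per round.

    The defaults are illustrative assumptions, not tested values.
    Returns the list of letters written, in order.
    """
    sent = []
    for _ in range(repeats):
        for key in keys:
            trigger_sysrq(key, trigger_path)
            sent.append(key)
        time.sleep(delay)
    return sent


if __name__ == "__main__":
    dump_traces()
```

The traces go to the kernel log, so the netconsole setup already in place
should capture them.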

If there are patches, I'll try to test them on one of the nodes, but it
can't be the one which runs the mailman queue manager. So far I have not
found any way to crash the system at will other than running the queue
manager. That's a debugging problem to solve.

> BTW, I do have opinions on other parts of your postings but don't have
> time to express them now. Maybe I'll say something when I finish my
> current chores :)

I'd definitely like to read your opinion!

We'll reorganize one of the AOE blades by backing up the GFS volume and
creating a smaller one to make space for a new GFS2 test volume.

Best regards,
Jozsef
--
E-mail : kadlec@xxxxxxxxxxxx, kadlec@xxxxxxxxxxxxxxxxx
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary