On Sat, 28 Mar 2009, Kadlecsik Jozsef wrote:

> On Fri, 27 Mar 2009, Bob Peterson wrote:
>
> > Perhaps you should change your post_fail_delay to some very high
> > number, recreate the problem, and when it freezes force a
> > sysrq-trigger to get call traces for all the processes.
> > Then also you can look at the dmesg to see if there was a kernel
> > panic or something on the node that would otherwise be
> > immediately fenced.
>
> I enabled more kernel debugging, netconsole and captured the attached
> console log. I hope it gives the required info.

I should get some sleep - but could it be that I hit the potential
deadlock mentioned here:

commit 4787e11dc7831f42228b89ba7726fd6f6901a1e3

    gfs-kmod: workaround for potential deadlock. Prefault user pages

    The bug uncovered in 461770 does not seem fixable without a massive
    change to how gfs works. There is a lock ordering mismatch between
    the process address space lock and the glocks. The only good way to
    avoid this in all cases is to not hold the glock for so long, which
    is what gfs2 does. This is impossible without completely changing
    how gfs does locking. Fortunately, this is only a problem when you
    have multiple processes sharing an address space, and are doing IO
    to a gfs file with a userspace buffer that's part of an mmapped gfs
    file. In this case, prefaulting the buffer's pages immediately
    before acquiring the glocks significantly shortens the window for
    this deadlock. Closing the window any more causes a large
    performance hit.

Mailman does mmap files...

Best regards,
Jozsef
--
E-mail : kadlec@xxxxxxxxxxxx, kadlec@xxxxxxxxxxxxxxxxx
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster