Jeff Sturm wrote:
>
> Recently we had a cluster node fail with a failed assertion:
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal: assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)" failed
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: function = gfs_trans_add_gl
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c, line = 237
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: time = 1246022619
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to withdraw from the cluster
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM to withdraw
>
> This is with CentOS 5.2, GFS1. The cluster had been operating
> continuously for about 3 months.
>
> My challenge isn't preventing assertion failures entirely; I recognize
> that lurking software bugs and hardware anomalies can lead to a failed
> node. Rather, I want to prevent one node from freezing the cluster.
> When the above was logged, all nodes in the cluster that access the
> tb2data filesystem also froze and did not recover. We recovered with a
> rolling cluster restart and a precautionary gfs_fsck.
>
> Most cluster problems are handled quickly by the fence agents, but the
> "telling LM to withdraw" path does not trigger a fence operation or any
> other automated recovery. I need a deployment strategy to fix that.
> Should I write an agent that scans the syslog, matches on the message
> above, and fences the node?
>
> Has anyone else encountered the same problem? If so, how did you get
> around it?
>
> -Jeff

https://bugzilla.redhat.com/show_bug.cgi?id=471258

The assert+withdraw you're seeing appears to be the bug above. I've
tried to recreate it on my cluster and failed. If you have a recipe to
reproduce it, could you please post it to the bugzilla?

Meanwhile, I'll look at the code again to see if I can spot anything.

Thanks!
--Abhi
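
To make the syslog-watching idea concrete, here is a minimal, untested
sketch of such an agent. It assumes the withdraw message matches the
kernel lines quoted above, that Python is available on the node, and
that invoking the cluster's fence_node utility against the local node
is the desired recovery action; the script itself, the log path, and
the choice of action are illustrative only, not an existing agent.

#!/usr/bin/env python
# Hypothetical watchdog sketch: tail syslog and force recovery when a
# GFS mount withdraws. Adjust SYSLOG and FENCE_CMD for your cluster.
import re
import socket
import subprocess
import time

SYSLOG = "/var/log/messages"

# Match the GFS withdraw message, e.g.
#   kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM to withdraw
WITHDRAW_RE = re.compile(r"GFS: fsid=\S+: telling LM to withdraw")

# Assumed recovery action: ask the cluster to fence this node. This
# presumes the cluster node name equals the hostname; substitute a
# reboot or any other escalation if that fits your setup better.
FENCE_CMD = ["fence_node", socket.gethostname()]

def follow(path):
    """Yield lines appended to 'path' (a crude 'tail -f'; it does not
    handle log rotation)."""
    f = open(path)
    f.seek(0, 2)                  # start at the current end of file
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        yield line

def main():
    for line in follow(SYSLOG):
        if WITHDRAW_RE.search(line):
            # A withdrawn GFS mount cannot recover without a reboot or
            # fence, so trigger the recovery action and stop watching.
            subprocess.call(FENCE_CMD)
            break

if __name__ == "__main__":
    main()

Whether a withdrawn node should fence itself or ask a surviving peer to
do so is a design choice: self-fencing only works if the failing node
can still reach its fence device, so running the watcher on every node
and simply forcing a local reboot may be the more robust variant.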