I started my mount/tar/rm tests on Apr 4 at 17:41 and hit a problem on Apr 6 at 05:30, so the test ran for about 36 hours.

cl030 and cl031 were getting "SM: process_reply invalid" messages, and cl032 got "No response" and "Missed too many heartbeats".

cl032:
[-- MARK -- Wed Apr 6 05:15:00 2005]
CMAN: removing node cl030a from the cluster : Missed too many heartbeats
CMAN: removing node cl031a from the cluster : No response to messages
CMAN: quorum lost, blocking activity
[-- MARK -- Wed Apr 6 05:30:00 2005]
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"

cl030:
[-- MARK -- Wed Apr 6 05:15:00 2005]
CMAN: removing node cl032a from the cluster : Missed too many heartbeats
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"
GFS: fsid=gfs_cluster:stripefs.0: Joined cluster. Now mounting FS...
GFS: fsid=gfs_cluster:stripefs.0: jid=0: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=0: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=0: Done
GFS: fsid=gfs_cluster:stripefs.0: jid=1: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=1: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=1: Done
GFS: fsid=gfs_cluster:stripefs.0: jid=2: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=2: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=2: Done
GFS: fsid=gfs_cluster:stripefs.0: jid=3: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=3: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=3: Done
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295

cl031:
[-- MARK -- Wed Apr 6 05:15:00 2005]
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295
SM: process_reply invalid id=20500 nodeid=4294967295
SM: process_reply invalid id=20500 nodeid=4294967295
SM: process_reply invalid id=20500 nodeid=4294967295
SM: process_reply invalid id=20501 nodeid=4294967295
SM: process_reply invalid id=20501 nodeid=4294967295
SM: process_reply invalid id=20501 nodeid=4294967295
SM: process_reply invalid id=20504 nodeid=4294967295
SM: process_reply invalid id=20504 nodeid=4294967295
SM: process_reply invalid id=20504 nodeid=4294967295
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"
SM: process_reply invalid id=20505 nodeid=4294967295
GFS: fsid=gfs_cluster:stripefs.1: Joined cluster. Now mounting FS...

A bit more info is available here:
http://developer.osdl.org/daniel/GFS/test.04apr2005/

Any ideas on what is going on?

Daniel