Hi,

We've been using Red Hat GFS 6.0.0-1.2 for a few months now with generally very good reliability. We have the system configured with 5 lock servers, 2 of which are the servers that physically mount the shared storage (connected via shared SCSI on an HP DL380 packaged cluster).

We've taken down slave lock servers before without incident, but recently we went to reboot the master lock server and the filesystem became inaccessible from the other server. The logs indicate what _seems_ to be a successful failover of the master lock server to one of the other nodes, but after that, every attempt to acquire a journal lock on the filesystem results in a 'Busy' message. We then took all the servers down and booted up one of the filesystem-mounting nodes, and all was well, no need for fsck.

The 5 nodes are as follows:

cluster1 (lock + fs node, was master, shutdown initiated)
cluster2 (lock + fs node, becomes master on cluster1 shutdown)
lvs1     (lock node)
lvs2     (lock node)
intra4   (lock node)

The locking is done with pool on two volumes, pool_home and pool_shared. ccsd uses the shared storage for the cluster config info on cluster1 and cluster2, and local files for the rest of the lock servers. If more information is needed I'd be happy to post full config details (a rough sketch of our cluster.ccs is at the end of this message).

Here are the messages in the logs on cluster2 around the time of the shutdown. Any help on how to prevent this lockup from happening (or pointers on what I'm doing wrong) would be appreciated! Thanks!

-------------------------
Feb 18 17:02:24 cluster2 kernel: lock_gulm: Checking for journals for node "cluster1.sonitrol.net"
Feb 18 17:02:24 cluster2 lock_gulmd_core[1345]: Master Node has logged out.
Feb 18 17:02:24 cluster2 kernel: lock_gulm: Checking for journals for node "cluster1.sonitrol.net"
Feb 18 17:02:24 cluster2 lock_gulmd_core[1345]: ERROR [core_io.c:1029] Got error from reply: (cluster1.sonitrol.net:172.16.6.131) 1:Unknown GULM Err
Feb 18 17:02:24 cluster2 lock_gulmd_core[1345]: ERROR [core_io.c:1034] Errors on xdr: (cluster1.sonitrol.net:172.16.6.131) -104:104:Connection reset by peer
Feb 18 17:02:33 cluster2 lock_gulmd_core[1345]: I see no Masters, So I am Arbitrating until enough Slaves talk to me.
Feb 18 17:02:33 cluster2 lock_gulmd_LTPX[1351]: New Master at cluster2.sonitrol.net:172.16.6.132
Feb 18 17:02:48 cluster2 lock_gulmd_core[1345]: lvs1.sonitrol.net missed a heartbeat (time:1108764168893978 mb:1)
Feb 18 17:02:48 cluster2 lock_gulmd_core[1345]: lvs2.sonitrol.net missed a heartbeat (time:1108764168893978 mb:1)
Feb 18 17:02:48 cluster2 lock_gulmd_core[1345]: intra4 missed a heartbeat (time:1108764168893978 mb:1)
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Still in Arbitrating: Have 2, need 3 for quorum.
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: New Client: idx:5 fd:10 from (172.16.6.150:intra4)
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Member update message Logged in about intra4 to lvs1.sonitrol.net is lost because node is in OM
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Member update message Logged in about intra4 to lvs2.sonitrol.net is lost because node is in OM
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Now have Slave quorum, going full Master.
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: New Client: idx:6 fd:11 from (172.16.6.231:lvs2.sonitrol.net)
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Member update message Logged in about lvs2.sonitrol.net to lvs1.sonitrol.net is lost because node is in OM
Feb 18 17:02:49 cluster2 lock_gulmd_LTPX[1351]: Logged into LT000 at cluster2.sonitrol.net:172.16.6.132
Feb 18 17:02:49 cluster2 lock_gulmd_LTPX[1351]: Finished resending to LT000
Feb 18 17:02:50 cluster2 lock_gulmd_LT000[1348]: New Client: idx 2 fd 7 from (172.16.6.132:cluster2.sonitrol.net)
Feb 18 17:02:50 cluster2 lock_gulmd_LT000[1348]: New Client: idx 3 fd 8 from (172.16.6.231:lvs2.sonitrol.net)
Feb 18 17:02:50 cluster2 lock_gulmd_core[1345]: New Client: idx:7 fd:12 from (172.16.6.230:lvs1.sonitrol.net)
Feb 18 17:02:50 cluster2 lock_gulmd_core[1345]: Timeout (15000000) on fd:5 (cluster1.sonitrol.net:172.16.6.131)
Feb 18 17:02:52 cluster2 lock_gulmd_LT000[1348]: Attached slave lvs2.sonitrol.net:172.16.6.231 idx:4 fd:9 (soff:3 connected:0x8)
Feb 18 17:02:52 cluster2 lock_gulmd_LT000[1348]: New Client: idx 5 fd 10 from (172.16.6.230:lvs1.sonitrol.net)
Feb 18 17:02:54 cluster2 lock_gulmd_LT000[1348]: Attached slave lvs1.sonitrol.net:172.16.6.230 idx:6 fd:11 (soff:2 connected:0xc)
Feb 18 17:02:54 cluster2 lock_gulmd_LT000[1348]: New Client: idx 7 fd 12 from (172.16.6.150:intra4)
Feb 18 17:03:04 cluster2 lock_gulmd_LT000[1348]: Attached slave intra4:172.16.6.150 idx:8 fd:13 (soff:1 connected:0xe)
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Trying to acquire journal lock...
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Busy
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Trying to acquire journal lock...
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Busy
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Trying to acquire journal lock...
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Busy
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Trying to acquire journal lock...
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Busy
Feb 18 17:05:03 cluster2 lock_gulmd_core[1345]: New Client: idx:1 fd:5 from (172.16.6.131:cluster1.sonitrol.net)
Feb 18 17:05:05 cluster2 lock_gulmd_LT000[1348]: Attached slave cluster1.sonitrol.net:172.16.6.131 idx:9 fd:14 (soff:0 connected:0xf)
Feb 18 17:05:05 cluster2 lock_gulmd_LT000[1348]: New Client: idx 10 fd 15 from (172.16.6.131:cluster1.sonitrol.net)
Feb 18 17:05:06 cluster2 ypserv[1395]: refused connect from 172.16.6.131:753 to procedure ypproc_all (LTSP,auto.master;-4)
Feb 18 17:05:55 cluster2 login(pam_unix)[1828]: session opened for user root by LOGIN(uid=0)
Feb 18 17:05:55 cluster2 -- root[1828]: ROOT LOGIN ON tty2
Feb 18 17:06:02 cluster2 lock_gulmd_core[1345]: "cluster1.sonitrol.net" is logged out. fd:5
Feb 18 17:06:02 cluster2 kernel: lock_gulm: Checking for journals for node "cluster1.sonitrol.net"
Feb 18 17:06:02 cluster2 lock_gulmd_LT000[1348]: EOF on xdr (cluster1.sonitrol.net:172.16.6.131 idx:10 fd:15)
Feb 18 17:06:02 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Trying to acquire journal lock...
Feb 18 17:06:02 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Busy
Feb 18 17:06:02 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Trying to acquire journal lock...
Feb 18 17:06:02 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Busy
-------------------------

--
----------------------------------
Marc Swanson, Software Engineer
Sonitrol Communications Corp.
Hartford, CT

Email: mswanson@xxxxxxxxxxxx
Phone: (860) 616-7036
Pager: (860) 948-6713
Cell:  (603) 512-1267
Fax:   (860) 616-7589
----------------------------------
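P.S. In case the lock server setup is relevant, here is a minimal sketch of our cluster.ccs, with node names taken from the logs above. The exact syntax is paraphrased from memory rather than pasted from the live archive, so treat it as illustrative; I can post the verbatim CCS files if that would help.

-------------------------
# cluster.ccs -- GULM lock servers for the "alpha" cluster (sketch)
cluster {
	name = "alpha"
	lock_gulm {
		servers = ["cluster1.sonitrol.net", "cluster2.sonitrol.net",
		           "lvs1.sonitrol.net", "lvs2.sonitrol.net", "intra4"]
	}
}
-------------------------

With 5 servers in that list, lock_gulmd needs a majority of 3 to have quorum, which matches the "Still in Arbitrating: Have 2, need 3 for quorum" line in the log above.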