Hello,

we hit the same problem again that I reported in January. We installed the hotfix, but it didn't help. Once again the whole cluster froze, and no node was allowed to rejoin the fence domain. Any ideas, or do you need more information?

Thanks and regards,
Marc

Mar 22 04:04:03 lilr623b clurgmgrd[12855]: <err> #48: Unable to obtain cluster lock: Connection timed out
Mar 22 04:04:06 lilr623a clurgmgrd[20754]: <err> #50: Unable to obtain cluster lock: Connection timed out
Mar 22 04:04:31 lilr623e clurgmgrd[20331]: <err> #48: Unable to obtain cluster lock: Connection timed out
Mar 22 04:04:33 lilr623b clurgmgrd[12855]: <err> #50: Unable to obtain cluster lock: Connection timed out
Mar 22 04:04:50 lilr623a clurgmgrd[20754]: <err> #48: Unable to obtain cluster lock: Connection timed out
Mar 22 04:05:18 lilr623b clurgmgrd[12855]: <err> #48: Unable to obtain cluster lock: Connection timed out
Mar 22 04:05:35 lilr623a clurgmgrd[20754]: <err> #50: Unable to obtain cluster lock: Connection timed out
Mar 22 04:06:03 lilr623b clurgmgrd[12855]: <err> #50: Unable to obtain cluster lock: Connection timed out
Mar 22 04:06:21 lilr623a clurgmgrd[20754]: <err> #48: Unable to obtain cluster lock: Connection timed out
Mar 22 04:06:33 lilr623b clurgmgrd[12855]: <err> #48: Unable to obtain cluster lock: Connection timed out
Mar 22 04:07:05 lilr623a clurgmgrd[20754]: <err> #50: Unable to obtain cluster lock: Connection timed out
Mar 22 07:09:39 lilr623d kernel: CMAN: node lilr623f-ics0 has been removed from the cluster : Missed too many heartbeats
Mar 22 07:09:39 lilr623c kernel: CMAN: node lilr623f-ics0 has been removed from the cluster : Missed too many heartbeats
Mar 22 07:09:39 lilr623d kernel: dlm: lt_sharedroot: send_cluster_request to 3 state 1 recovery
Mar 22 07:10:00 lilr623d kernel: CMAN: node lilr623b-ics0 has been removed from the cluster : Missed too many heartbeats
Mar 22 07:10:00 lilr623c kernel: CMAN: removing node lilr623b-ics0 from the cluster : Missed too many heartbeats
Mar 22 07:10:05 lilr623c kernel: dlm: lt_sharedroot: dlm_dir_rebuild_local failed -1
Mar 22 07:10:05 lilr623d kernel: dlm: lt_sharedroot: dlm_dir_rebuild_local failed -1
Mar 22 07:10:05 lilr623c kernel: dlm: lt_scratch: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:05 lilr623d kernel: dlm: lt_scratch: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:10 lilr623c kernel: dlm: lt_products: restbl_rsb_update failed -1
Mar 22 07:10:10 lilr623d kernel: dlm: lt_products: restbl_rsb_update failed -1
Mar 22 07:10:11 lilr623c kernel: dlm: lt_P06user: dlm_dir_rebuild_local failed -1
Mar 22 07:10:11 lilr623d kernel: dlm: lt_P06user: dlm_dir_rebuild_local failed -1
Mar 22 07:10:11 lilr623c kernel: dlm: lt_P06user1: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:11 lilr623d kernel: dlm: lt_P06user1: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:15 lilr623d kernel: dlm: lt_P06sap: dlm_dir_rebuild_local failed -1
Mar 22 07:10:15 lilr623d kernel: dlm: lt_P06origlogA: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:15 lilr623d kernel: dlm: lt_P06origlogB: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:16 lilr623c kernel: dlm: lt_P06sap: dlm_dir_rebuild_wait failed -1
Mar 22 07:10:20 lilr623d kernel: dlm: lt_P06origlogC: dlm_dir_rebuild_wait failed -1
Mar 22 07:10:21 lilr623c kernel: dlm: lt_P06origlogA: dlm_dir_rebuild_wait failed -1
Mar 22 07:10:21 lilr623c kernel: dlm: lt_P06origlogB: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:22 lilr623c kernel: dlm: lt_P06origlogC: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:25 lilr623d kernel: dlm: lt_P06origlogD: dlm_dir_rebuild_wait failed -1
Mar 22 07:10:25 lilr623d kernel: dlm: lt_P06mirrlogA: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:25 lilr623d kernel: dlm: lt_P06mirrlogB: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:27 lilr623c kernel: dlm: lt_P06origlogD: dlm_dir_rebuild_wait failed -1
Mar 22 07:10:27 lilr623c kernel: dlm: lt_P06mirrlogA: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:28 lilr623c kernel: dlm: lt_P06mirrlogB: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:28 lilr623c kernel: dlm: lt_P06mirrlogC: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:29 lilr623c kernel: dlm: lt_P06mirrlogD: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:30 lilr623c kernel: dlm: lt_P06arch: restbl_rsb_update failed -1
Mar 22 07:10:30 lilr623c kernel: dlm: lt_P06data1: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:30 lilr623d kernel: dlm: lt_P06mirrlogC: dlm_dir_rebuild_wait failed -1
Mar 22 07:10:30 lilr623d kernel: dlm: lt_P06mirrlogD: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:31 lilr623c kernel: dlm: lt_P06data2: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:35 lilr623c kernel: dlm: lt_P06data3: restbl_rsb_update failed -1
Mar 22 07:10:35 lilr623d kernel: dlm: lt_P06arch: restbl_rsb_update failed -1
Mar 22 07:10:35 lilr623c kernel: dlm: lt_P06data4: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:35 lilr623d kernel: dlm: lt_P06data1: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:36 lilr623d kernel: dlm: lt_P06data2: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:41 lilr623d kernel: dlm: lt_P06data3: restbl_rsb_update failed -1
Mar 22 07:10:41 lilr623d kernel: dlm: lt_P06data4: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:41 lilr623c kernel: dlm: clvmd: dlm_dir_rebuild_wait failed -1
Mar 22 07:10:41 lilr623c kernel: dlm: Magma: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:41 lilr623d kernel: dlm: clvmd: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:42 lilr623d kernel: dlm: Magma: dlm_dir_rebuild_wait failed 1
Mar 22 07:11:05 lilr623c fenced[15490]: fencing deferred to lilr623a-ics0
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data3.2: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data3.0: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data3.2: jid=5: Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data2.0: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data1.2: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=5: Looking at journal...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data1.2: jid=5: Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data1.0: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=5: Busy
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06arch.0: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=4: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06arch.0: jid=5: Looking at journal...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=4: Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06mirrlogD.0: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogC.2: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data3.0: jid=5: Busy
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogC.2: jid=5: Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data3.0: jid=4: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogB.2: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06mirrlogC.0: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogB.2: jid=5: Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data1.0: jid=5: Busy
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogA.2: jid=5: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data1.0: jid=4: Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogD.2: jid=5: Trying to acquire journal lock...
....
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_products.2: jid=4: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06data3.0: jid=4: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623c kernel: lock_dlm: lm_dlm_cancel 1,2 flags 84
Mar 22 07:11:36 lilr623c kernel: lock_dlm: lm_dlm_cancel skip 1,2 flags 84
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogB.2: jid=5: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_scratch.2: jid=4: Busy
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_P06origlogB.2: jid=4: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogA.2: jid=4: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06origlogD.0: jid=4: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06mirrlogC.0: jid=4: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06origlogC.0: jid=4: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=5: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06user1.0: jid=5: Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06user.0: jid=5: Done
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06user.0: jid=4: Trying to acquire journal lock...
Mar 22 07:11:37 lilr623a clurgmgrd[20754]: <err> #48: Unable to obtain cluster lock: Connection timed out
Mar 22 07:11:37 lilr623e clurgmgrd[20331]: <err> #48: Unable to obtain cluster lock: Connection timed out
Mar 22 07:11:37 lilr623e clurgmgrd[20331]: <err> #50: Unable to obtain cluster lock: Connection timed out
Mar 22 07:11:37 lilr623e clurgmgrd[20331]: <err> #48: Unable to obtain cluster lock: Connection timed out
Mar 22 07:11:37 lilr623d kernel: GFS: fsid=lilr623:lt_sharedroot.2: jid=5: Acquiring the transaction lock...
Mar 22 07:11:37 lilr623d kernel: GFS: fsid=lilr623:lt_P06user.0: jid=4: Busy
Mar 22 07:11:37 lilr623c kernel: GFS: fsid=lilr623:lt_P06origlogD.2: jid=5: Acquiring the transaction lock...
Mar 22 07:11:37 lilr623a clurgmgrd[20754]: <err> #50: Unable to obtain cluster lock: Connection timed out
Mar 22 07:11:37 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=4: Acquiring the transaction lock...
Mar 22 07:11:37 lilr623e clurgmgrd[20331]: <err> #50: Unable to obtain cluster lock: Connection timed out
Mar 22 07:11:37 lilr623d kernel: GFS: fsid=lilr623:lt_P06data2.0: jid=4: Acquiring the transaction lock...
...
Mar 22 07:11:38 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=5: Done
Mar 22 07:11:38 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=4: Trying to acquire journal lock...
Mar 22 07:11:38 lilr623d kernel: GFS: fsid=lilr623:lt_P06user1.0: jid=4: Busy
Mar 22 07:11:38 lilr623d kernel: GFS: fsid=lilr623:lt_P06data1.0: jid=4: Replayed 0 of 0 blocks
Mar 22 07:11:39 lilr623c kernel: GFS: fsid=lilr623:lt_P06origlogC.2: jid=4: Trying to acquire journal lock...
Mar 22 07:11:39 lilr623a shutdown: shutting down for system reboot
Mar 22 07:11:39 lilr623a kernel: dlm: lt_products: restbl_rsb_update failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06origlogB: dlm_dir_rebuild_wait failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06origlogC: dlm_dir_rebuild_wait failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06mirrlogB: dlm_dir_rebuild_wait failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06mirrlogD: dlm_dir_rebuild_wait failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06data2: dlm_dir_rebuild_wait failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06data3: restbl_rsb_update failed -1
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_P06data4.1: jid=5: Trying to acquire journal lock...
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_P06data1.1: jid=5: Trying to acquire journal lock...
Mar 22 07:11:39 lilr623d clurgmgrd[20148]: <info> State change: lilr623f-ics0 DOWN
Mar 22 07:11:39 lilr623e kernel: rh_lkid 2bd03c3
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_P06origlogD.1: jid=5: Trying to acquire journal lock...
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_sharedroot.0: jid=4: Busy
Mar 22 07:11:39 lilr623e kernel: lockstate 0
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_scratch.1: jid=5: Busy
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_P06data4.1: jid=4: Busy
Mar 22 07:11:39 lilr623e kernel: rh_cmd 5
Mar 22 07:11:39 lilr623e kernel: nodeid 5
Mar 22 07:11:39 lilr623e kernel: dlm: Magma: reply from 2 no lock
Mar 22 07:11:39 lilr623e kernel: CMAN: node lilr623b-ics0 has been removed from the cluster : Missed too many heartbeats

On Monday 29 January 2007 19:44:46 Lon Hohberger wrote:
> On Fri, 2007-01-26 at 19:28 +0100, Marc Grimme wrote:
> > On Friday 26 January 2007 19:15, Lon Hohberger wrote:
> > > On Fri, 2007-01-26 at 09:19 +0100, Marc Grimme wrote:
> > > > Hello,
> > > > yesterday we saw a cluster freeze (which seems to come from the
> > > > rgmanager) with RHEL4/U4 GFS installed (see logs) consisting of 6
> > > > nodes x86_64 Architecture. After fencing one node the cluster came
> > > > back to life. Any idea what could have happened?
> > >
> > > Check 'dmesg' and 'cman_tool status'. Also look at /proc/slabinfo,
> > > specifically 'dlm_lkb' bits. There's a chance that you hit a bug
> > > that's already fixed. :)
> >
> > dlm_lkb 189628 195177 232 17 1 : tunables 120 60 8 : slabdata 11481 11481 384
> > nodea
> > dlm_lkb 2074114 2077587 232 17 1 : tunables 120 60 8 : slabdata 122211 122211 180
> > nodeb
> > dlm_lkb 454319 499392 232 17 1 : tunables 120 60 8 : slabdata 29376 29376 0
> > nodec
> > dlm_lkb 242144 251719 232 17 1 : tunables 120 60 8 : slabdata 14807 14807 480
> > noded
> > dlm_lkb 248672 286382 232 17 1 : tunables 120 60 8 : slabdata 16846 16846 212
> > nodef
> > dlm_lkb 62934 62934 232 17 1 : tunables 120 60 8 : slabdata 3702 3702 0
>
> You've hit "the bug".
>
> > > Need above information (and possibly more) to answer this.
> >
> > What more?? ;-)
>
> Nothing; test packages here:
>
> http://people.redhat.com/lhh/rgmanager-1.9.54-2.218112hf.i386.rpm
> http://people.redhat.com/lhh/rgmanager-1.9.54-2.218112hf.x86_64.rpm
> http://people.redhat.com/lhh/rgmanager-1.9.54-2.218112hf.src.rpm
>
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/ http://www.open-sharedroot.org/

** Visit us at CeBIT 2007 in Hannover/Germany **
** in Hall 5, Booth G48/2 (15.-21. of March) **

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany
Registergericht: Amtsgericht München
Registernummer: HRB 131682
USt.-Id.: DE209485962
Geschäftsführung: Marc Grimme, Mark Hlawatschek, Thomas Merz
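
For anyone who wants to repeat the /proc/slabinfo check suggested in the quoted thread on their own nodes, here is a minimal Python sketch. It assumes the slabinfo 2.x column layout seen in the quoted output (name, active objects, total objects, object size, ...). The names SlabStat, parse_slab_line and find_slab are made up for this illustration and are not part of any cluster tool. The thread does not say what count should be considered too high, so the script only prints the numbers.

```python
from typing import NamedTuple, Optional


class SlabStat(NamedTuple):
    """First columns of one /proc/slabinfo row (hypothetical helper type)."""
    name: str
    active_objs: int
    num_objs: int
    objsize: int


def parse_slab_line(line: str) -> Optional[SlabStat]:
    """Parse one slabinfo row; return None for header or malformed lines."""
    fields = line.split()
    if len(fields) < 4 or fields[0].startswith("#") or not fields[1].isdigit():
        return None
    return SlabStat(fields[0], int(fields[1]), int(fields[2]), int(fields[3]))


def find_slab(text: str, cache: str = "dlm_lkb") -> Optional[SlabStat]:
    """Return the row for the named slab cache, or None if it is absent."""
    for line in text.splitlines():
        stat = parse_slab_line(line)
        if stat is not None and stat.name == cache:
            return stat
    return None


if __name__ == "__main__":
    try:
        # Reading /proc/slabinfo usually requires root.
        with open("/proc/slabinfo") as fh:
            stat = find_slab(fh.read())
    except OSError as exc:
        print(f"cannot read /proc/slabinfo: {exc}")
    else:
        if stat is None:
            print("no dlm_lkb cache found (is the dlm module loaded?)")
        else:
            # Object payload only; slab overhead is not included.
            print(f"dlm_lkb: {stat.active_objs} active of {stat.num_objs} "
                  f"objects, {stat.active_objs * stat.objsize} bytes")
```

Running it on each node and comparing the active object counts gives the same kind of per-node picture as the quoted listing, where one node held about 2.07 million dlm_lkb objects of 232 bytes each.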