On Fri, Jun 27, 2008 at 01:41:17PM -0500, David Teigland wrote:
> On Fri, Jun 27, 2008 at 01:28:56PM -0400, david m. richter wrote:
> > i also have another setup in vmware; while i doubt it's
> > substantively different than bruce's, i'm a ready and willing tester.  is
> > there a different branch (or repo, or just a stack of patches somewhere)
> > that i should/could be using?
>
> If on 2.6.25, then use
>
> ftp://ftp%40openais%2Eorg:downloads@xxxxxxxxxxx/downloads/openais-0.80.3/openais-0.80.3.tar.gz
> ftp://sources.redhat.com/pub/cluster/releases/cluster-2.03.04.tar.gz
>
> If on 2.6.26-rc, then you'll need to add the attached patch to cluster.

I tried that patch against STABLE2, and needed the following to get it to
compile.

diff --git a/group/gfs_controld/plock.c b/group/gfs_controld/plock.c
index 5e4f56b..f04a6b8 100644
--- a/group/gfs_controld/plock.c
+++ b/group/gfs_controld/plock.c
@@ -790,7 +790,7 @@ static void write_result(struct mountgroup *mg, struct dlm_plock_info *in,
 	in->fsid = mg->associated_ls_id;
 	in->rv = rv;
 
-	write(control_fd, in, sizeof(struct gdlm_plock_info));
+	write(control_fd, in, sizeof(struct dlm_plock_info));
 }
 
 static void do_waiters(struct mountgroup *mg, struct resource *r)

I built everything with debugging turned on.  The second mount again
hangs, with a lot of this in the logs:

Jul 1 14:06:42 piglet2 kernel: dlm: connecting to 1
Jul 1 14:06:42 piglet2 kernel: dlm: connect from non cluster node
Jul 1 14:06:42 piglet2 kernel: dlm: connect from non cluster node
Jul 1 14:08:35 piglet2 kernel: INFO: task mount.gfs2:6130 blocked for more than 120 seconds.
Jul 1 14:08:35 piglet2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 1 14:08:35 piglet2 kernel: mount.gfs2    D c09f0244  1896  6130   6129
Jul 1 14:08:35 piglet2 kernel:  ce920bc4 00000046 ce9d28e0 c09f0244 6f5e11cb 00000621 ce9d2b40 ce9d2b40
Jul 1 14:08:35 piglet2 kernel:  00000046 cf167db8 ce9d28e0 0077d2a4 00000000 6fd5e46f 00000621 ce9d28e0
Jul 1 14:08:35 piglet2 kernel:  00000003 ce9e7874 00000002 7fffffff ce920bec c063cdc5 7fffffff ce920be0
Jul 1 14:08:35 piglet2 kernel: Call Trace:
Jul 1 14:08:35 piglet2 kernel:  [<c063cdc5>] schedule_timeout+0x75/0xb0
Jul 1 14:08:35 piglet2 kernel:  [<c0138ccd>] ? trace_hardirqs_on+0x9d/0x110
Jul 1 14:08:35 piglet2 kernel:  [<c063c60e>] wait_for_common+0x9e/0x110
Jul 1 14:08:35 piglet2 kernel:  [<c0116340>] ? default_wake_function+0x0/0x10
Jul 1 14:08:35 piglet2 kernel:  [<c063c712>] wait_for_completion+0x12/0x20
Jul 1 14:08:35 piglet2 kernel:  [<c01bdf06>] dlm_new_lockspace+0x766/0x7f0
Jul 1 14:08:35 piglet2 kernel:  [<c03b9734>] gdlm_mount+0x304/0x430
Jul 1 14:08:35 piglet2 kernel:  [<c03a7bcf>] gfs2_mount_lockproto+0x13f/0x160
Jul 1 14:08:35 piglet2 kernel:  [<c03ad252>] fill_super+0x3d2/0x6e0
Jul 1 14:08:35 piglet2 kernel:  [<c03a0df0>] ? gfs2_glock_cb+0x0/0x150
Jul 1 14:08:35 piglet2 kernel:  [<c01ade75>] ? disk_name+0x25/0x90
Jul 1 14:08:35 piglet2 kernel:  [<c016db3f>] get_sb_bdev+0xef/0x120
Jul 1 14:08:35 piglet2 kernel:  [<c0182435>] ? alloc_vfsmnt+0xd5/0x110
Jul 1 14:08:35 piglet2 kernel:  [<c03abe25>] gfs2_get_sb+0x15/0x40
Jul 1 14:08:35 piglet2 kernel:  [<c03ace80>] ? fill_super+0x0/0x6e0
Jul 1 14:08:35 piglet2 kernel:  [<c016d613>] vfs_kern_mount+0x53/0x120
Jul 1 14:08:35 piglet2 kernel:  [<c016d731>] do_kern_mount+0x31/0xc0
Jul 1 14:08:35 piglet2 kernel:  [<c0183626>] do_new_mount+0x56/0x80
Jul 1 14:08:35 piglet2 kernel:  [<c0183816>] do_mount+0x1c6/0x1f0
Jul 1 14:08:35 piglet2 kernel:  [<c0166c91>] ? cache_alloc_debugcheck_after+0x71/0x1a0
Jul 1 14:08:35 piglet2 kernel:  [<c014f69b>] ? __get_free_pages+0x1b/0x30
Jul 1 14:08:35 piglet2 kernel:  [<c01814ea>] ? copy_mount_options+0x2a/0x130
Jul 1 14:08:35 piglet2 kernel:  [<c01838aa>] sys_mount+0x6a/0xb0
Jul 1 14:08:35 piglet2 kernel:  [<c0103182>] syscall_call+0x7/0xb
Jul 1 14:08:35 piglet2 kernel: =======================
Jul 1 14:08:35 piglet2 kernel: 4 locks held by mount.gfs2/6130:
Jul 1 14:08:35 piglet2 kernel:  #0:  (&type->s_umount_key#20){--..}, at: [<c016ce66>] sget+0x176/0x360
Jul 1 14:08:35 piglet2 kernel:  #1:  (lmh_lock){--..}, at: [<c03a7ab0>] gfs2_mount_lockproto+0x20/0x160
Jul 1 14:08:35 piglet2 kernel:  #2:  (&ls_lock){--..}, at: [<c01bd7be>] dlm_new_lockspace+0x1e/0x7f0
Jul 1 14:08:35 piglet2 kernel:  #3:  (&ls->ls_in_recovery){--..}, at: [<c01bdd6f>] dlm_new_lockspace+0x5cf/0x7f0
Jul 1 14:10:44 piglet2 kernel: INFO: task mount.gfs2:6130 blocked for more than 120 seconds.
Jul 1 14:10:44 piglet2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 1 14:10:44 piglet2 kernel: mount.gfs2    D c09f0244  1896  6130   6129

So I gave up on this and tried going back to v2.6.25 and the suggested
cluster-2.03.04, but the second mount still hangs, and a sysrq-T trace
shows the mount system call hanging in dlm_new_lockspace().

Since I guess this is a known-working set of software versions, I'm
assuming there's something wrong with my setup....

It looks like dlm_new_lockspace() is waiting on dlm_recoverd, which is in
"D" state in dlm_rcom_status(), so I guess the second node isn't getting
some dlm reply it expects?

--b.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
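
The trace above has mount.gfs2 parked in wait_for_completion() inside
dlm_new_lockspace(), while dlm_recoverd sits in "D" state in
dlm_rcom_status().  A minimal C sketch of that blocking shape, built from
the generic kernel completion/kthread primitives rather than the real
fs/dlm code (the names recovery_done, wait_for_peer_status, run_recovery,
and new_lockspace_sketch are invented for illustration), would look
roughly like this:

#include <linux/completion.h>
#include <linux/err.h>
#include <linux/kthread.h>

/* Completion the mounting task blocks on, loosely analogous to what
 * dlm_new_lockspace() waits for while recovery runs. */
static DECLARE_COMPLETION(recovery_done);

/* Hypothetical stand-in for dlm_rcom_status(): blocks until the other
 * node sends back a status reply.  If that reply never arrives, this
 * never returns. */
static int wait_for_peer_status(void)
{
	/* ... send status request, sleep until a reply is queued ... */
	return 0;
}

/* Recovery thread, loosely playing the role of dlm_recoverd. */
static int run_recovery(void *unused)
{
	int error = wait_for_peer_status();	/* stuck here in the hang */

	complete(&recovery_done);		/* wakes up the mount */
	return error;
}

/* Simplified mount-side path: start the recovery thread, then block
 * until it signals completion -- the wait_for_completion() frame seen
 * in the mount.gfs2 stack trace. */
static int new_lockspace_sketch(void)
{
	struct task_struct *t;

	t = kthread_run(run_recovery, NULL, "recoverd_sketch");
	if (IS_ERR(t))
		return PTR_ERR(t);

	wait_for_completion(&recovery_done);	/* mount.gfs2 blocks here */
	return 0;
}

This only models the structure of the hang visible in the stack trace;
the actual dlm recovery protocol, and why the peer's reply never arrives,
aren't captured here.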