On Mon, Apr 18, 2011 at 9:49 AM, Terry <td3201@xxxxxxxxx> wrote:
> On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
>> On 18/04/11 15:11, Terry wrote:
>>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
>>>> On 18/04/11 14:38, Terry wrote:
>>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
>>>>>> On 17/04/11 21:52, Terry wrote:
>>>>>>> As a result of a strange situation where our licensing for storage
>>>>>>> dropped off, I need to join a centos 5.6 node to a now single node
>>>>>>> cluster. I got it joined to the cluster but I am having issues with
>>>>>>> CLVMD. Any lvm operations on both boxes hang. For example, vgscan.
>>>>>>> I have increased debugging and I don't see any logs. The VGs aren't
>>>>>>> being populated in /dev/mapper. This WAS working right after I joined
>>>>>>> it to the cluster and now it's not for some unknown reason. Not sure
>>>>>>> where to take this at this point. I did find one weird startup log
>>>>>>> that I am not sure what it means yet:
>>>>>>>
>>>>>>> [root@omadvnfs01a ~]# dmesg | grep dlm
>>>>>>> dlm: no local IP address has been set
>>>>>>> dlm: cannot start dlm lowcomms -107
>>>>>>> dlm: Using TCP for communications
>>>>>>> dlm: connecting to 2
>>>>>>
>>>>>> That message usually means that dlm_controld has failed to start. Try
>>>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D
>>>>>> switch and read the output which might give some clues to why it's not
>>>>>> working.
>>>>>>
>>>>>> Chrissie
>>>>>
>>>>> Hi Chrissie,
>>>>>
>>>>> I thought of that but I see dlm started on both nodes. See right below.
>>>>>
>>>>>>> [root@omadvnfs01a ~]# ps xauwwww | grep dlm
>>>>>>> root  5476  0.0  0.0  24736  760 ?  Ss  15:34  0:00 /sbin/dlm_controld
>>>>>>> root  5502  0.0  0.0      0    0 ?  S<  15:34  0:00
>>>>
>>>> Well, that's encouraging in a way! But it's evidently not started fully
>>>> or the DLM itself would be working. So I still recommend starting it
>>>> with -D to see how far it gets.
>>>>
>>>> Chrissie
>>>
>>> I think we had posts cross. Here's my latest:
>>>
>>> Ok, started all the CMAN elements manually as you suggested. I
>>> started them in order as in the init script. Here's the only error
>>> that I see. I can post the other debug messages if you think they'd
>>> be useful but this is the only one that stuck out to me.
>>>
>>> [root@omadvnfs01a ~]# /sbin/dlm_controld -D
>>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>>> 1303134840 set_ccs_options 480
>>> 1303134840 cman: node 2 added
>>> 1303134840 set_configfs_node 2 10.198.1.111 local 0
>>> 1303134840 cman: node 3 added
>>> 1303134840 set_configfs_node 3 10.198.1.110 local 1
>>
>> Can I see the whole set please? It looks like dlm_controld might be
>> stalled registering with groupd.
>>
>> Chrissie
>
> Here you go. Thank you very much for the help. Each daemon's output
> that I started is below.
>
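To be explicit about "started them in order as in the init script" above: I tried to mirror what the cman init script does, but I am reconstructing the exact sequence from memory, so treat the following as a rough sketch rather than a verified procedure (the cman_tool and fence_tool invocations in particular are from memory, not copied from my shell history):

  /sbin/ccsd -n          # config daemon, kept in the foreground
  cman_tool join -w      # join the cluster and wait for membership
  /sbin/groupd -D
  /sbin/fenced -D
  /sbin/dlm_controld -D
  /sbin/gfs_controld -D
  fence_tool join        # ask fenced to join the default fence domain

The timestamps in the output below (1303134809 for groupd, 1303134822 for fenced, 1303134840 for dlm_controld, 1303134853 for gfs_controld, 1303134861 for the fence join) line up with that ordering; each daemon stays in the foreground with -n/-D, hence the separate capture per daemon.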
> [root@omadvnfs01a log]# /sbin/ccsd -n
> Starting ccsd 2.0.115:
> Built: Mar 6 2011 00:47:03
> Copyright (C) Red Hat, Inc. 2004 All rights reserved.
> No Daemon:: SET
>
> cluster.conf (cluster name = omadvnfs01, version = 71) found.
> Remote copy of cluster.conf is from quorate node.
> Local version # : 71
> Remote version #: 71
> Remote copy of cluster.conf is from quorate node.
> Local version # : 71
> Remote version #: 71
> Remote copy of cluster.conf is from quorate node.
> Local version # : 71
> Remote version #: 71
> Remote copy of cluster.conf is from quorate node.
> Local version # : 71
> Remote version #: 71
> Initial status:: Quorate
>
> [root@omadvnfs01a ~]# /sbin/fenced -D
> 1303134822 cman: node 2 added
> 1303134822 cman: node 3 added
> 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc
> 1303134822 listen 4 member 5 groupd 7
> 1303134861 client 3: join default
> 1303134861 delay post_join 3s post_fail 0s
> 1303134861 added 2 nodes from ccs
> 1303134861 setid default 65537
> 1303134861 start default 1 members 2 3
> 1303134861 do_recovery stop 0 start 1 finish 0
> 1303134861 finish default 1
>
> [root@omadvnfs01a ~]# /sbin/dlm_controld -D
> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
> 1303134840 set_ccs_options 480
> 1303134840 cman: node 2 added
> 1303134840 set_configfs_node 2 10.198.1.111 local 0
> 1303134840 cman: node 3 added
> 1303134840 set_configfs_node 3 10.198.1.110 local 1
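Side note on the two "opendir failed: 2" lines above: they sent me poking at the dlm configfs tree by hand. Assuming I am reading the paths from the debug output correctly, and that the comms entries expose the usual nodeid/local attributes, something like this should show whether dlm_controld really created a comms entry per node and flagged node 3 as local:

  ls /sys/kernel/config/dlm/cluster/comms
  for d in /sys/kernel/config/dlm/cluster/comms/*; do
      echo "$d: nodeid=$(cat $d/nodeid) local=$(cat $d/local)"
  done

If no entry ends up with local=1, that would match the "dlm: no local IP address has been set" error from dmesg in my first post.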
>
> [root@omadvnfs01a ~]# /sbin/groupd -D
> 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1
> 1303134809 setup_cpg groupd_handle 6b8b456700000000
> 1303134809 groupd confchg total 2 left 0 joined 1
> 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1
> 1303134822 client connection 3
> 1303134822 got client 3 setup
> 1303134822 setup fence 0
> 1303134840 client connection 4
> 1303134840 got client 4 setup
> 1303134840 setup dlm 1
> 1303134853 client connection 5
> 1303134853 got client 5 setup
> 1303134853 setup gfs 2
> 1303134861 got client 3 join
> 1303134861 0:default got join
> 1303134861 0:default is cpg client 6 name 0_default handle 6633487300000001
> 1303134861 0:default cpg_join ok
> 1303134861 0:default waiting for first cpg event
> 1303134861 client connection 7
> 1303134861 0:default waiting for first cpg event
> 1303134861 got client 7 get_group
> 1303134861 0:default waiting for first cpg event
> 1303134861 0:default waiting for first cpg event
> 1303134861 0:default confchg left 0 joined 1 total 2
> 1303134861 0:default process_node_join 3
> 1303134861 0:default cpg add node 2 total 1
> 1303134861 0:default cpg add node 3 total 2
> 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1
> 1303134861 0:default queue join event for nodeid 3
> 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN
> 1303134861 0:default app node init: add 3 total 1
> 1303134861 0:default app node init: add 2 total 2
> 1303134861 0:default waiting for 1 more stopped messages before JOIN_ALL_STOPPED
> 3
> 1303134861 0:default mark node 2 stopped
> 1303134861 0:default set global_id 10001 from 2
> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STOPPED
> 1303134861 0:default action for app: setid default 65537
> 1303134861 0:default action for app: start default 1 2 2 2 3
> 1303134861 client connection 7
> 1303134861 got client 7 get_group
> 1303134861 0:default mark node 2 started
> 1303134861 client connection 7
> 1303134861 got client 7 get_group
> 1303134861 got client 3 start_done
> 1303134861 0:default send started
> 1303134861 0:default mark node 3 started
> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STARTED
> 1303134861 0:default action for app: finish default 1
> 1303134862 client connection 7
> 1303134862 got client 7 get_group
>
> [root@omadvnfs01a ~]# /sbin/gfs_controld -D
> 1303134853 config_no_withdraw 0
> 1303134853 config_no_plock 0
> 1303134853 config_plock_rate_limit 100
> 1303134853 config_plock_ownership 0
> 1303134853 config_drop_resources_time 10000
> 1303134853 config_drop_resources_count 10
> 1303134853 config_drop_resources_age 10000
> 1303134853 protocol 1.0.0
> 1303134853 listen 3
> 1303134853 cpg 6
> 1303134853 groupd 7
> 1303134853 uevent 8
> 1303134853 plocks 10
> 1303134853 plock need_fsid_translation 1
> 1303134853 plock cpg message size: 336 bytes
> 1303134853 setup done

Another gap I just found: I forgot to specify a fencing method for the new CentOS node. I put that in, and now the RHEL node wants to fence the new node, so I am letting it do that and then I'll see where I end up.
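For my own notes, once the fence completes I plan to sanity-check membership with roughly the following (standard cman/cluster2 tooling, so hopefully I have the command names right):

  cman_tool status    # quorum state and config version on the surviving node
  cman_tool nodes     # are both nodes listed as members again?
  group_tool ls       # fence, dlm and gfs groups should settle into state "none"
  service cman start  # on the CentOS node, once it comes back from being fenced
  service clvmd start # clustered LVM daemon, then retry vgscan on both nodes

If clvmd still hangs after that, at least fencing and group membership should be ruled out as the cause.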