On Tue, Sep 21, 2004 at 08:39:21PM +0200, Lazar Obradovic wrote:
> Hi,
>
> I more often than not have a problem when starting clvmd. It starts
> normally, but /proc/cluster/services says:
>
> # cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 run       -
> [5 7 6 4 2 3 1]
>
> DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,7
> []
>
>
> while other nodes report:
>
> # cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 run       -
> [4 2 5 3 6 7]
>
> DLM Lock Space:  "clvmd"                             1   3 update    U-4,1,7
> [4 2 5 3 6 7]
>
> vgchange will hang afterwards and only a reboot would (eventually) fix the
> problem. Other nodes are working just fine in the meantime...
> What do "code" flags *exactly* mean?

for update events beginning with "U-"
4 = ue_state = UEST_JSTART_SERVICEWAIT
1 = ue_flags = UEFL_ALLOW_STARTDONE
7 = ue_nodeid = nodeid of node joining or leaving the sg

SM is waiting for the dlm service to complete recovery.  The dlm on nodes
[4 2 5 3 6 7] is still in the process of recovery due to node 7 joining the
lockspace.  If it stays this way for long, it probably means that dlm
recovery is hung for some reason.  dmesg or /proc/cluster/dlm_debug should
show roughly how far the dlm recovery got.

for service events beginning with "S-"
1 = se_state = SEST_JOIN_BEGIN
80 = se_flags = SEFL_DELAY
7 = se_reply_count = number of replies received

SM will not permit this node to join the lockspace because the lockspace in
question is still doing recovery.  Once recovery completes, this node will
go ahead and join.

-- 
Dave Teigland  <teigland@xxxxxxxxxx>
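
For reference, the decoding described above could be sketched as a small
script. This is only an illustration, not part of the cluster tools: the
function name decode_code and the lookup tables are hypothetical, and the
tables contain only the four values named in this message. Field values are
kept as plain strings because the message does not say which number base the
flags are printed in.

```python
# Field names per event type, in the order given in the explanation above.
FIELDS = {
    "U": ("ue_state", "ue_flags", "ue_nodeid"),       # update events
    "S": ("se_state", "se_flags", "se_reply_count"),  # service events
}

# Only the values named in the message; anything else is left undecoded.
KNOWN = {
    ("ue_state", "4"): "UEST_JSTART_SERVICEWAIT",
    ("ue_flags", "1"): "UEFL_ALLOW_STARTDONE",
    ("se_state", "1"): "SEST_JOIN_BEGIN",
    ("se_flags", "80"): "SEFL_DELAY",
}


def decode_code(code):
    """Split a Code column value such as 'U-4,1,7' into labelled fields.

    Returns a list of (field_name, raw_value, symbolic_name_or_None).
    """
    kind, sep, rest = code.partition("-")
    if not sep or kind not in FIELDS:
        raise ValueError(f"unrecognised code: {code!r}")
    values = rest.split(",")
    names = FIELDS[kind]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields in {code!r}")
    return [(n, v, KNOWN.get((n, v))) for n, v in zip(names, values)]


if __name__ == "__main__":
    for code in ("U-4,1,7", "S-1,80,7"):
        print(code)
        for name, value, meaning in decode_code(code):
            suffix = f" = {meaning}" if meaning else ""
            print(f"  {value} = {name}{suffix}")
```

A bare "-" in the Code column (as on the Fence Domain lines) carries no event
and is rejected by this sketch with a ValueError.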