On 03/10/14 10:35 AM, Daniel Dehennin wrote:
Hello,
I'm trying to set up pacemaker+corosync on Debian Wheezy to access a SAN
for an OpenNebula cluster.
As I'm new to the cluster world, I have a hard time figuring out why
things sometimes go really wrong and where I should look for answers.
My OpenNebula frontend, running in a VM, does not manage to start its
resources, and my syslog is full of:
#+begin_src
ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
#+end_src
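As a basic sanity check (nothing OpenNebula-specific, just confirming the
control daemons are alive and pulling their recent log lines for context):
#+begin_src
# Check that the DLM/OCFS2/cLVM control daemons are running on this node
ps -e | grep -E 'dlm_controld|ocfs2_controld|clvmd'
# Pull their recent syslog messages for context around the error
grep -E 'dlm_controld|ocfs2_controld' /var/log/syslog | tail -50
#+end_src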
When this happens, the other nodes have problems:
#+begin_src
root@nebula3:~# LANG=C vgscan
cluster request failed: Host is down
Unable to obtain global lock.
#+end_src
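While clvmd is stuck like this, the LVM metadata can still be inspected by
disabling cluster locking for a single read-only command, for example:
#+begin_src
# Bypass cluster locking for one read-only query while clvmd is stuck;
# locking_type = 0 disables locking entirely, so never use it for changes.
root@nebula3:~# vgs --config 'global { locking_type = 0 }'
# Check whether clvmd itself is still alive on this node
root@nebula3:~# pgrep -l clvmd
#+end_src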
But things look fine in “crm_mon”:
#+begin_src
root@nebula3:~# crm_mon -1
============
Last updated: Fri Oct 3 16:25:43 2014
Last change: Fri Oct 3 14:51:59 2014 via cibadmin on nebula1
Stack: openais
Current DC: nebula3 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
5 Nodes configured, 5 expected votes
32 Resources configured.
============
Node quorum: standby
Online: [ nebula3 nebula2 nebula1 ]
OFFLINE: [ one ]
Stonith-nebula3-IPMILAN (stonith:external/ipmi): Started nebula2
Stonith-nebula2-IPMILAN (stonith:external/ipmi): Started nebula3
Stonith-nebula1-IPMILAN (stonith:external/ipmi): Started nebula2
Clone Set: ONE-Storage-Clone [ONE-Storage]
Started: [ nebula1 nebula3 nebula2 ]
Stopped: [ ONE-Storage:3 ONE-Storage:4 ]
Quorum-Node (ocf::heartbeat:VirtualDomain): Started nebula3
Stonith-Quorum-Node (stonith:external/libvirt): Started nebula3
#+end_src
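Since crm_mon only shows Pacemaker's view, I also compare it against
corosync's own membership. On the corosync 1.x/openais stack that Wheezy
ships, something like this should show the ring status and member list:
#+begin_src
# Ring status as corosync sees it
root@nebula3:~# corosync-cfgtool -s
# Totem membership from the corosync 1.x object database
root@nebula3:~# corosync-objctl | grep member
#+end_src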
I don't know how to interpret the dlm_tool output:
#+begin_src
root@nebula3:~# dlm_tool ls -n
dlm lockspaces
name CCB10CE8D4FF489B9A2ECB288DACF2D7
id 0x09250e49
flags 0x00000008 fs_reg
change member 3 joined 1 remove 0 failed 0 seq 2,2
members 1189587136 1206364352 1223141568
all nodes
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
name clvmd
id 0x4104eefa
flags 0x00000000
change member 3 joined 0 remove 1 failed 0 seq 4,4
members 1189587136 1206364352 1223141568
all nodes
nodeid 1172809920 member 0 failed 0 start 0 seq_add 3 seq_rem 4 check none
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
#+end_src
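The only pattern I can spot is that the node IDs look like auto-generated
corosync ones (the ring0 IPv4 address packed into 32 bits), so the extra
nodeid 1172809920 in the clvmd lockspace, with member 0 and seq_rem 4,
would be a node that already left. Assuming little-endian byte order, a
nodeid can be decoded back to an address with plain shell arithmetic:
#+begin_src
# Decode an auto-generated corosync nodeid back to a dotted quad.
# Assumes the nodeid is the ring0 IPv4 address stored little-endian.
nodeid=1189587136
printf '%d.%d.%d.%d\n' \
    $((  nodeid        & 0xff )) \
    $(( (nodeid >> 8)  & 0xff )) \
    $(( (nodeid >> 16) & 0xff )) \
    $(( (nodeid >> 24) & 0xff ))
# prints 192.168.231.70 under that assumption
#+end_src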
Is there any documentation on troubleshooting DLM/cLVM?
Regards.
Can you paste your full pacemaker config and the logs from the other
nodes starting just before the lost node went away?
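If it helps, "crm configure show" will dump the whole config, and
crm_report can gather the logs from all nodes in one go; something like
(the timestamp is just an example, set it shortly before the node was
lost):
#+begin_src
# Dump the full CIB configuration in crm shell syntax
root@nebula3:~# crm configure show
# Collect logs and cluster state from all nodes into a tarball,
# starting just before the node went away
root@nebula3:~# crm_report -f "2014-10-03 14:00:00" /tmp/cluster-report
#+end_src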
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?