Re: DLM won't (stay) running

Andrew Price <anprice@xxxxxxxxxx> · Wed, 9 May 2018 11:26:13 +0100

[linux-cluster@ isn't really used nowadays; CCing users@clusterlabs]

On 08/05/18 12:18, Jason Gauthier wrote:
Greetings,

    I'm working on a setup of a two-node cluster with shared storage.
I've been able to see the storage on both nodes, and appropriate
configuration for fencing the block device.

The next step was getting DLM and GFS2 in a clone group to mount the
FS on both drives.  This is where I am running into trouble.

As far as the OS goes, it's debian.  I'm using pacemaker, corosync,
and crm for cluster management.

Is it safe to assume that you're using Debian Wheezy? (The need for
gfs_controld disappeared in the 3.3 kernel.) As Wheezy goes end-of-life
at the end of the month, I would suggest upgrading; you will likely find
the cluster tools more user-friendly and the components more stable.

Andy

At the moment, I've removed the gfs2 parts just to try and get dlm working.

My current config looks like this:

node 1084772368: alpha
node 1084772369: beta
primitive p_dlm_controld ocf:pacemaker:controld \
         op monitor interval=60 timeout=60 \
         meta target-role=Started args=-K
primitive p_gfs_controld ocf:pacemaker:controld \
         params daemon=gfs_controld \
         meta target-role=Started
primitive stonith_sbd stonith:external/sbd \
         params pcmk_delay_max=30 sbd_device="/dev/sdb1"
group g_gfs2 p_dlm_controld p_gfs_controld
clone cl_gfs2 g_gfs2 \
         meta interleave=true target-role=Started
property cib-bootstrap-options: \
         have-watchdog=false \
         dc-version=1.1.16-94ff4df \
         cluster-infrastructure=corosync \
         cluster-name=zeta \
         last-lrm-refresh=1525523370 \
         stonith-enabled=true \
         stonith-timeout=20s

When I bring the resources up, I get a quick blip in my logs:
May  8 07:13:58 beta dlm_controld[9425]: 253556 dlm_controld 4.0.7 started
May  8 07:14:00 beta kernel: [253558.641658] dlm: closing connection to node 1084772369
May  8 07:14:00 beta kernel: [253558.641764] dlm: closing connection to node 1084772368

This is the same messaging I see when I run dlm manually and then stop
it.  My challenge here is that I cannot find out what dlm is doing.
I've tried adding -K to /etc/default/dlm, but I don't think that file
is being respected. I would like to figure out how to increase the
verbose output of dlm_controld so I can see why it won't stay running
when it's launched through the cluster. I haven't been able to
figure out how to pass arguments directly to the daemon in the
primitive config, if it's even possible.  Otherwise, I would try to
pass -K there.
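
For reference, a minimal sketch of how an option might be handed to the
daemon from the primitive. It assumes the ocf:pacemaker:controld agent
exposes an "args" instance parameter, which would be set under params
rather than meta (in the config above it sits under meta, where the
agent would not see it). The parameter's default differs between agent
versions, so it is worth checking "crm ra info ocf:pacemaker:controld"
before relying on this:

primitive p_dlm_controld ocf:pacemaker:controld \
         params args="-K" \
         op monitor interval=60 timeout=60 \
         meta target-role=Started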

Thanks!

Jason
