I think it's because clvmd is trying to acquire the iSCSI LUNs and the iSCSI driver has not come up fully yet. The network layer has to come up, then iSCSI; after that there is a separate mount step, with a filesystem option (_netdev) that tells mount to wait for these. I'm not sure whether the same capability exists in LVM to accommodate an iSCSI device that comes up late. That may be the reason the volumes are missed by LVM.
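For example, a GFS filesystem sitting on an iSCSI LUN would normally get an /etc/fstab entry something like the one below, so that it is mounted only after the network and the iSCSI service are up (the mount point here is just a placeholder, not something from your setup):

/dev/nasvg_00/lvol0   /mnt/nas   gfs   _netdev,defaults   0 0

I don't know of an equivalent hook for clvmd's volume activation at boot time.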
-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Paul Risenhoover
Sent: Tuesday, October 16, 2007 12:52 AM
To: Linux-cluster@xxxxxxxxxx
Subject: Linux clustering (one-node), GFS, iSCSI, clvmd (lock problem)

Hi All,

I'm a noob to this mailing list, but I've got some kind of locking problem with Linux clustering and iSCSI that plagues me. It's a pretty serious issue: every time I reboot my server, it fails to mount my primary iSCSI device out of the box, and I have to perform some fairly manual operations to get it working again.

Here is some configuration information:

Linux flax.xxx.com 2.6.9-55.0.9.ELsmp #1 SMP Thu Sep 27 18:27:41 EDT 2007 i686 i686 i386 GNU/Linux

[root@flax ~]# clvmd -V
Cluster LVM daemon version: 2.02.21-RHEL4 (2007-04-17)
Protocol version:           0.2.1

dmesg (excerpted):

iscsi-sfnet: Loading iscsi_sfnet version 4:0.1.11-3
iscsi-sfnet: Control device major number 254
iscsi-sfnet:host3: Session established
scsi3 : SFNet iSCSI driver
  Vendor: Promise   Model: VTrak M500i   Rev: 2211
  Type:   Direct-Access                  ANSI SCSI revision: 04
sdh : very big device. try to use READ CAPACITY(16).
SCSI device sdh: 5859373056 512-byte hdwr sectors (2999999 MB)
SCSI device sdh: drive cache: write back
sdh : very big device. try to use READ CAPACITY(16).
SCSI device sdh: 5859373056 512-byte hdwr sectors (2999999 MB)
SCSI device sdh: drive cache: write back
 sdh: unknown partition table

[root@flax ~]# clustat
Member Status: Quorate

  Member Name                  Status
  ------ ----                  ------
  flax                         Online, Local, rgmanager

YES, THIS IS A ONE-NODE CLUSTER (which, I suspect, might be the problem).

SYMPTOM:

When the server comes up, the clustered logical volume that is on the iSCSI device is labeled "inactive" when I do an "lvscan":

[root@flax ~]# lvscan
  inactive          '/dev/nasvg_00/lvol0' [5.46 TB] inherit
  ACTIVE            '/dev/lgevg_00/lvol0' [3.55 TB] inherit
  ACTIVE            '/dev/noraidvg_01/lvol0' [546.92 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol00' [134.47 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [1.94 GB] inherit

The interesting thing is that the lgevg_00 and noraidvg_01 volumes are also clustered, but they are direct-attached SCSI (i.e., not iSCSI).

The volume group that the logical volume is a member of shows clean:

[root@flax ~]# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "nasvg_00" using metadata type lvm2
  Found volume group "lgevg_00" using metadata type lvm2
  Found volume group "noraidvg_01" using metadata type lvm2

So, in order to fix this, I execute the following:

[root@flax ~]# lvchange -a y /dev/nasvg_00/lvol0
  Error locking on node flax: Volume group for uuid not found: oNhRO1WqNJp3BZxxrlMT16dwpwcRiIQPejnrEUbQ3HMJ6BjHef1hKAsoA6Sl9ISS

This also shows up in my syslog:

Oct 13 11:27:40 flax vgchange: Error locking on node flax: Volume group for uuid not found: oNhRO1WqNJp3BZxxrlMT16dwpwcRiIQPejnrEUbQ3HMJ6BjHef1hKAsoA6Sl9ISS

RESOLUTION:

It took me a very long time to figure this out, but since it happens to me every time I reboot my server, somebody's bound to run into this again sometime soon (and it will probably be me). Here's how I resolved it.

I edited /etc/lvm/lvm.conf as follows.

was:

# Type of locking to use. Defaults to local file-based locking (1).
# Turn locking off by setting to 0 (dangerous: risks metadata corruption
# if LVM2 commands get run concurrently).
# Type 2 uses the external shared library locking_library.
# Type 3 uses built-in clustered locking.
#locking_type = 1
locking_type = 3

changed to:

(snip)
# Type 3 uses built-in clustered locking.
#locking_type = 1
locking_type = 2

Then restart clvmd:

[root@flax ~]# service clvmd restart

Then:

[root@flax ~]# lvchange -a y /dev/nasvg_00/lvol0
[root@flax ~]#

(see, no error!)

[root@flax ~]# lvscan
  ACTIVE            '/dev/nasvg_00/lvol0' [5.46 TB] inherit
  ACTIVE            '/dev/lgevg_00/lvol0' [3.55 TB] inherit
  ACTIVE            '/dev/noraidvg_01/lvol0' [546.92 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol00' [134.47 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [1.94 GB] inherit

(it's active!)

Then go back and edit /etc/lvm/lvm.conf to restore the original locking_type = 3, and restart clvmd again.

THOUGHTS:

I admit I don't know much about clustering, but from the evidence I see, the problem appears to be isolated to clvmd and iSCSI, if only because my direct-attached clustered volumes don't exhibit the symptoms. I'll make another leap here and guess that it's probably isolated to single-node clusters, since I'd imagine that most people who are using clustering are using it as it was intended to be used (i.e., with multiple machines).
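For reference, the manual workaround described in the RESOLUTION section above can be scripted so it doesn't have to be repeated by hand after every reboot. This is only a rough, untested sketch, assuming /etc/lvm/lvm.conf contains the literal line "locking_type = 3" and that clvmd is managed by the stock init script; the LV path is the one from this thread:

#!/bin/sh
# Rough sketch of the manual workaround from this thread (untested).
# Temporarily switch LVM away from built-in cluster locking (type 3),
# activate the iSCSI-backed LV, then restore the original setting.

LV=/dev/nasvg_00/lvol0
CONF=/etc/lvm/lvm.conf

cp "$CONF" "$CONF.bak"                                 # keep a backup of lvm.conf
sed -i 's/locking_type = 3/locking_type = 2/' "$CONF"  # assumes this exact string is present
service clvmd restart
lvchange -a y "$LV"                                    # should now activate without the uuid error
cp "$CONF.bak" "$CONF"                                 # put locking_type = 3 back
service clvmd restart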