I'm having trouble with a new two-node CentOS 5 cluster (kernel 2.6.23). The clvmd daemon doesn't run correctly on one node, and that node cannot see the LVM objects, even though it can reach the underlying storage.

The cluster has shared SAN storage with 5 LUNs presented to the servers. Multiple HBAs and storage processors mean there are 4 paths to each LUN. DM-Multipath is used to create virtual devices for LUN access. Each server sees the LUNs at the SCSI level, can see the disk partitions via fdisk, and can see the same LUNs via multipath. The host-level storage configuration was done on node2, with the text files either edited in parallel on both nodes or copied from node2 to node1.

### Error Condition ###

The clvmd daemon does not run correctly on node1 of the two-node cluster. The error messages are:

    connect() failed on local socket: Permission denied
    WARNING: Falling back to local file-based locking.
    Volume Groups with the clustered attribute will be inaccessible.

At this point, node1 cannot access any shared objects (i.e., pvdisplay fails to show details about the physical volumes, etc.).

The daemon consistently starts without error on node2, whether node1 is running or not. The daemon consistently fails to start on node1, whether node2 is running or not. I've rebooted each node; the condition remains the same: only node2 can successfully start clvmd and access the LVM volumes.

The /var/log/messages entry on both nodes is identical:

    clvmd: Cluster LVM daemon started - connected to CMAN

but on node1 the process never gets out of the "clvmd -T20" state. There are no issues with SELinux blocking any LVM actions.

I'd really appreciate any suggestions about debugging this problem. Extensive notes and command output are given below.

Thanks,

Mark

---------------------------------------------------------------------------------

### Configuration Procedures ###

The following steps were run on each node:

    install dm-multipath and confirm that the kernel modules are loaded
    set up /etc/multipath.conf (aliasing the LUN WWIDs to logical names,
        blacklisting the WWIDs of the internal drive; an illustrative sketch
        is included below, after the DM Multipath Details section)
    stop multipathd
    remove any existing multipath device entries with "multipath -F"
    create /etc/lvm/lvm.conf (filtering the internal drive, filtering the
        /dev/sda devices; a sketch of the relevant stanzas is included below,
        under the Configuration section)

The following commands were run _only_ on node2:

    restart multipath, recreating the devices based on /etc/multipath.conf
    copy the /var/lib/multipath/bindings file from node2 to node1
    create physical volumes for LVM, as in:
        pvcreate -M2 -v /dev/mpath/home
    create volume groups, as in:
        vgcreate -c y home_vg /dev/mpath/home
    create logical volumes, as in:
        lvcreate -l 100%VG -n archive archive_vg

Once the volumes were created, the major/minor block numbers were made persistent (to help with NFS load balancing and fail-over across cluster nodes) using lvchange, as in:

    lvchange --persistent y --major 253 --minor 8 archive_vg/archive

Partition the logical volumes:

    fdisk was used to set the filesystem offset to 128 blocks (64KB) to avoid
    boundary crossing on the EMC array (a rough sketch of this step is included
    below, under the Configuration section).

Filesystems:

    All partitions are formatted as GFS filesystems, as in:
        gfs_mkfs -j 4 -O -p lock_dlm -t sbia-infr:archive /dev/archive_vg/archive

### Component Versions ###

clvmd version:
    Cluster LVM daemon version: 2.02.26-RHEL5 (2007-06-18)
    Protocol version:           0.2.1

pv* versions:
    Library version: 1.02.20 (2007-06-15)
    Driver version:  4.11.0

rgmanager:
    Version    : 2.0.31
    Vendor     : CentOS
    Release    : 1.el5.centos
    Build Date : Mon Nov 12 01:13:08 2007

#################################### Configuration ################################

The /etc/lvm/lvm.conf files are identical on both nodes. The locking value is set to:

    locking_type = 3
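To make the lvm.conf step from the procedures concrete, the relevant stanzas have roughly the following shape. This is a simplified sketch only: the filter regexes are illustrative placeholders, not the exact contents of the working file (which also has to leave the device holding VolGroup00 visible).

    devices {
        # Accept the multipath aliases and reject the raw single-path SCSI
        # devices, per the filtering described in the procedures above.
        # These patterns are placeholders, not the exact filter in use.
        filter = [ "a|^/dev/mpath/.*|", "r|^/dev/sda.*|" ]
    }

    global {
        # cluster-wide locking through clvmd (the value quoted above)
        locking_type = 3
    }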
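A rough sketch of the 64KB partition-alignment step from the procedures above, shown for one device. The device name and keystrokes are illustrative, not a transcript from these hosts, and fdisk's exact prompts may differ slightly.

    fdisk /dev/archive_vg/archive
       n      create primary partition 1 spanning the device
       x      switch to fdisk's expert menu
       b      move the beginning of data in partition 1
       128    new beginning of data: block 128 (64KB with 512-byte blocks)
       r      return to the main menu
       w      write the partition table and exit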
### DM Multipath Details ###

Both nodes have identical multipath configurations (/etc/multipath.conf, /dev/mpath/*, and /dev/dm-* are identical). Both nodes show the same devices from "fdisk -l":

[root@sbia-infr2 mpath]# fdisk -l /dev/mpath/*

Disk /dev/mpath/archive: 1407.4 GB, 1407450152960 bytes
255 heads, 63 sectors/track, 171112 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/mpath/archive doesn't contain a valid partition table

Disk /dev/mpath/cluster_shared: 288.1 GB, 288161071104 bytes
255 heads, 63 sectors/track, 35033 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/mpath/cluster_shared doesn't contain a valid partition table

Disk /dev/mpath/comp_space: 2017.1 GB, 2017127497728 bytes
255 heads, 63 sectors/track, 245235 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/mpath/comp_space doesn't contain a valid partition table

Disk /dev/mpath/home: 1152.6 GB, 1152644284416 bytes
255 heads, 63 sectors/track, 140134 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/mpath/home doesn't contain a valid partition table

Disk /dev/mpath/sbiaprj: 2017.1 GB, 2017127497728 bytes
255 heads, 63 sectors/track, 245235 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/mpath/sbiaprj doesn't contain a valid partition table
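For completeness, the /etc/multipath.conf used on both nodes has roughly the shape sketched below. The WWIDs here are placeholders; the real file blacklists the internal drive's WWIDs and maps each LUN's WWID to the logical names shown above (home, archive, cluster_shared, comp_space, sbiaprj).

    blacklist {
        wwid <WWID-of-internal-drive>        # placeholder
    }

    multipaths {
        multipath {
            wwid   <WWID-of-home-LUN>        # placeholder
            alias  home
        }
        # ...one similar stanza for each of the other four LUNs
    }

The flush ("multipath -F") and the restart that rebuilds the maps are the steps already listed in the procedures above; "multipath -ll" can then be used on each node to confirm that all four paths per LUN are present.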
### Example of clvmd, pvdisplay errors; group_tool, clustat, "clvmd -d" output ###

================ Node1 ====================================================

[root@sbia-infr1 lvm]# /etc/init.d/clvmd start
Starting clvmd:                                            [ OK ]
Activating VGs:   Logging initialised at Mon Jan 28 16:01:41 2008
  Set umask to 0077
  connect() failed on local socket: Permission denied
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.
  Finding all volume groups
  Finding volume group "home_vg"
  Skipping clustered volume group home_vg
  Finding volume group "cluster_shared_vg"
  Skipping clustered volume group cluster_shared_vg
  Finding volume group "sbiaprj_vg"
  Skipping clustered volume group sbiaprj_vg
  Finding volume group "comp_space_vg"
  Skipping clustered volume group comp_space_vg
  Finding volume group "archive_vg"
  Skipping clustered volume group archive_vg
  Finding volume group "VolGroup00"
  2 logical volume(s) in volume group "VolGroup00" already active
  2 existing logical volume(s) in volume group "VolGroup00" monitored
  Found volume group "VolGroup00"
  Found volume group "VolGroup00"
  Activated logical volumes in volume group "VolGroup00"
  2 logical volume(s) in volume group "VolGroup00" now active
  Wiping internal VG cache

[root@sbia-infr1]# group_tool -v
type             level name       id       state                node id local_done
fence            0     default    00010002 JOIN_START_WAIT      1 100020001 1
[1 2]
dlm              1     rgmanager  00030001 none
[1 2]
dlm              1     clvmd      00050001 none

[root@sbia-infr1 mpath]# clustat
Member Status: Quorate

  Member Name                                  ID   Status
  ------ ----                                  ---- ------
  sbia-infr2-admin.uphs.upenn.edu              1    Online, rgmanager
  sbia-infr1-admin.uphs.upenn.edu              2    Online, Local, rgmanager

  Service Name                 Owner (Last)                 State
  ------- ----                 ----- ------                 -----
  service:vweb                 (none)                       stopped

[root@sbia-infr1]# clvmd -d

[clvmd PRODUCES THE FOLLOWING DEBUGGING OUTPUT ON STARTUP]

CLVMD[4f71c4c0]: Jan 28 17:03:46 CLVMD started
CLVMD[4f71c4c0]: Jan 28 17:03:46 Connected to CMAN
CLVMD[4f71c4c0]: Jan 28 17:03:46 CMAN initialisation complete
CLVMD[4f71c4c0]: Jan 28 17:03:47 DLM initialisation complete
CLVMD[4f71c4c0]: Jan 28 17:03:47 Cluster ready, doing some more initialisation
CLVMD[4f71c4c0]: Jan 28 17:03:47 starting LVM thread
CLVMD[4f71c4c0]: Jan 28 17:03:47 clvmd ready for work
CLVMD[4f71c4c0]: Jan 28 17:03:47 Using timeout of 60 seconds
CLVMD[41001940]: Jan 28 17:03:47 LVM thread function started
Logging initialised at Mon Jan 28 17:03:47 2008
Set umask to 0077
CLVMD[41001940]: Jan 28 17:03:47 LVM thread waiting for work
CLVMD[4f71c4c0]: Jan 28 17:04:49 Got port closed message, removing node sbia-infr2-admin.uphs.upenn.edu
CLVMD[4f71c4c0]: Jan 28 17:05:04 add_to_lvmqueue: cmd=0x884bc0. client=0x65c8a0, msg=0x7fff5b3a91ac, len=30, csid=0x7fff5b3a90f4, xid=0
CLVMD[41001940]: Jan 28 17:05:04 process_work_item: remote
CLVMD[41001940]: Jan 28 17:05:04 process_remote_command 2 for clientid 0x0 XID 128 on node sbia-infr2-admin.uphs.upenn.edu
CLVMD[41001940]: Jan 28 17:05:04 Remote node sbia-infr2-admin.uphs.upenn.edu is version 0.2.1
CLVMD[41001940]: Jan 28 17:05:04 Added new node 1 to updown list
CLVMD[41001940]: Jan 28 17:05:04 LVM thread waiting for work

[AT THIS POINT, THERE IS NO OUTPUT FROM clvmd IN RESPONSE TO "pvdisplay"]

=============================================================================

================ Node2 ====================================================

[root@sbia-infr2 lvm]# /etc/init.d/clvmd start
Starting clvmd:                                            [ OK ]
Activating VGs:   Logging initialised at Mon Jan 28 16:01:41 2008
  Set umask to 0077
  Finding all volume groups
  Finding volume group "home_vg"
  Activated logical volumes in volume group "home_vg"
  1 logical volume(s) in volume group "home_vg" now active
  Finding volume group "cluster_shared_vg"
  Activated logical volumes in volume group "cluster_shared_vg"
  1 logical volume(s) in volume group "cluster_shared_vg" now active
  Finding volume group "sbiaprj_vg"
  Activated logical volumes in volume group "sbiaprj_vg"
  1 logical volume(s) in volume group "sbiaprj_vg" now active
  Finding volume group "comp_space_vg"
  Activated logical volumes in volume group "comp_space_vg"
  1 logical volume(s) in volume group "comp_space_vg" now active
  Finding volume group "archive_vg"
  Activated logical volumes in volume group "archive_vg"
  1 logical volume(s) in volume group "archive_vg" now active
  Finding volume group "VolGroup00"
  2 logical volume(s) in volume group "VolGroup00" already active
  2 existing logical volume(s) in volume group "VolGroup00" monitored
  Activated logical volumes in volume group "VolGroup00"
  2 logical volume(s) in volume group "VolGroup00" now active
  Wiping internal VG cache

[root@sbia-infr2 lvm]# group_tool -v
type             level name       id       state                node id local_done
fence            0     default    00010001 JOIN_START_WAIT      2 200020001 1
[1 2]
dlm              1     rgmanager  00030001 none
[1 2]
dlm              1     clvmd      00050001 none
[1 2]

[root@sbia-infr2 mpath]# clustat
Member Status: Quorate

  Member Name                                  ID   Status
  ------ ----                                  ---- ------
  sbia-infr2-admin.uphs.upenn.edu              1    Online, Local, rgmanager
  sbia-infr1-admin.uphs.upenn.edu              2    Online, rgmanager

  Service Name                 Owner (Last)                 State
  ------- ----                 ----- ------                 -----
  service:vweb                 (none)                       stopped

[root@sbia-infr2]# clvmd -d

[ISSUING "pvdisplay" IN ANOTHER WINDOW ON THE SAME SERVER PRODUCES THE FOLLOWING DEBUGGING OUTPUT FROM clvmd]

CLVMD[3b52c4c0]: Jan 28 17:05:03 CLVMD started
CLVMD[3b52c4c0]: Jan 28 17:05:03 Connected to CMAN
CLVMD[3b52c4c0]: Jan 28 17:05:03 CMAN initialisation complete
CLVMD[3b52c4c0]: Jan 28 17:05:04 DLM initialisation complete
CLVMD[3b52c4c0]: Jan 28 17:05:04 Cluster ready, doing some more initialisation
CLVMD[3b52c4c0]: Jan 28 17:05:04 starting LVM thread
CLVMD[41001940]: Jan 28 17:05:04 LVM thread function started
CLVMD[3b52c4c0]: Jan 28 17:05:04 clvmd ready for work
CLVMD[3b52c4c0]: Jan 28 17:05:04 Using timeout of 60 seconds
Logging initialised at Mon Jan 28 17:05:04 2008
Set umask to 0077
File descriptor 5 left open
Logging initialised at Mon Jan 28 17:05:04 2008
Finding all logical volumes
Wiping internal VG cache
CLVMD[41001940]: Jan 28 17:05:04 LVM thread waiting for work
CLVMD[3b52c4c0]: Jan 28 17:08:48 Got new connection on fd 9
CLVMD[3b52c4c0]: Jan 28 17:08:48 Read on local socket 9, len = 30
CLVMD[3b52c4c0]: Jan 28 17:08:48 creating pipe, [10, 11]
CLVMD[3b52c4c0]: Jan 28 17:08:48 Creating pre&post thread
CLVMD[3b52c4c0]: Jan 28 17:08:48 Created pre&post thread, state = 0
CLVMD[41802940]: Jan 28 17:08:48 in sub thread: client = 0x884bc0
CLVMD[41802940]: Jan 28 17:08:48 Sub thread ready for work.
CLVMD[41802940]: Jan 28 17:08:48 doing PRE command LOCK_VG 'V_home_vg' at 1 (client=0x884bc0)
CLVMD[41802940]: Jan 28 17:08:48 sync_lock: 'V_home_vg' mode:3 flags=0
CLVMD[41802940]: Jan 28 17:08:48 sync_lock: returning lkid 430001
CLVMD[41802940]: Jan 28 17:08:48 Writing status 0 down pipe 11
CLVMD[41802940]: Jan 28 17:08:48 Waiting to do post command - state = 0
CLVMD[3b52c4c0]: Jan 28 17:08:48 read on PIPE 10: 4 bytes: status: 0
CLVMD[3b52c4c0]: Jan 28 17:08:48 background routine status was 0, sock_client=0x884bc0
CLVMD[3b52c4c0]: Jan 28 17:08:48 distribute command: XID = 0
CLVMD[3b52c4c0]: Jan 28 17:08:48 add_to_lvmqueue: cmd=0x884ff0. client=0x884bc0, msg=0x8785d0, len=30, csid=(nil), xid=0
CLVMD[41001940]: Jan 28 17:08:48 process_work_item: local
CLVMD[41001940]: Jan 28 17:08:48 process_local_command: msg=0x885030, msglen =30, client=0x884bc0

[HUNDREDS OF LINES OF OUTPUT DELETED...]

=============================================================================

-----
Mark Bergman
http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40merctech.com