Gordan Bobic wrote:
As I suspected, the problem I'm seeing is indeed multi-part. The
first part is now resolved: large time skips caused by the system clock
being out of date until ntpd synced it up. It seems those large time
jumps made dlm choke.
Now for part 2:
The two nodes connect - certainly well enough to sync DRBD; that stage
goes through fine. They start cman and the other cluster components, but
it would appear they never actually find each other.
When mounting the shared file system:
Node 1:
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 54 of 197 blocks
GFS: fsid=sentinel:root.0: jid=0: replays = 54, skips = 36, sames = 107
GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
GFS: fsid=sentinel:root.0: jid=0: Done
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
GFS: fsid=sentinel:root.0: Scanning for log elements...
GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
GFS: fsid=sentinel:root.0: Found quota changes for 7 IDs
GFS: fsid=sentinel:root.0: Done
Node 2:
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 6 of 6 blocks
GFS: fsid=sentinel:root.0: jid=0: replays = 6, skips = 0, sames = 0
GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
GFS: fsid=sentinel:root.0: jid=0: Done
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
GFS: fsid=sentinel:root.0: Scanning for log elements...
GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
GFS: fsid=sentinel:root.0: Found quota changes for 2 IDs
GFS: fsid=sentinel:root.0: Done
Unless I'm reading this wrong, they are both trying to use JID 0.
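If I sketch the logic I think is at play (each mounter takes the lowest-numbered journal whose lock it can obtain - this is my reading of GFS behaviour, not its actual code), the only way both nodes end up on jid 0 is if their lock managers aren't sharing a lockspace:

```python
# Illustrative sketch, NOT GFS code: a mounter claims the lowest journal
# whose lock is free in the lock table it can see.
def claim_jid(lock_table, n_journals=2):
    for jid in range(n_journals):
        if jid not in lock_table:
            lock_table[jid] = True   # take the journal lock
            return jid
    raise RuntimeError("no free journal")

shared = {}                               # DLM connected: one shared lockspace
print(claim_jid(shared), claim_jid(shared))   # -> 0 1

a, b = {}, {}                             # DLM split: each node sees only its own locks
print(claim_jid(a), claim_jid(b))             # -> 0 0
```

Which would make the duplicate jid 0 a symptom of the nodes not seeing each other, rather than the cause.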
The second node to join generally chokes at some point during boot, but
AFTER it has mounted the GFS volume. On the node that is already up,
cman_tool status says:
# cman_tool status
Version: 6.0.1
Config Version: 20
Cluster Name: sentinel
Cluster Id: 28150
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1
Active subsystems: 6
Flags: 2node
Ports Bound: 0
Node name: sentinel1c
Node ID: 1
Multicast addresses: 239.192.109.100
Node addresses: 10.0.0.1
So the second node never joined.
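Though I realise DRBD syncing only proves unicast TCP works; cman joins over multicast (239.192.109.100 per the output above), and a switch dropping multicast would produce exactly this "Nodes: 1" symptom. A quick probe I could run, as a sketch (the group is from the cman_tool output; port 5405 is my assumption of the openais/cman default):

```python
# Minimal multicast probe: run "recv" on one node and plain send on the other.
# If the receiver times out, multicast isn't passing between the nodes.
import socket
import struct
import sys

GROUP, PORT = "239.192.109.100", 5405   # port is an assumption, not from the logs

def recv(timeout=10):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    # Join the multicast group on the default interface.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    s.settimeout(timeout)
    data, addr = s.recvfrom(1024)
    s.close()
    return data, addr

def send(payload=b"mcast-probe"):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
    s.sendto(payload, (GROUP, PORT))
    s.close()

if __name__ == "__main__":
    if sys.argv[1:] == ["recv"]:
        print(recv())
    else:
        send()
```

Watching `tcpdump -n udp port 5405` on one node while cman starts on the other would answer the same question without any scripting.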
I know for a fact that the network connection between them is working,
as they sync DRBD.
cluster.conf is here:
<?xml version="1.0"?>
<cluster config_version="20" name="sentinel">
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="sentinel1c" nodeid="1" votes="1">
      <com_info>
        <rootsource name="drbd"/>
        <!--<chrootenv mountpoint="/var/comoonics/chroot" fstype="ext3"
             device="/dev/sda2" chrootdir="/var/comoonics/chroot"/>-->
        <syslog name="localhost"/>
        <rootvolume name="/dev/drbd1"
            mountopts="defaults,noatime,nodiratime,noquota"/>
        <eth name="eth0" ip="10.0.0.1" mac="00:0B:DB:92:C5:E1"
            mask="255.255.255.0" gateway=""/>
        <fenceackserver user="root" passwd="password"/>
      </com_info>
      <fence>
        <method name="1">
          <device name="sentinel1d"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="sentinel2c" nodeid="2" votes="1">
      <com_info>
        <rootsource name="drbd"/>
        <!--<chrootenv mountpoint="/var/comoonics/chroot" fstype="ext3"
             device="/dev/sda2" chrootdir="/var/comoonics/chroot"/>-->
        <syslog name="localhost"/>
        <rootvolume name="/dev/drbd1"
            mountopts="defaults,noatime,nodiratime,noquota"/>
        <eth name="eth0" ip="10.0.0.2" mac="00:0B:DB:90:4E:1B"
            mask="255.255.255.0" gateway=""/>
        <fenceackserver user="root" passwd="password"/>
      </com_info>
      <fence>
        <method name="1">
          <device name="sentinel2d"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_drac" ipaddr="192.168.254.252"
        login="root" name="sentinel1d" passwd="password"/>
    <fencedevice agent="fence_drac" ipaddr="192.168.254.253"
        login="root" name="sentinel2d" passwd="password"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
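While pasting that in, I notice the file has two <cman> elements: the real one near the top carrying two_node/expected_votes, and a second, empty <cman/> further down. I don't know whether ccsd tolerates the duplicate or which one wins (assumption on my part that it could matter), but this sort of thing is cheap to catch mechanically. A throwaway checker, purely as a sketch - the specific rules below are my guesses, not an official validator:

```python
# Throwaway sanity check for cluster.conf; the rules are illustrative
# assumptions, not taken from any official schema.
import xml.etree.ElementTree as ET

def check_cluster_conf(xml_text):
    root = ET.fromstring(xml_text)
    warnings = []
    cman = root.findall("cman")
    if len(cman) > 1:
        warnings.append("%d <cman> elements; which one wins may be undefined"
                        % len(cman))
    for c in cman:
        if c.get("two_node") == "1" and c.get("expected_votes") not in (None, "1"):
            warnings.append("two_node=1 normally pairs with expected_votes=1")
    names = [n.get("name") for n in root.findall("clusternodes/clusternode")]
    if len(names) != len(set(names)):
        warnings.append("duplicate clusternode names")
    return warnings
```

Run against the config above, it flags the duplicate <cman> element and nothing else.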
What could be causing the nodes not to join the cluster?
A bit of additional information: when both nodes come up at the same
time, they actually sort out the journals between them correctly - one
gets jid 0, the other jid 1.
But almost immediately afterwards, this happens on the 2nd node:
dlm: closing connection to node 1
dlm: connect from non cluster node
shortly followed by DRBD keeling over:
drbd1: Handshake successful: DRBD Network Protocol version 86
drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
drbd1: conn( WFConnection -> WFReportParams )
drbd1: Discard younger/older primary did not found a decision
Using discard-least-changes instead
drbd1: State change failed: Device is held open by someone
drbd1: state = { cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown r--- }
drbd1: wanted = { cs:WFReportParams st:Secondary/Unknown ds:UpToDate/DUnknown r--- }
drbd1: helper command: /sbin/drbdadm pri-lost-after-sb
drbd1: Split-Brain detected, dropping connection!
drbd1: self 866625728B4E10B9:E4C3366683AFBC6B:ED24F75CC7B3F4A5:EFFAB6EF6A3CC469
drbd1: peer 572F799325FDF21D:E4C3366683AFBC6B:ED24F75CC7B3F4A4:EFFAB6EF6A3CC469
drbd1: conn( WFReportParams -> Disconnecting )
drbd1: helper command: /sbin/drbdadm split-brain
drbd1: error receiving ReportState, l: 4!
drbd1: asender terminated
drbd1: tl_clear()
drbd1: Connection closed
drbd1: conn( Disconnecting -> StandAlone )
drbd1: receiver terminated
At this point the 1st node seems to lock up, but despite fencing being
set up, the 2nd node never gets powered down. The fencing device is a
DRAC III ERA/O. Rebooting the 2nd node puts things back where they
started: it tries to use JID 0, which the 1st node already holds, and
things go wrong again.
I'm sure I must be missing something obvious here, but for the life of
me I cannot see what.
Gordan
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster