Gordan Bobic wrote:
As I suspected, the problem I'm seeing is indeed multi-part. The
first part is now resolved: large time skips caused by the system clock
being out of date until ntpd synced it up. It seems those large time
jumps made dlm choke.
Now for part 2:
The two nodes connect - certainly well enough to sync DRBD; that stage
goes through fine. They start cman and the other cluster components, but
it would appear they never actually find each other.
When mounting the shared file system:
Node 1:
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 54 of 197 blocks
GFS: fsid=sentinel:root.0: jid=0: replays = 54, skips = 36, sames = 107
GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
GFS: fsid=sentinel:root.0: jid=0: Done
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
GFS: fsid=sentinel:root.0: Scanning for log elements...
GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
GFS: fsid=sentinel:root.0: Found quota changes for 7 IDs
GFS: fsid=sentinel:root.0: Done
Node 2:
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 6 of 6 blocks
GFS: fsid=sentinel:root.0: jid=0: replays = 6, skips = 0, sames = 0
GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
GFS: fsid=sentinel:root.0: jid=0: Done
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
GFS: fsid=sentinel:root.0: Scanning for log elements...
GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
GFS: fsid=sentinel:root.0: Found quota changes for 2 IDs
GFS: fsid=sentinel:root.0: Done
Unless I'm reading this wrong, they are both trying to use JID 0.
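If I sketch the logic I think is at play (each mounter takes the lowest-numbered journal whose lock it can obtain - this is my reading of GFS behaviour, not its actual code), the only way both nodes end up on jid 0 is if their lock managers aren't sharing a lockspace:

```python
# Illustrative sketch, NOT GFS code: a mounter claims the lowest journal
# whose lock is free in the lock table it can see.
def claim_jid(lock_table, n_journals=2):
    for jid in range(n_journals):
        if jid not in lock_table:
            lock_table[jid] = True   # take the journal lock
            return jid
    raise RuntimeError("no free journal")

shared = {}                               # DLM connected: one shared lockspace
print(claim_jid(shared), claim_jid(shared))   # -> 0 1

a, b = {}, {}                             # DLM split: each node sees only its own locks
print(claim_jid(a), claim_jid(b))             # -> 0 0
```

Which would make the duplicate jid 0 a symptom of the nodes not seeing each other, rather than the cause.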
The second node to join generally chokes at some point during boot, but
AFTER it has mounted the GFS volume. On the node that is already up,
cman_tool status says:
# cman_tool status
Version: 6.0.1
Config Version: 20
Cluster Name: sentinel
Cluster Id: 28150
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1
Active subsystems: 6
Flags: 2node
Ports Bound: 0
Node name: sentinel1c
Node ID: 1
Multicast addresses: 239.192.109.100
Node addresses: 10.0.0.1
So the second node never joined.
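Though I realise DRBD syncing only proves unicast TCP works; cman joins over multicast (239.192.109.100 per the output above), and a switch dropping multicast would produce exactly this "Nodes: 1" symptom. A quick probe I could run, as a sketch (the group is from the cman_tool output; port 5405 is my assumption of the openais/cman default):

```python
# Minimal multicast probe: run "recv" on one node and plain send on the other.
# If the receiver times out, multicast isn't passing between the nodes.
import socket
import struct
import sys

GROUP, PORT = "239.192.109.100", 5405   # port is an assumption, not from the logs

def recv(timeout=10):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    # Join the multicast group on the default interface.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    s.settimeout(timeout)
    data, addr = s.recvfrom(1024)
    s.close()
    return data, addr

def send(payload=b"mcast-probe"):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
    s.sendto(payload, (GROUP, PORT))
    s.close()

if __name__ == "__main__":
    if sys.argv[1:] == ["recv"]:
        print(recv())
    else:
        send()
```

Watching `tcpdump -n udp port 5405` on one node while cman starts on the other would answer the same question without any scripting.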
I know for a fact that the network connection between them is working,
as they sync DRBD.
cluster.conf is here:
<?xml version="1.0"?>
<cluster config_version="20" name="sentinel">
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="sentinel1c" nodeid="1" votes="1">
      <com_info>
        <rootsource name="drbd"/>
        <!--<chrootenv mountpoint="/var/comoonics/chroot" fstype="ext3"
             device="/dev/sda2" chrootdir="/var/comoonics/chroot"/>-->
        <syslog name="localhost"/>
        <rootvolume name="/dev/drbd1"
            mountopts="defaults,noatime,nodiratime,noquota"/>
        <eth name="eth0" ip="10.0.0.1" mac="00:0B:DB:92:C5:E1"
            mask="255.255.255.0" gateway=""/>
        <fenceackserver user="root" passwd="password"/>
      </com_info>
      <fence>
        <method name="1">
          <device name="sentinel1d"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="sentinel2c" nodeid="2" votes="1">
      <com_info>
        <rootsource name="drbd"/>
        <!--<chrootenv mountpoint="/var/comoonics/chroot" fstype="ext3"
             device="/dev/sda2" chrootdir="/var/comoonics/chroot"/>-->
        <syslog name="localhost"/>
        <rootvolume name="/dev/drbd1"
            mountopts="defaults,noatime,nodiratime,noquota"/>
        <eth name="eth0" ip="10.0.0.2" mac="00:0B:DB:90:4E:1B"
            mask="255.255.255.0" gateway=""/>
        <fenceackserver user="root" passwd="password"/>
      </com_info>
      <fence>
        <method name="1">
          <device name="sentinel2d"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_drac" ipaddr="192.168.254.252"
        login="root" name="sentinel1d" passwd="password"/>
    <fencedevice agent="fence_drac" ipaddr="192.168.254.253"
        login="root" name="sentinel2d" passwd="password"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
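While pasting that in, I notice the file has two <cman> elements: the real one near the top carrying two_node/expected_votes, and a second, empty <cman/> further down. I don't know whether ccsd tolerates the duplicate or which one wins (assumption on my part that it could matter), but this sort of thing is cheap to catch mechanically. A throwaway checker, purely as a sketch - the specific rules below are my guesses, not an official validator:

```python
# Throwaway sanity check for cluster.conf; the rules are illustrative
# assumptions, not taken from any official schema.
import xml.etree.ElementTree as ET

def check_cluster_conf(xml_text):
    root = ET.fromstring(xml_text)
    warnings = []
    cman = root.findall("cman")
    if len(cman) > 1:
        warnings.append("%d <cman> elements; which one wins may be undefined"
                        % len(cman))
    for c in cman:
        if c.get("two_node") == "1" and c.get("expected_votes") not in (None, "1"):
            warnings.append("two_node=1 normally pairs with expected_votes=1")
    names = [n.get("name") for n in root.findall("clusternodes/clusternode")]
    if len(names) != len(set(names)):
        warnings.append("duplicate clusternode names")
    return warnings
```

Run against the config above, it flags the duplicate <cman> element and nothing else.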
What could be causing the nodes not to join the cluster?
A bit of additional information: when both nodes come up at the same
time, they actually sort out the journals between them correctly - one
gets jid 0, the other jid 1.
But almost immediately afterwards, this happens on the 2nd node:
dlm: closing connection to node 1
dlm: connect from non cluster node
shortly followed by DRBD keeling over:
drbd1: Handshake successful: DRBD Network Protocol version 86
drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
drbd1: conn( WFConnection -> WFReportParams )
drbd1: Discard younger/older primary did not found a decision
Using discard-least-changes instead
drbd1: State change failed: Device is held open by someone
drbd1: state = { cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown r--- }
drbd1: wanted = { cs:WFReportParams st:Secondary/Unknown ds:UpToDate/DUnknown r--- }
drbd1: helper command: /sbin/drbdadm pri-lost-after-sb
drbd1: Split-Brain detected, dropping connection!
drbd1: self 866625728B4E10B9:E4C3366683AFBC6B:ED24F75CC7B3F4A5:EFFAB6EF6A3CC469
drbd1: peer 572F799325FDF21D:E4C3366683AFBC6B:ED24F75CC7B3F4A4:EFFAB6EF6A3CC469
drbd1: conn( WFReportParams -> Disconnecting )
drbd1: helper command: /sbin/drbdadm split-brain
drbd1: error receiving ReportState, l: 4!
drbd1: asender terminated
drbd1: tl_clear()
drbd1: Connection closed
drbd1: conn( Disconnecting -> StandAlone )
drbd1: receiver terminated
At this point the 1st node seems to lock up, but despite fencing being
set up, the 2nd node never gets powered down. The fencing device is a
DRAC III ERA/O. Rebooting the 2nd node puts things back where they
started: it tries to use JID 0, which the 1st node already holds, and
things go wrong again.
I'm sure I must be missing something obvious here, but for the life of
me I cannot see what.
Gordan
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster