Dear linux-cluster,
I have made some observations about the behaviour of GFS2 and would
appreciate confirmation of whether this is expected behaviour or
whether something has gone wrong.
I have a three-node cluster -- let's call the nodes A, B and C. On each
of nodes A and B, I have a loop that repeatedly writes an increasing
integer value to its own file on the GFS2 mount point. On node C, I have
a loop that reads both of these files from the GFS2 mount point. The
reads on node C show the latest values written by A and B, and stay up
to date. All good so far.
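For concreteness, the writer loop on A and B is essentially the
following (a simplified sketch; the mount point and file name are
illustrative). The reader on C just re-reads both files in a loop and
prints the latest values.

/* Simplified sketch of the writer loop on node A (node B is the same
 * with a different file). The path below is illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT needs a block-aligned buffer and a block-sized write. */
    const size_t blksz = 4096;
    char *buf;
    if (posix_memalign((void **)&buf, blksz, blksz))
        return 1;

    int fd = open("/mnt/gfs2/node-a",
                  O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (unsigned long i = 0; ; i++) {
        memset(buf, 0, blksz);
        snprintf(buf, blksz, "%lu\n", i);
        /* Always overwrite the same block at offset 0; this is the call
         * I expect to block forever once node A is cut off. */
        if (pwrite(fd, buf, blksz, 0) != (ssize_t)blksz) {
            perror("pwrite");
            return 1;
        }
        printf("wrote %lu\n", i);
        sleep(1);
    }
}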
I then cause node A to lose the corosync heartbeat by blocking inbound
corosync and DLM traffic on node A:
iptables -I INPUT -p udp --dport 5404 -j DROP
iptables -I INPUT -p udp --dport 5405 -j DROP
iptables -I INPUT -p tcp --dport 21064 -j DROP
After a few seconds, I normally observe that all I/O to the GFS2
filesystem hangs indefinitely on node A: the latest value read by node C
is the last value that node A successfully wrote. This is exactly the
behaviour I want -- I want to be sure that node A never completes I/O
that cannot be seen by the other nodes.
However, on some occasions, I observe that node A continues in the
loop, believing that it is successfully writing to the file, but,
according to node C, the file stops being updated. (Meanwhile, the file
written by node B continues to be up to date as read by C.) This is
concerning -- it looks like writes are completing on node A even though
the other nodes in the cluster cannot see the results.
I performed this test 20 times, rebooting node A between runs, and saw
the "I/O hangs" behaviour 16 times and the "I/O appears to continue"
behaviour 4 times. I couldn't identify anything that would explain why
it sometimes adopts one behaviour and sometimes the other.
So... is this expected? Should I be able to rely upon I/O hanging? Or
have I misconfigured something? Advice would be appreciated.
Thanks,
Jonathan
Notes:
* The I/O from node A uses an fd opened with O_DIRECT|O_SYNC, so the
page cache is not involved.
* Versions: corosync 2.3.4, dlm_controld 4.0.2, gfs2 as per RHEL 7.2.
* I don't see anything particularly useful being logged. Soon after I
insert the iptables rules on node A, I see the following on node A:
2016-04-15T14:15:45.608175+00:00 localhost corosync[3074]: [TOTEM ] The token was lost in the OPERATIONAL state.
2016-04-15T14:15:45.608191+00:00 localhost corosync[3074]: [TOTEM ] A processor failed, forming new configuration.
2016-04-15T14:15:45.608198+00:00 localhost corosync[3074]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
Around the time node C sees the output from node A stop changing, node A
reports:
2016-04-15T14:15:58.388404+00:00 localhost corosync[3074]: [TOTEM ] entering GATHER state from 0(consensus timeout).
* corosync.conf:
totem {
    version: 2
    secauth: off
    cluster_name: 1498d523
    transport: udpu
    token_retransmits_before_loss_const: 10
    token: 10000
}
logging {
    debug: on
}
quorum {
    provider: corosync_votequorum
}
nodelist {
    node {
        ring0_addr: 10.220.73.6
    }
    node {
        ring0_addr: 10.220.73.7
    }
    node {
        ring0_addr: 10.220.73.3
    }
}
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster