Hello,
So I resolved my previous issue with split-brains and the lack
of self-healing by downgrading my installed glusterfs* packages
from 3.6.2 to 3.5.3, but now I've hit a new issue, which makes
normal use of the volume practically impossible.
A little background for those not already paying close
attention:
I have a 2-node, 2-brick replicated volume whose purpose in
life is to hold iSCSI target image files, primarily to provide
datastores to a VMware ESXi cluster. The plan is to put a
handful of image files on the Gluster volume, mount the volume
locally on both Gluster nodes, and run tgtd on both, pointed at
the image files on the mounted Gluster volume. The ESXi boxes
will then use multipath (active/passive) iSCSI to connect to
the nodes, with automatic failover in case of planned or
unplanned downtime of either Gluster node.
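For reference, the setup is roughly along these lines; the
volume name, brick paths, target IQN, and the second node's
hostname ("apollo") below are placeholders rather than the real
values:

    # create and start the 2-brick replica 2 volume (run on one node)
    gluster volume create gv0 replica 2 duke:/export/brick1 apollo:/export/brick1
    gluster volume start gv0

    # /etc/tgt/targets.conf on each node, with the backing store
    # living on the locally mounted Gluster volume
    <target iqn.2015-03.net.example:datastore1>
        backing-store /mnt/gluster_disk/datastore1.img
    </target>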
In my most recent round of testing with 3.5.3, writes to the
volume start failing en masse after about 5-10 minutes, so I've
simplified the scenario a bit (to minimize the variables): both
Gluster nodes up, only one node (duke) with the volume mounted
and running tgtd, and just regular (single-path) iSCSI from a
single ESXi server.
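For what it's worth, the volume is mounted on duke with the
normal FUSE client, something like this (again, gv0 stands in
for the actual volume name):

    mount -t glusterfs duke:/gv0 /mnt/gluster_disk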
About 5-10 minutes into migrating a VM onto the test
datastore, /var/log/messages on duke gets blasted with a ton
of messages exactly like this:
Mar 15 22:24:06 duke tgtd: bs_rdwr_request(180) io error 0x1781e00 2a -1 512 22971904, Input/output error
And /var/log/glusterfs/mnt-gluster_disk.log gets blasted with
a ton of messages exactly like this:
[2015-03-16 02:24:07.572279] W [fuse-bridge.c:2242:fuse_writev_cbk] 0-glusterfs-fuse: 635299: WRITE => -1 (Input/output error)