Recovering out of sync nodes from input/output error

jdarcy at redhat.com (Jeff Darcy) · Thu, 12 Apr 2012 08:49:45 -0400

On 04/11/2012 07:00 AM, Alex Florescu wrote:
> Simulation follows:
> step 1
> node1:
> iptables -I INPUT 1 -s 10.0.2.15 -j DROP (connectivity loss simulation)
> touch /a/howareyou
> 
> node2:
> touch /a/hello
> 
> step 2
> node1:
> iptables -D INPUT 1 (connectivity recovery)
> ls /a
> ls: cannot access /a: Input/output error
> 
> node2:
> ls /a
> ls: cannot access /a: Input/output error

I was able to reproduce this on my own setup using packages built from git,
which was a bit of a surprise TBH.  I'll look into it, but here are some
observations that might suggest workarounds.

(1) To a first approximation, it should be safe to "merge" directory contents
despite there being a split-brain problem, by healing any file that exists on
only one brick from there to its peer(s).  This contrasts with the case for
file contents, where - as Robert points out - we can't determine the correct
thing to do and would risk overwriting data.  Directory entries differ from
file contents in a small but important way: they're sets, not arrays.  If
something's not in the set, there's no danger that adding it will overwrite
anything.
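
To make the "sets, not arrays" point concrete, here is a minimal sketch in
Python.  It is only an illustration, not the self-heal code: the function
name merge_entries is made up, and each brick's directory is modeled as
nothing more than a set of names.

```python
def merge_entries(brick_a, brick_b):
    """Illustrative only: model each brick's directory as a set of names.

    Returns the merged listing plus the entries each side would need to
    receive.  Nothing is ever removed or replaced, which is why a merge
    of entries cannot overwrite anything.
    """
    merged = brick_a | brick_b       # set union: adding never clobbers
    heal_to_a = brick_b - brick_a    # entries missing on brick A
    heal_to_b = brick_a - brick_b    # entries missing on brick B
    return merged, heal_to_a, heal_to_b


# The state from the simulation after step 1.
node1 = {"howareyou"}
node2 = {"hello"}
merged, to_node1, to_node2 = merge_entries(node1, node2)
print(sorted(merged), sorted(to_node1), sorted(to_node2))
```

Contrast this with file contents, where the two versions compete for the
same byte offsets and one of them has to lose.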

(2) That said, the case you've created is indistinguishable from the case where
"hello" and "howareyou" used to exist on both bricks and each side *deleted* a
different one while they couldn't communicate.  Unconditionally recreating the
files would
effectively undo those deletes, which many would consider an error as serious
as overwriting data.  It would not be valid for such merge behavior to kick in
unconditionally.  At the very least, there should be a configuration option for it.
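
The ambiguity can be shown with a small sketch.  Again this is only an
illustration under the same simplified model (a directory is a set of names);
final_state is a made-up helper, not anything in GlusterFS.

```python
def final_state(initial, created=(), deleted=()):
    """Illustrative only: directory contents after a partitioned node
    applies its local creates and deletes."""
    return (set(initial) | set(created)) - set(deleted)


# History A (the simulation): empty directory, each node creates one file.
a_node1 = final_state([], created=["howareyou"])
a_node2 = final_state([], created=["hello"])

# History B: both files existed everywhere, each node deletes a different one.
both = ["hello", "howareyou"]
b_node1 = final_state(both, deleted=["hello"])
b_node2 = final_state(both, deleted=["howareyou"])

# The bricks end up in exactly the same state either way, so looking at
# the directory contents alone cannot tell creates from deletes.
same = (a_node1, a_node2) == (b_node1, b_node2)
print(same)
```

A merge that is right for history A resurrects deleted files in history B,
which is why it would have to be opt-in.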

(3) The reason you continue to get I/O errors is probably that the xattrs on
the *parent directory* still indicate pending operations on both sides.  You
can verify this by dumping the AFR changelog xattrs on each brick:

	getfattr -d -m trusted.afr -e hex /a

The format of this value is described here:

	http://hekafs.org/index.php/2011/04/glusterfs-extended-attributes/

If any of the counters is non-zero (most likely the last four-byte integer,
which counts directory-entry operations) then that confirms the theory.  It
should be safe
for the self-heal code to clear these counts if (and only if) the directories
are checked and found identical.  In fact, I think we already do this.  Thus,
manual copying of files followed by self-heal on the parent directory should
make the errors go away.  I encourage you to try that while I go look at the code.
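
If it helps to read the hex value, here is a minimal Python sketch that splits
it into counters.  It assumes the layout described at the link above: twelve
bytes holding three big-endian 32-bit counters, in the order data, metadata,
entry.  The function name and the sample value are made up for illustration;
check the output against your own getfattr results.

```python
import struct


def parse_pending(hex_value):
    """Split a changelog xattr value, as printed by `getfattr -e hex`,
    into its counters.

    Assumed layout (not verified against the code): three big-endian
    32-bit integers in the order data, metadata, entry.
    """
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# Hypothetical value: two pending directory-entry operations, nothing else.
sample = "0x" + "00000000" + "00000000" + "00000002"
counts = parse_pending(sample)
print(counts)
```

A non-zero "entry" count on both bricks, each blaming the other, is the
situation described in (3).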