Unexpected behaviour during replication heal

darren-lists at widgit.com (Darren Austin) · Wed, 22 Jun 2011 15:01:03 +0100 (BST)

Hi,
  I've been evaluating GlusterFS (3.2.0) for a small replicated cluster set up on Amazon EC2, and I think I've found what might be a bug or some sort of unexpected behaviour during the self-heal process.

Here's the volume info[1]:
Volume Name: test-volume
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 1.2.3.4:/data
Brick2: 1.2.3.5:/data

I've not configured any special volume settings or modified any .vol files by hand, and the glusterd.vol file is the one installed from the source package - so it's a pretty bog-standard setup I'm testing.

I've been simulating the complete hard failure of one of the servers within the cluster (IP 1.2.3.4) in order to test the replication recovery side of Gluster.  From a client I'm copying a few (pre-made) large files (1GB+) of random data onto the mount, and part way through I use iptables on the server at IP 1.2.3.4 to simulate it falling off the planet (basically dropping ALL outgoing and incoming packets to and from all the clients/peers).
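For anyone wanting to repeat the outage, here is a minimal sketch of how the firewall rules could be generated. The exact rules used in the test are not given above, so this is an assumption: one INPUT and one OUTPUT DROP rule per remote address. The function name `isolation_rules` and the client address in the example are hypothetical.

```python
from typing import List


def isolation_rules(addresses: List[str], action: str = "-A") -> List[List[str]]:
    """Build iptables commands that drop all traffic to and from each address.

    action is "-A" to append the rules (start the simulated outage) or
    "-D" to delete the same rules again (end the outage).
    """
    rules = []
    for addr in addresses:
        # Drop everything arriving from this client/peer.
        rules.append(["iptables", action, "INPUT", "-s", addr, "-j", "DROP"])
        # Drop everything this server tries to send back to it.
        rules.append(["iptables", action, "OUTPUT", "-d", addr, "-j", "DROP"])
    return rules


if __name__ == "__main__":
    # Hypothetical addresses: the other brick plus one client.
    for cmd in isolation_rules(["1.2.3.5", "10.0.0.10"]):
        print(" ".join(cmd))
```

The commands are only printed here; to apply them they would have to be run as root on the server being isolated (for example with `subprocess.run(cmd, check=True)`), and removed again by generating the same list with `action="-D"`.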

The clients seem to handle this fine - after a short pause in the copy, they continue to write the data to the second replicate server, which dutifully stores the data.  An md5sum of the files from the clients shows they are getting the complete file back from the (one) server in the cluster - so all is good thus far :)

Now, when I pull the firewall down on the gluster server I took down earlier (allowing clients and peers to communicate with it again), that server has only some of the files which were copied and *part* of a file which it received before it got disconnected.
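The comparison of what each brick holds can be made repeatable with a small script. The sketch below is not part of GlusterFS; it assumes you run `brick_manifest` against the brick directory (/data here) on each server and then compare the two results on one machine. All function names are hypothetical.

```python
import hashlib
import os
from typing import Dict, List, Tuple

# Relative path -> (size in bytes, md5 hex digest)
Manifest = Dict[str, Tuple[int, str]]


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so 1GB+ files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def brick_manifest(brick_root: str) -> Manifest:
    """Record size and md5 of every regular file below a brick directory."""
    manifest: Manifest = {}
    for dirpath, _dirnames, filenames in os.walk(brick_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.isfile(full):
                continue  # skip broken symlinks and special files
            rel = os.path.relpath(full, brick_root)
            manifest[rel] = (os.path.getsize(full), file_md5(full))
    return manifest


def out_of_sync(left: Manifest, right: Manifest) -> List[str]:
    """Return paths that are missing on one side or differ in size/md5."""
    return sorted(
        path
        for path in set(left) | set(right)
        if left.get(path) != right.get(path)
    )
```

In the situation described here, `out_of_sync` would be expected to list the files that never reached 1.2.3.4 plus the one partial file.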

The client logs show that a self-heal process has been triggered, but nothing seems to happen *at all* to bring the replicas back into sync.  So I tested a few things in this situation to discover what the procedure might be to recover from this once we have a live system.

On the client, I go into the gluster mounted directory and do an 'ls -al'.  This triggers a partial re-sync of the brick on the peer which was inaccessible for a while - the missing files are created in the brick as ZERO size; no data is transferred from the other replica into those files and the partial file which that brick holds does not have any of the missing part copied into it.

The 'ls -al' on the client lists ALL the files that were copied into the cluster (as you'd expect), and the files have the correct size information except for one - the file which was being actively written when I downed the peer at IP 1.2.3.4.
That file's size is listed as the partial size of the file held on the disconnected peer - it is not reporting the full size as held by the peer with the complete file.  However, an md5sum of the file is correct - the whole file is being read back from the peer which has it, even though the size information is wrong.  A stat, touch or any other access of that file does not cause it to be synced with the brick which only has the partial copy.
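The mismatch between the reported size and the data actually returned can be checked from a client with something like the following. This is only a sketch: `check_file` is a hypothetical helper that compares the size from stat with the number of bytes a full read returns, and computes the md5 along the way.

```python
import hashlib
import os
from typing import NamedTuple


class ReadCheck(NamedTuple):
    stat_size: int   # size the filesystem reports for the file
    bytes_read: int  # bytes actually returned when reading to EOF
    md5: str         # md5 of the bytes that were read


def check_file(path: str, chunk_size: int = 1 << 20) -> ReadCheck:
    """Compare the stat size of a file with what a full read returns."""
    stat_size = os.stat(path).st_size
    digest = hashlib.md5()
    bytes_read = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            bytes_read += len(chunk)
            digest.update(chunk)
    return ReadCheck(stat_size, bytes_read, digest.hexdigest())
```

On a healthy file `stat_size` and `bytes_read` agree. For the file described above, the expectation would be a `stat_size` equal to the partial size while `bytes_read` and `md5` match the complete file.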

I now try the 'self-heal' trigger as documented on the website.  A bit more success!  All the zero sized files on the peer at 1.2.3.4 are now having data copied into them from the brick which has the full set of files.
All the files are now in sync between the bricks except one - the partial file which was being written to at the time the peer went down.  The peer at 1.2.3.4 still only has the partial file, the peer at 1.2.3.5 has the full file, and all the clients report the size as being the partial size held by the peer at 1.2.3.4, but can md5sum the file and get the correct result.
No matter how much that file is accessed, it will not sync over to the other peer.
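For reference, the documented trigger amounts to walking the client mount and stat'ing every entry (a find over the mount point piped into stat). A rough Python equivalent is sketched below; `stat_everything` is a hypothetical name, and whether a stat from Python prompts exactly the same heal as the documented command is an assumption, not something tested here.

```python
import os


def stat_everything(mount_point: str) -> int:
    """lstat the mount point and every entry below it; return how many were stat'ed."""
    os.lstat(mount_point)
    count = 1
    for dirpath, dirnames, filenames in os.walk(mount_point):
        for name in dirnames + filenames:
            try:
                # The lookup/stat on each path is what the heal trigger relies on.
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                # An entry may vanish or be unreachable mid-walk; keep going.
                pass
    return count
```

It would be run against the client mount point, never against a brick directory directly.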

So I tried a couple more things to see if I could trigger the sync... From another client (NOT the one which performed the copy of files onto the cluster), I umount'ed and re-mount'ed the volume.  Further stats, md5sums, etc. still do not trigger the sync.

However, if I umount and re-mount the volume on the client which actually performed the copy procedure, then as soon as I do an ls in the directory with that file in it, the sync begins.  I don't even have to touch the file itself - a simple ls on the directory is all it takes to trigger it.  The size of the file is then correctly reported to the client as well.

This isn't a split-brain situation since the file on the peer at 1.2.3.4 is NOT being modified while it's out of the cluster - it's just got one or two whole files from the client, plus a partial one cut off during transfer.
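One way the "not split-brain" claim could be checked on the bricks is by looking at the replicate changelog extended attributes (the trusted.afr.* attributes, readable as root with something like `getfattr -d -m trusted.afr -e hex <file-on-brick>`). The sketch below only decodes such a value. It assumes the usual layout of three big-endian 32-bit pending counters (data, metadata, entry), and the split-brain test is a simplification that looks at the data counter only. The function names are hypothetical.

```python
import struct
from typing import NamedTuple


class Changelog(NamedTuple):
    data: int      # pending data operations
    metadata: int  # pending metadata operations
    entry: int     # pending directory-entry operations


def decode_changelog(hex_value: str) -> Changelog:
    """Decode a 12-byte changelog value given as hex, with or without '0x'."""
    text = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    return Changelog(*struct.unpack(">III", raw))


def looks_like_split_brain(a_about_b: Changelog, b_about_a: Changelog) -> bool:
    """True only if each brick records pending data changes against the other."""
    return a_about_b.data > 0 and b_about_a.data > 0
```

In the scenario above one would expect only the copy on 1.2.3.5 to record pending changes against 1.2.3.4, so `looks_like_split_brain` should come back False.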

I'd be very grateful if someone could confirm if this is expected behaviour of the cluster or not?

To me, it seems unthinkable that a volume would have to be triggered to repair (with the find/stat commands), and also be umount'ed and re-mount'ed by the exact client which was writing the partial file at the time, in order to force it to be sync'ed.

If this is a bug, it's a pretty impressive one in terms of the reliability of the cluster - what would happen if the peer which DOES have the full file goes down before the above procedure is complete?  The first peer still only has the partial file, yet the clients will believe the whole file has been written to the volume - causing an inconsistent state and possible data corruption.

Thanks for reading such a long message - please let me know if you need any more info to help explain why it's doing this! :)

Cheers,

Darren.

[1] - Please, please can you make 'volume status' an alias for 'volume info', and 'peer info' an alias for 'peer status'?!  I keep typing them the wrong way around! :)