It looks like the disconnection happened in the middle of a write transaction (after the lock phase, before the unlock phase), and the server's detection of the client disconnection (via TCP keepalive) did not happen before the client reconnected. The client, having witnessed the reconnection, assumed the locks had been relinquished by the server. The server, however, having noticed the same client's reconnection before the breakage of the original connection, has not released the held locks. This explains why self-heal happens only after the first client unmounts while connectivity is fine: the removal of the locks on the inode permits self-heal to proceed.

Tuning the server-side TCP keepalive to a smaller value should fix this problem. Can you please verify?

Thanks,
Avati

On Wed, Jun 22, 2011 at 7:31 PM, Darren Austin <darren-lists at widgit.com> wrote:

> Hi,
> I've been evaluating GlusterFS (3.2.0) for a small replicated cluster set
> up on Amazon EC2, and I think I've found what might be a bug, or some sort
> of unexpected behaviour, during the self-heal process.
>
> Here's the volume info[1]:
> Volume Name: test-volume
> Type: Replicate
> Status: Started
> Number of Bricks: 2
> Transport-type: tcp
> Bricks:
> Brick1: 1.2.3.4:/data
> Brick2: 1.2.3.5:/data
>
> I've not configured any special volume settings or modified any .vol files
> by hand, and the glusterd.vol file is the one installed from the source
> package - so it's a pretty bog-standard set-up I'm testing.
>
> I've been simulating the complete hard failure of one of the servers
> within the cluster (IP 1.2.3.4) in order to test the replication recovery
> side of Gluster. From a client, I'm copying a few (pre-made) large files
> (1GB+) of random data onto the mount, and part way through I use iptables
> on the server at IP 1.2.3.4 to simulate it falling off the planet
> (basically dropping ALL outgoing and incoming packets from all the
> clients/peers).
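[The hard-failure simulation described above could be sketched roughly as follows; this is not from the original post, and the addresses are the thread's placeholder IPs. Run as root on the server being "failed":]

```shell
# Simulate a hard failure of this server: silently drop ALL traffic
# to and from the other peer and the clients, in both directions.
# (Inserting at position 1 puts these rules ahead of any existing ones.)
iptables -I INPUT  1 -j DROP    # drop everything arriving
iptables -I OUTPUT 1 -j DROP    # drop everything leaving

# To "bring the server back", remove the two rules again:
#   iptables -D INPUT 1
#   iptables -D OUTPUT 1
```

[DROP (rather than REJECT) matters for reproducing this report: peers see silence, not connection resets, so failure detection is left to TCP keepalive timeouts - which is exactly the mechanism Avati's reply points at.]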
>
> The clients seem to handle this fine - after a short pause in the copy,
> they continue to write the data to the second replica server, which
> dutifully stores it. An md5sum of the files from the clients shows they
> are getting the complete file back from the (one remaining) server in the
> cluster - so all is good thus far :)
>
> Now, when I pull the firewall down on the Gluster server I took down
> earlier (allowing clients and peers to communicate with it again), that
> server has only some of the files which were copied, plus *part* of a
> file which it received before it got disconnected.
>
> The client logs show that a self-heal process has been triggered, but
> nothing seems to happen *at all* to bring the replicas back into sync. So
> I tested a few things in this situation, to discover what the procedure
> might be to recover from this once we have a live system.
>
> On the client, I go into the Gluster-mounted directory and do an
> 'ls -al'. This triggers a partial re-sync of the brick on the peer which
> was inaccessible for a while - the missing files are created in the brick
> at ZERO size; no data is transferred from the other replica into those
> files, and the partial file which that brick holds does not have any of
> the missing part copied into it.
>
> The 'ls -al' on the client lists ALL the files that were copied into the
> cluster (as you'd expect), and the files have the correct size
> information except for one - the file which was being actively written
> when I downed the peer at IP 1.2.3.4.
> That file's size is listed as the partial size held on the disconnected
> peer - it is not reporting the full size as held by the peer with the
> complete file. However, an md5sum of the file is correct - the whole file
> is being read back from the peer which has it, even though the size
> information is wrong. A stat, touch or any other access of that file does
> not cause it to be synced to the brick which only has the partial copy.
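[One way to see whether Gluster still considers such a file pending heal - not something done in the thread, but a standard AFR diagnostic - is to read the replication changelog extended attributes directly off each brick. The path and filename below are hypothetical; run as root against the brick directory, not the client mount:]

```shell
# AFR (the replicate translator) tracks pending operations per replica
# in trusted.afr.* extended attributes on each brick's copy of the file.
# Dump them in hex on both bricks and compare:
getfattr -d -m 'trusted.afr' -e hex /data/bigfile.bin
# All-zero changelog values mean the replicas agree; non-zero counters
# mean writes are still recorded as pending against the other brick.
```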
>
> I now try the 'self-heal' trigger as documented on the website. A bit
> more success! All the zero-sized files on the peer at 1.2.3.4 now have
> data copied into them from the brick which has the full set of files.
> All the files are now in sync between the bricks except one - the partial
> file which was being written to at the time the peer went down. The peer
> at 1.2.3.4 still only has the partial file, the peer at 1.2.3.5 has the
> full file, and all the clients report the size as being the partial size
> held by the peer at 1.2.3.4 - but they can md5sum the file and get the
> correct result.
> No matter how much that file is accessed, it will not sync over to the
> other peer.
>
> So I tried a couple more things to see if I could trigger the sync. From
> another client (NOT the one which performed the copy of files onto the
> cluster), I umount'ed and re-mount'ed the volume. Further stat's,
> md5sum's, etc. still do not trigger the sync.
>
> However, if I umount and re-mount the volume on the client which actually
> performed the copy, then as soon as I do an 'ls' in the directory
> containing that file, the sync begins. I don't even have to touch the
> file itself - a simple 'ls' on the directory is all it takes to trigger
> it. The size of the file is then correctly reported to the client as
> well.
>
> This isn't a split-brain situation, since the file on the peer at 1.2.3.4
> was NOT being modified while it was out of the cluster - it just has one
> or two whole files from the client, plus a partial one cut off during
> transfer.
>
> I'd be very grateful if someone could confirm whether this is expected
> behaviour of the cluster or not.
>
> To me, it seems unthinkable that a volume would have to be triggered to
> repair (with the find/stat commands), plus be umount'ed and re-mount'ed
> by the exact client which was writing the partial file at the time, in
> order to force it to be sync'ed.
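[For readers following along: the find/stat "self-heal trigger" referred to above is the one documented for GlusterFS of this era - walking the client mount and stat'ing every file so the replicate translator examines each one. A minimal sketch, with the mount point as a parameter:]

```shell
# Trigger self-heal checks across a replicated volume by stat'ing every
# file under the client-side mount point.  With the AFR translator, each
# lookup/stat forces Gluster to compare the replicas of that file.
trigger_selfheal() {
    find "$1" -noleaf -print0 | xargs --null stat >/dev/null
}

# Usage (mount point is a placeholder):
#   trigger_selfheal /mnt/test-volume
```

[Note this only makes Gluster *check* each file; as Darren's report shows, a file the first client still holds locks on will not actually heal until those locks go away.]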
>
> If this is a bug, it's a pretty impressive one in terms of the
> reliability of the cluster - what would happen if the peer which DOES
> have the full file goes down before the above procedure is complete? The
> first peer still only has the partial file, yet the clients will believe
> the whole file has been written to the volume - causing an inconsistent
> state and possible data corruption.
>
> Thanks for reading such a long message - please let me know if you need
> any more info to help explain why it's doing this! :)
>
> Cheers,
>
> Darren.
>
> [1] - Please, please can you make 'volume status' an alias for 'volume
> info', and 'peer info' an alias for 'peer status'?! I keep typing them
> the wrong way around! :)
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
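[Avati's suggested fix at the top of the thread - shortening the server-side TCP keepalive so a silently dead client connection is detected, and its locks released, sooner - might look something like this on a Linux server. The values are illustrative assumptions only, not tested recommendations:]

```shell
# Shorten TCP keepalive detection on the Gluster server (run as root).
# With the defaults (7200s idle + 9 probes at 75s), a silently dead
# connection can linger for over two hours.
sysctl -w net.ipv4.tcp_keepalive_time=30    # idle seconds before the first probe
sysctl -w net.ipv4.tcp_keepalive_intvl=5    # seconds between unanswered probes
sysctl -w net.ipv4.tcp_keepalive_probes=3   # failed probes before the kernel drops the connection

# To persist across reboots, put the same keys in /etc/sysctl.conf.
```

[These sysctls are system-wide and only apply where the application has enabled SO_KEEPALIVE on its sockets, as the Gluster server does here; lowering them trades faster dead-peer detection against more spurious disconnects on flaky links.]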