file locked / inaccessible if auto-heal required & confusing log messages (1.4rc3)

dma+gluster at witbe.net (Daniel Maher) · Wed, 17 Dec 2008 15:40:44 +0100

Hello,

I recently upgraded my infrastructure from a 1.3.12 server-based AFR 
cluster to a 1.4rc3 client-based AFR cluster.  Among other things, I 
have noticed one very obvious change in the behaviour of self-healing 
between the two setups...

The scenario is basic : one of the server nodes becomes inaccessible, 
and as a result, changes to a given file are not replicated.  When the 
downed node returns, and the file is accessed, the self-heal feature is 
triggered, thus ensuring the integrity of the data across all server nodes.

So far so good ; however, between the previous setup and the current 
one, « something » has resulted in differing behaviour vis-à-vis the 
availability of said file.

In the previous 1.3 server-based AFR setup, if a client attempted to 
write to the file, it was able to do so, with the change being 
replicated to the newly-returned node as part of the self-heal process.  
Perfect.

However, in the current 1.4 client-based AFR setup, if a client attempts 
to write to the file, instead of Gluster accepting the write and 
propagating the change during the self-heal, the file becomes 
momentarily inaccessible.  The self-heal process is then triggered, and 
the file - without the current attempted write - is replicated. 
Subsequent accesses are successful (and replicate as expected), but that 
« triggering write » still fails the first time.
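
To illustrate the symptom (this is not a fix for the underlying 
issue), here is a minimal client-side sketch in Python.  It assumes the 
failed open surfaces as EIO, as the log excerpt suggests, and that a 
second attempt succeeds once the self-heal has completed.  The function 
name, its parameters and the retry policy are hypothetical and are not 
part of Gluster.

```python
import errno
import time


def append_with_retry(path, data, retries=1, delay=0.5, opener=open):
    """Append data to path, retrying when the open fails with EIO.

    Returns the number of attempts that were needed.
    `opener` is injectable so the behaviour can be exercised without
    a real Gluster mount (hypothetical helper, for illustration only).
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            with opener(path, "a") as fh:
                fh.write(data)
            return attempt
        except OSError as exc:
            # Only EIO (Input/output error) is treated as the transient
            # failure seen on the triggering access; any other error,
            # or running out of retries, is re-raised to the caller.
            if exc.errno != errno.EIO or attempt > retries:
                raise
            time.sleep(delay)
```

With retries=1, a write that hits the behaviour described above would 
fail once, wait, and then succeed on the second attempt; an application 
that cannot retry (such as the Tomcat copy in the excerpt below) simply 
sees the first error.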

Furthermore, the log entry related to this particular process is 
confusing (log excerpt below).  It follows the form :
1. Self-heal triggered
2. Unable to resolve conflicting data
3. Self-heal completed
4. File not found

The reported conflict does not, in fact, appear to affect the self-heal, 
in that the file is replicated as expected.  Is the error itself 
erroneous, or is there actually a problem ?  Furthermore, even though 
the file clearly exists, and has in fact just been replicated, Gluster 
then throws an error on OPEN.

This can't possibly be the expected behaviour.  What within the 
underlying infrastructure has changed ?  How can it be fixed ?

Some log snippets :

Tomcat (on client)
---------
[Thread-25]09:32:09,398 ERROR: Error in copyfile.
java.io.FileNotFoundException: /glusterfs/some/directory/somefile.txt 
(Input/output error)
---------

glusterfs.log (on client)
---------
2008-12-17 09:32:09 W [afr-self-heal-common.c:1005:afr_self_heal] 
nasdash-afr: performing self heal on 
/glusterfs/some/directory/somefile.txt (metadata=0 data=1 entry=0)
2008-12-17 09:32:09 E [afr-self-heal-data.c:777:afr_sh_data_fix] 
nasdash-afr: Unable to resolve conflicting data of 
/glusterfs/some/directory/somefile.txt. Please resolve manually by 
deleting the file /glusterfs/some/directory/somefile.txt from all but 
the preferred subvolume
2008-12-17 09:32:09 W [afr-self-heal-data.c:70:afr_sh_data_done] 
nasdash-afr: self heal of /glusterfs/some/directory/somefile.txt completed
2008-12-17 09:32:09 E [fuse-bridge.c:662:fuse_fd_cbk] glusterfs-fuse: 
189804: OPEN() /glusterfs/some/directory/somefile.txt => -1 
(Input/output error)
---------

Comments ?

-- 
Daniel Maher <dma+gluster AT witbe DOT net>