file locked / inaccessible if auto-heal required & confusing log messages (1.4rc3)

Perhaps an option in afr to automatically handle
such situations would be helpful?
Something along the lines of:

option split-brain-master-override on
option split-brain-master-override-volume (BRICK NAME)

or:

option split-brain-preferred-mirror (BRICK NAME)
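
In a client volfile that might look something like this (the option
name is only my proposal, and the volume/brick names are made up):

volume afr0
  type cluster/afr
  # hypothetical option -- does not exist in any current release
  option split-brain-preferred-mirror brick1
  subvolumes brick1 brick2
end-volume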

In some cases it's just not practical to have to
go sifting through logs to find these cases.
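
Right now the only way I know to find them is to grep the client log
for the conflict message, e.g. (the log path depends on how glusterfs
was started):

  grep 'Unable to resolve conflicting data' /var/log/glusterfs/client.log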

Also, it seems from what Daniel said that subsequent accesses to the
file are auto-healed. Does this mean that the data changed on one of
the subvols, thus making it the newest version, so that auto-healing
works again?

Keith

At 10:02 AM 12/17/2008, Krishna Srinivas wrote:
>Daniel,
>
>Here the "selfheal complete" actually means "selfheal completed
>unsuccessfully". It does not heal the file, and open returns an error.
>The healing code detects a conflicting case when it sees that both
>subvols claim to be the latest and say the other is outdated. We see
>this happen when there is a split-brain situation (the network between
>AFR servers is broken and different clients write to each AFR
>independently) or, in a very rare case, when one of the servers goes
>down right when a write operation is happening. I think you have hit
>the 2nd case. Here AFR cannot really decide which subvol has the
>latest version, hence it leaves it to the discretion of the user.
>Earlier 1.3 AFR did not handle the split-brain situation, hence you
>did not see this.
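>
>For reference, each subvol keeps changelog extended attributes on its
>backend copy of the file, and those are what the healing code
>compares. You can inspect them on the servers with something like this
>(the backend path here is only an example, and the exact attribute
>names differ between releases):
>
>  # run as root on each server, against the backend export (not the mount)
>  getfattr -d -m trusted.afr -e hex /export/brick/some/directory/somefile.txt
>
>To resolve manually, delete the backend copy from every subvol except
>the one you trust, then access the file from a client to re-trigger
>the heal:
>
>  # on the non-preferred servers only
>  rm /export/brick/some/directory/somefile.txt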
>
>Krishna
>
>On Wed, Dec 17, 2008 at 8:10 PM, Daniel Maher <dma+gluster at witbe.net> wrote:
> > Hello,
> >
> > I recently upgraded my infrastructure from a 1.3.12 server-based AFR
> > cluster to a 1.4rc3 client-based AFR cluster.  Among other things, I
> > have noticed one very obvious change in the behaviour of self-healing
> > between the two setups...
> >
> > The scenario is basic : one of the server nodes becomes inaccessible,
> > and as a result, changes to a given file are not replicated.  When the
> > downed node returns, and the file is accessed, the self-heal feature is
> > triggered, thus ensuring the integrity of the data across all server nodes.
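> >
> > (As an aside, the heal can be forced across an entire tree simply by
> > reading a byte from every file from a client ; something along these
> > lines, where /glusterfs is the mount point :)
> >
> >   find /glusterfs -type f -exec head -c1 '{}' \; > /dev/null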
> >
> > So far so good ; however, between the previous setup and the current
> > one, « something » has resulted in differing behaviour vis-à-vis the
> > availability of said file.
> >
> > In the previous 1.3 server-based AFR setup, if a client attempted to
> > write to the file, it was able to do so, with the change being
> > replicated to the newly-returned node as part of the self-heal process.
> >  Perfect.
> >
> > However, in the current 1.4 client-based AFR setup, if a client attempts
> > to write to the file, instead of Gluster accepting the write and
> > propagating the change during the self-heal, the file becomes
> > momentarily inaccessible.  The self-heal process is then triggered, and
> > the file - without the current attempted write - is replicated.
> > Subsequent accesses are successful (and replicate as expected), but that
> > « triggering write » still fails the first time.
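> >
> > From the shell, the sequence looks roughly like this (the first write
> > fails with an I/O error ; the retry succeeds) :
> >
> >   $ echo x >> /glusterfs/some/directory/somefile.txt
> >   -bash: /glusterfs/some/directory/somefile.txt: Input/output error
> >   $ echo x >> /glusterfs/some/directory/somefile.txt
> >   $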
> >
> > Furthermore, the log entry related to this particular process is
> > confusing (log excerpt below).  It follows the form :
> > 1. Self-heal triggered
> > 2. Unable to resolve conflicting data
> > 3. Self-heal completed
> > 4. File not found
> >
> > The reported conflict does not, in fact, appear to affect the self-heal,
> > in that the file is replicated as expected.  Is the error itself
> > erroneous, or is there actually a problem ?  Furthermore, even though
> > the file clearly exists, and has in fact just been replicated, Gluster
> > nonetheless throws an error on OPEN.
> >
> > This can't possibly be the expected behaviour.  What within the
> > underlying infrastructure has changed ?  How can it be fixed ?
> >
> >
> > Some log snippets :
> >
> > Tomcat (on client)
> > ---------
> > [Thread-25]09:32:09,398 ERROR: Error in copyfile.
> > java.io.FileNotFoundException: /glusterfs/some/directory/somefile.txt
> > (Input/output error)
> > ---------
> >
> > glusterfs.log (on client)
> > ---------
> > 2008-12-17 09:32:09 W [afr-self-heal-common.c:1005:afr_self_heal]
> > nasdash-afr: performing self heal on
> > /glusterfs/some/directory/somefile.txt (metadata=0 data=1 entry=0)
> > 2008-12-17 09:32:09 E [afr-self-heal-data.c:777:afr_sh_data_fix]
> > nasdash-afr: Unable to resolve conflicting data of
> > /glusterfs/some/directory/somefile.txt. Please resolve manually by
> > deleting the file /glusterfs/some/directory/somefile.txt from all but
> > the preferred subvolume
> > 2008-12-17 09:32:09 W [afr-self-heal-data.c:70:afr_sh_data_done]
> > nasdash-afr: self heal of /glusterfs/some/directory/somefile.txt completed
> > 2008-12-17 09:32:09 E [fuse-bridge.c:662:fuse_fd_cbk] glusterfs-fuse:
> > 189804: OPEN() /glusterfs/some/directory/somefile.txt => -1
> > (Input/output error)
> > ---------
> >
> >
> > Comments ?
> >
> >
> > --
> > Daniel Maher <dma+gluster AT witbe DOT net>
> >
> >
>




