file locked / inaccessible if auto-heal required & confusing log messages (1.4rc3)

Daniel,

Here the "selfheal complete" is actually "selfheal completed
unsuccessfully". It does not heal the file, and open returns an error.
The healing code detects a conflict when both subvols claim to be the
latest and each marks the other as outdated. We see this happen in a
split-brain situation (the network between the AFR servers is broken
and different clients write to each one independently), or in the very
rare case where one of the servers goes down right while a write
operation is happening. I think you have hit the 2nd case. Here AFR
cannot really decide which subvol has the latest version, hence it
leaves it to the discretion of the user. The earlier 1.3 AFR did not
handle the split-brain situation, hence you did not see this.
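
To make the decision above concrete, here is a rough sketch in Python.
It is not the actual AFR code: the names (Verdict, decide) and the idea
of reducing each subvol's state to a single "accuses the other" flag are
assumptions made only for illustration.

```python
from enum import Enum


class Verdict(Enum):
    """Possible outcomes for a two-subvolume replica (hypothetical names)."""
    IN_SYNC = "nothing to heal"
    HEAL_FROM_FIRST = "copy first subvol onto second"
    HEAL_FROM_SECOND = "copy second subvol onto first"
    CONFLICT = "both claim to be latest; user must choose"


def decide(first_accuses_second: bool, second_accuses_first: bool) -> Verdict:
    """Pick a heal direction from what each subvol says about the other.

    Each flag is True when that subvol says "I am the latest and the
    other one is outdated". This is a simplification for illustration.
    """
    if first_accuses_second and second_accuses_first:
        # Both sides accuse each other: there is no safe source to copy from.
        return Verdict.CONFLICT
    if first_accuses_second:
        return Verdict.HEAL_FROM_FIRST
    if second_accuses_first:
        return Verdict.HEAL_FROM_SECOND
    return Verdict.IN_SYNC


if __name__ == "__main__":
    # Normal case: one server was down, so only the other one accuses it.
    print(decide(True, False).value)
    # The case discussed in this thread: mutual accusation.
    print(decide(True, True).value)
```

In the mutual-accusation case there is no source to heal from, which is
why the open fails instead of the file being silently overwritten.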

Krishna

On Wed, Dec 17, 2008 at 8:10 PM, Daniel Maher <dma+gluster at witbe.net> wrote:
> Hello,
>
> I recently upgraded my infrastructure from a 1.3.12 server-based AFR
> cluster to a 1.4rc3 client-based AFR cluster.  Among other things, I
> have noticed one very obvious change in the behaviour of self-healing
> between the two setups...
>
> The scenario is basic : one of the server nodes becomes inaccessible,
> and as a result, changes to a given file are not replicated.  When the
> downed node returns, and the file is accessed, the self-heal feature is
> triggered, thus ensuring the integrity of the data across all server nodes.
>
> So far so good ; however, between the previous setup and the current
> one, "something" has resulted in differing behaviour vis-à-vis the
> availability of said file.
>
> In the previous 1.3 server-based AFR setup, if a client attempted to
> write to the file, it was able to do so, with the change being
> replicated to the newly-returned node as part of the self-heal process.
>  Perfect.
>
> However, in the current 1.4 client-based AFR setup, if a client attempts
> to write to the file, instead of Gluster accepting the write and
> propagating the change during the self-heal, the file becomes
> momentarily inaccessible.  The self-heal process is then triggered, and
> the file - without the current attempted write - is replicated.
> Subsequent accesses are successful (and replicate as expected), but that
> "triggering write" still fails the first time.
>
> Furthermore, the log entry related to this particular process is
> confusing (log excerpt below).  It follows the form :
> 1. Self-heal triggered
> 2. Unable to resolve conflicting data
> 3. Self-heal completed
> 4. File not found
>
> The reported conflict does not, in fact, appear to affect the self-heal,
> in that the file is replicated as expected.  Is the error itself
> erroneous, or is there actually a problem ?  Furthermore, even though
> the file clearly exists, and has in fact just been replicated, Gluster
> then throws an error on OPEN.
>
> This can't possibly be the expected behaviour.  What within the
> underlying infrastructure has changed ?  How can it be fixed ?
>
>
> Some log snippets :
>
> Tomcat (on client)
> ---------
> [Thread-25]09:32:09,398 ERROR: Error in copyfile.
> java.io.FileNotFoundException: /glusterfs/some/directory/somefile.txt
> (Input/output error)
> ---------
>
> glusterfs.log (on client)
> ---------
> 2008-12-17 09:32:09 W [afr-self-heal-common.c:1005:afr_self_heal]
> nasdash-afr: performing self heal on
> /glusterfs/some/directory/somefile.txt (metadata=0 data=1 entry=0)
> 2008-12-17 09:32:09 E [afr-self-heal-data.c:777:afr_sh_data_fix]
> nasdash-afr: Unable to resolve conflicting data of
> /glusterfs/some/directory/somefile.txt. Please resolve manually by
> deleting the file /glusterfs/some/directory/somefile.txt from all but
> the preferred subvolume
> 2008-12-17 09:32:09 W [afr-self-heal-data.c:70:afr_sh_data_done]
> nasdash-afr: self heal of /glusterfs/some/directory/somefile.txt completed
> 2008-12-17 09:32:09 E [fuse-bridge.c:662:fuse_fd_cbk] glusterfs-fuse:
> 189804: OPEN() /glusterfs/some/directory/somefile.txt => -1
> (Input/output error)
> ---------
>
>
> Comments ?
>
>
> --
> Daniel Maher <dma+gluster AT witbe DOT net>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users
>


