Geo-rep fail

csaba at gluster.com (Csaba Henk) · Tue, 17 May 2011 14:00:03 +0530

On 05/17/11 13:04, anthony garnier wrote:
> Hi,
> I've put the client log in debug mode:
> # gluster volume geo-replication /soft/venus config log-level DEBUG
> geo-replication config updated successfully
>
> # gluster volume geo-replication /soft/venus config log-file
> /usr/local/var/log/glusterfs/geo-replication-slaves/${session_owner}:file%3A%2F%2F%2Fsoft%2Fvenus.log
>
> # gluster volume geo-replication athena /soft/venus config session-owner
> 28cbd261-3a3e-4a5a-b300-ea468483c944
>
> # gluster volume geo-replication athena /soft/venus start
> Starting geo-replication session between athena & /soft/venus has been
> successful
>
> # gluster volume geo-replication athena /soft/venus status
> MASTER SLAVE STATUS
> --------------------------------------------------------------------------------
> athena /soft/venus starting...
>
> and then :
>
> # gluster volume geo-replication athena /soft/venus status
> MASTER SLAVE STATUS
> --------------------------------------------------------------------------------
> athena /soft/venus faulty

Is this edited output? I would expect the status output to show the full
slave URL, i.e. file:///soft/venus.

> For client :
> cat
> /usr/local/var/log/glusterfs/geo-replication-slaves/28cbd261-3a3e-4a5a-b300-ea468483c944:file%3A%2F%2F%2Fsoft%2Fvenus.log
>
>
> [2011-05-17 09:20:40.519731] I [gsyncd(slave):287:main_i] <top>:
> syncing: file:///soft/venus
> [2011-05-17 09:20:40.520587] I [resource(slave):200:service_loop] FILE:
> slave listening
> [2011-05-17 09:20:40.532951] I [repce(slave):61:service_loop]
> RepceServer: terminating on reaching EOF.
> [2011-05-17 09:21:50.528803] I [gsyncd(slave):287:main_i] <top>:
> syncing: file:///soft/venus
> [2011-05-17 09:21:50.529666] I [resource(slave):200:service_loop] FILE:
> slave listening
> [2011-05-17 09:21:50.542349] I [repce(slave):61:service_loop]
> RepceServer: terminating on reaching EOF.
>
>
>
> For server :
> # cat
> /usr/local/var/log/glusterfs/geo-replication/athena/file%3A%2F%2F%2Fsoft%2Fvenus.log
>
> [2011-05-17 09:30:04.431369] I [monitor(monitor):42:monitor] Monitor:
> ------------------------------------------------------------
> [2011-05-17 09:30:04.431669] I [monitor(monitor):43:monitor] Monitor:
> starting gsyncd worker
> [2011-05-17 09:30:04.486852] I [gsyncd:287:main_i] <top>: syncing:
> gluster://localhost:athena -> file:///soft/venus
[...]
> raise RuntimeError("command failed: " + " ".join(argv))
> RuntimeError: command failed: /usr/local/sbin/glusterfs --xlator-option
> *-dht.assert-no-child-down=true -l
> /usr/local/var/log/glusterfs/geo-replication/athena/file%3A%2F%2F%2Fsoft%2Fvenus.gluster.log
> -s localhost --volfile-id athena --client-pid=-1
> /tmp/gsyncd-aux-mount-TEqjwY
> [2011-05-17 09:30:04.647973] D [monitor(monitor):57:monitor] Monitor:
> worker got connected in 0 sec, waiting 59 more to make sure it's fine

This is interesting, because the error you get now is not the same as in
your first post. More precisely, the _symptoms_ are different; the
underlying error might still be the same. I can imagine a race between
several exceptional events, where it is accidental which of them
interrupts the event flow first.

So it seems that the auxiliary glusterfs instance used by the master
gsyncd fails. (Side note: if you prefer client/server terminology to
master/slave, that's fine, but then the master should be called the
client and the slave the server, i.e. the reverse of what you do :) )
To see what's wrong with it, I again ask for the respective logs:

## setting DEBUG loglevel for master's aux glusterfs
# gluster volume geo-replication athena /soft/venus config \
      gluster-log-level DEBUG
## getting the path of the logfile of aux glusterfs
# gluster volume geo-replication athena /soft/venus config \
      gluster-log-file

So please post the contents of that log file.
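
If it is more convenient, the lookup and the dump can be combined. What
follows is only a sketch under assumptions: it assumes the
gluster-log-file query prints nothing but the path (as the log-file
query did in your output), and show_aux_log is a hypothetical helper
name, not something shipped with gluster.

```shell
# show_aux_log MASTER SLAVE [LINES]
# Hypothetical helper: print the tail of the aux glusterfs log that
# belongs to the given geo-replication session.
show_aux_log() {
    master=$1
    slave=$2
    lines=${3:-200}   # number of trailing lines to show; 200 is arbitrary

    # Ask gluster where the aux glusterfs instance writes its log.
    # Assumption: the command prints only the path.
    logfile=$(gluster volume geo-replication "$master" "$slave" \
                  config gluster-log-file) || return 1

    # Fail with a message if the path is not a readable file.
    if [ ! -r "$logfile" ]; then
        echo "cannot read log file: $logfile" >&2
        return 1
    fi

    tail -n "$lines" "$logfile"
}
```

With the function defined, `show_aux_log athena /soft/venus` run on the
master should print the part of the log worth posting. Restart the
session first so that the DEBUG output covers a fresh failure.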

Csaba