Hi Carl,

On 2011-07-07, Carl Chenet <chaica@ohmytux.com> wrote:
> On 07/07/2011 15:25, Kaushik BV wrote:
>> Hi Chaica,
>>
>> This primarily means that the RPC communication between the master
>> gsyncd module and the slave gsyncd module is broken, which can happen
>> for various reasons. Check that all the prerequisites are satisfied:
>>
>> - FUSE is installed on the machine, since the geo-replication module
>> mounts the GlusterFS volume using FUSE to sync data.
>> - If the slave is a volume, check that the volume is started.
>> - If the slave is a plain directory, check that the directory has
>> already been created with the desired permissions (not applicable in
>> your case).
>> - If GlusterFS 3.2 is not installed in the default location on the
>> master but under a custom prefix, configure *gluster-command* to
>> point to its exact location (see the sketch after this list).
>> - If GlusterFS 3.2 is not installed in the default location on the
>> slave but under a custom prefix, configure *remote-gsyncd-command* to
>> point to the exact place where gsyncd is located.
>> - Locate the slave log and see if it has any anomalies.
>> - Passwordless SSH is set up properly between the host and the remote
>> machine (not applicable in your case).
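(An aside on the two *-command items above: they are set with the same
config syntax as the log-level command I give below. The /opt/glusterfs
prefix in this sketch is purely made up; point the options at wherever
your build actually lives:

# gluster volume geo-replication test-volume root@192.168.1.32::test-volume config gluster-command /opt/glusterfs/sbin/glusterfs
# gluster volume geo-replication test-volume root@192.168.1.32::test-volume config remote-gsyncd-command /opt/glusterfs/libexec/glusterfs/gsyncd

The first option names the glusterfs binary the master-side worker uses
for its aux mount; the second names the gsyncd path on the slave. With a
stock install neither should need to be set.)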
> Ok, the situation has slightly evolved. Now I do have a slave log and
> a clearer error message on the master:
>
> [2011-07-07 19:53:16.258866] I [monitor(monitor):42:monitor] Monitor: ------------------------------------------------------------
> [2011-07-07 19:53:16.259073] I [monitor(monitor):43:monitor] Monitor: starting gsyncd worker
> [2011-07-07 19:53:16.332720] I [gsyncd:286:main_i] <top>: syncing: gluster://localhost:test-volume -> ssh://192.168.1.32::test-volume
> [2011-07-07 19:53:16.343554] D [repce:131:push] RepceClient: call 6302:140305661662976:1310061196.34 __repce_version__() ...
> [2011-07-07 19:53:20.931523] D [repce:141:__call__] RepceClient: call 6302:140305661662976:1310061196.34 __repce_version__ -> 1.0
> [2011-07-07 19:53:20.932172] D [repce:131:push] RepceClient: call 6302:140305661662976:1310061200.93 version() ...
> [2011-07-07 19:53:20.933662] D [repce:141:__call__] RepceClient: call 6302:140305661662976:1310061200.93 version -> 1.0
> [2011-07-07 19:53:20.933861] D [repce:131:push] RepceClient: call 6302:140305661662976:1310061200.93 pid() ...
> [2011-07-07 19:53:20.934525] D [repce:141:__call__] RepceClient: call 6302:140305661662976:1310061200.93 pid -> 10075
> [2011-07-07 19:53:20.957355] E [syncdutils:131:log_raise_exception] <top>: FAIL:
> Traceback (most recent call last):
>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 102, in main
>     main_i()
>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 293, in main_i
>     local.connect()
>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 379, in connect
>     raise RuntimeError("command failed: " + " ".join(argv))
> RuntimeError: command failed: /usr/sbin/glusterfs --xlator-option *-dht.assert-no-child-down=true -L DEBUG -l /var/log/glusterfs/geo-replication/test-volume/ssh%3A%2F%2Froot%40192.168.1.32%3Agluster%3A%2F%2F127.0.0.1%3Atest-volume.gluster.log -s localhost --volfile-id test-volume --client-pid=-1 /tmp/gsyncd-aux-mount-hy6T_w
> [2011-07-07 19:53:20.960621] D [monitor(monitor):58:monitor] Monitor: worker seems to be connected (?? racy check)
> [2011-07-07 19:53:21.962501] D [monitor(monitor):62:monitor] Monitor: worker died in startup phase
>
> The command launched by glusterfs returns a shell error code of 255,
> which I believe means the command was terminated by a signal. In the
> slave log I have:
>
> [2011-07-07 19:54:49.571549] I [fuse-bridge.c:3218:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-z2Q2Hg
> [2011-07-07 19:54:49.572459] W [glusterfsd.c:712:cleanup_and_exit] (-->/lib/libc.so.6(clone+0x6d) [0x7f2c8998b02d] (-->/lib/libpthread.so.0(+0x68ba) [0x7f2c89c238ba] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xc5) [0x7f2c8a8f51b5]))) 0-: received signum (15), shutting down
> [2011-07-07 19:54:51.280207] W [write-behind.c:3029:init] 0-test-volume-write-behind: disabling write-behind for first 0 bytes
> [2011-07-07 19:54:51.291669] I [client.c:1935:notify] 0-test-volume-client-0: parent translators are ready, attempting connect on transport
> [2011-07-07 19:54:51.292329] I [client.c:1935:notify] 0-test-volume-client-1: parent translators are ready, attempting connect on transport
> [2011-07-07 19:55:38.582926] I [rpc-clnt.c:1531:rpc_clnt_reconfig] 0-test-volume-client-0: changing port to 24009 (from 0)
> [2011-07-07 19:55:38.583456] I [rpc-clnt.c:1531:rpc_clnt_reconfig] 0-test-volume-client-1: changing port to 24009 (from 0)

This is the content of the
/var/log/glusterfs/geo-replication/test-volume/ssh%3A%2F%2Froot%40192.168.1.32%3Agluster%3A%2F%2F127.0.0.1%3Atest-volume.gluster.log
file, right? (In that case it's not the "slave log" but the "master-side
gluster helper's log" :))

Can you re-post this log after setting the respective log level to DEBUG
with the command

# gluster volume geo-replication test-volume root@192.168.1.32::test-volume config gluster-log-level DEBUG

(I hope I "reverse engineered" the volume name / URL properly, but if
not, adjust the command as needed.) The parts _before_ the "unmounting
/tmp/gsyncd-aux-mount-z2Q2Hg" message would be the interesting stuff.
(Unless there is some randomness in the phenomenon and you get a
different failure symptom next time...)

Csaba
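P.S. In case it helps, a sketch of the rest of the sequence after setting
the log level; I'm assuming the session has to be restarted for the new
level to take effect, so skip the stop/start if the monitor picks it up
by itself:

# gluster volume geo-replication test-volume root@192.168.1.32::test-volume stop
# gluster volume geo-replication test-volume root@192.168.1.32::test-volume start
# less /var/log/glusterfs/geo-replication/test-volume/ssh%3A%2F%2Froot%40192.168.1.32%3Agluster%3A%2F%2F127.0.0.1%3Atest-volume.gluster.log

You can also reproduce the failure by hand on the master with the exact
mount command from the traceback, substituting a scratch mount point and
log file (note the quoting around the xlator option, which the shell
would otherwise try to glob), and then checking the exit status:

# mkdir -p /tmp/gsyncd-debug-mount
# /usr/sbin/glusterfs --xlator-option '*-dht.assert-no-child-down=true' -L DEBUG -l /tmp/gsyncd-debug.gluster.log -s localhost --volfile-id test-volume --client-pid=-1 /tmp/gsyncd-debug-mount
# echo $?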