Hi Carl,

On 2011-07-07, Carl Chenet <chaica@ohmytux.com> wrote:
> On 07/07/2011 15:25, Kaushik BV wrote:
>> Hi Chaica,
>>
>> This primarily means that the RPC communication between the master
>> gsyncd module and the slave gsyncd module is broken, which can happen
>> for various reasons. Check that all the prerequisites are satisfied:
>>
>> - FUSE is installed on the machine, since the geo-replication module
>> mounts the GlusterFS volume using FUSE to sync data.
>> - If the slave is a volume, check that the volume is started.
>> - If the slave is a plain directory, check that the directory has
>> already been created with the desired permissions (not applicable in
>> your case).
>> - If GlusterFS 3.2 is not installed in the default location on the
>> master but under a custom prefix, configure *gluster-command* to
>> point to its exact location (see the sketch after this list).
>> - If GlusterFS 3.2 is not installed in the default location on the
>> slave but under a custom prefix, configure *remote-gsyncd-command* to
>> point to the exact place where gsyncd is located.
>> - Locate the slave log and see if it has any anomalies.
>> - Passwordless SSH is set up properly between the host and the remote
>> machine (not applicable in your case).
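(An aside on the two *-command items above: they are set with the same
config syntax as the log-level command I give below. The /opt/glusterfs
prefix in this sketch is purely made up; point the options at wherever
your build actually lives:

# gluster volume geo-replication test-volume root@192.168.1.32::test-volume config gluster-command /opt/glusterfs/sbin/glusterfs
# gluster volume geo-replication test-volume root@192.168.1.32::test-volume config remote-gsyncd-command /opt/glusterfs/libexec/glusterfs/gsyncd

The first option names the glusterfs binary the master-side worker uses
for its aux mount; the second names the gsyncd path on the slave. With a
stock install neither should need to be set.)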
> Ok, the situation has slightly evolved. Now I do have a slave log and
> a clearer error message on the master:
>
> [2011-07-07 19:53:16.258866] I [monitor(monitor):42:monitor] Monitor: ------------------------------------------------------------
> [2011-07-07 19:53:16.259073] I [monitor(monitor):43:monitor] Monitor: starting gsyncd worker
> [2011-07-07 19:53:16.332720] I [gsyncd:286:main_i] <top>: syncing: gluster://localhost:test-volume -> ssh://192.168.1.32::test-volume
> [2011-07-07 19:53:16.343554] D [repce:131:push] RepceClient: call 6302:140305661662976:1310061196.34 __repce_version__() ...
> [2011-07-07 19:53:20.931523] D [repce:141:__call__] RepceClient: call 6302:140305661662976:1310061196.34 __repce_version__ -> 1.0
> [2011-07-07 19:53:20.932172] D [repce:131:push] RepceClient: call 6302:140305661662976:1310061200.93 version() ...
> [2011-07-07 19:53:20.933662] D [repce:141:__call__] RepceClient: call 6302:140305661662976:1310061200.93 version -> 1.0
> [2011-07-07 19:53:20.933861] D [repce:131:push] RepceClient: call 6302:140305661662976:1310061200.93 pid() ...
> [2011-07-07 19:53:20.934525] D [repce:141:__call__] RepceClient: call 6302:140305661662976:1310061200.93 pid -> 10075
> [2011-07-07 19:53:20.957355] E [syncdutils:131:log_raise_exception] <top>: FAIL:
> Traceback (most recent call last):
>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 102, in main
>     main_i()
>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 293, in main_i
>     local.connect()
>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 379, in connect
>     raise RuntimeError("command failed: " + " ".join(argv))
> RuntimeError: command failed: /usr/sbin/glusterfs --xlator-option *-dht.assert-no-child-down=true -L DEBUG -l /var/log/glusterfs/geo-replication/test-volume/ssh%3A%2F%2Froot%40192.168.1.32%3Agluster%3A%2F%2F127.0.0.1%3Atest-volume.gluster.log -s localhost --volfile-id test-volume --client-pid=-1 /tmp/gsyncd-aux-mount-hy6T_w
> [2011-07-07 19:53:20.960621] D [monitor(monitor):58:monitor] Monitor: worker seems to be connected (?? racy check)
> [2011-07-07 19:53:21.962501] D [monitor(monitor):62:monitor] Monitor: worker died in startup phase
>
> The command launched by glusterfs returns a shell error code of 255,
> which I believe means the command was terminated by a signal. In the
> slave log I have:
>
> [2011-07-07 19:54:49.571549] I [fuse-bridge.c:3218:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-z2Q2Hg
> [2011-07-07 19:54:49.572459] W [glusterfsd.c:712:cleanup_and_exit] (-->/lib/libc.so.6(clone+0x6d) [0x7f2c8998b02d] (-->/lib/libpthread.so.0(+0x68ba) [0x7f2c89c238ba] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xc5) [0x7f2c8a8f51b5]))) 0-: received signum (15), shutting down
> [2011-07-07 19:54:51.280207] W [write-behind.c:3029:init] 0-test-volume-write-behind: disabling write-behind for first 0 bytes
> [2011-07-07 19:54:51.291669] I [client.c:1935:notify] 0-test-volume-client-0: parent translators are ready, attempting connect on transport
> [2011-07-07 19:54:51.292329] I [client.c:1935:notify] 0-test-volume-client-1: parent translators are ready, attempting connect on transport
> [2011-07-07 19:55:38.582926] I [rpc-clnt.c:1531:rpc_clnt_reconfig] 0-test-volume-client-0: changing port to 24009 (from 0)
> [2011-07-07 19:55:38.583456] I [rpc-clnt.c:1531:rpc_clnt_reconfig] 0-test-volume-client-1: changing port to 24009 (from 0)

This is the content of the
/var/log/glusterfs/geo-replication/test-volume/ssh%3A%2F%2Froot%40192.168.1.32%3Agluster%3A%2F%2F127.0.0.1%3Atest-volume.gluster.log
file, right? (In that case it's not the "slave log" but the "master-side
gluster helper's log" :))

Can you re-post this log after setting the respective log level to DEBUG
with the command

# gluster volume geo-replication test-volume root@192.168.1.32::test-volume config gluster-log-level DEBUG

(I hope I "reverse engineered" the volume name / URL properly, but if
not, adjust the command as needed.) The parts _before_ the "unmounting
/tmp/gsyncd-aux-mount-z2Q2Hg" message would be the interesting stuff.
(Unless there is some randomness in the phenomenon and you get a
different failure symptom next time...)

Csaba
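P.S. In case it helps, a sketch of the rest of the sequence after setting
the log level; I'm assuming the session has to be restarted for the new
level to take effect, so skip the stop/start if the monitor picks it up
by itself:

# gluster volume geo-replication test-volume root@192.168.1.32::test-volume stop
# gluster volume geo-replication test-volume root@192.168.1.32::test-volume start
# less /var/log/glusterfs/geo-replication/test-volume/ssh%3A%2F%2Froot%40192.168.1.32%3Agluster%3A%2F%2F127.0.0.1%3Atest-volume.gluster.log

You can also reproduce the failure by hand on the master with the exact
mount command from the traceback, substituting a scratch mount point and
log file (note the quoting around the xlator option, which the shell
would otherwise try to glob), and then checking the exit status:

# mkdir -p /tmp/gsyncd-debug-mount
# /usr/sbin/glusterfs --xlator-option '*-dht.assert-no-child-down=true' -L DEBUG -l /tmp/gsyncd-debug.gluster.log -s localhost --volfile-id test-volume --client-pid=-1 /tmp/gsyncd-debug-mount
# echo $?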