Well, I guess I'm carrying on a conversation with myself here, but I've turned on debug logging and gsyncd appears to be crashing in _query_xattr - which is odd because, as mentioned before, I was previously able to get this volume to sync the first 1 TB of data before this started; now it won't even do that.

To recap: I'm trying to set up geo-replication over SSH. The Gluster volume is a mirror setup with two bricks, and the underlying filesystem is ZFS on both source and destination. The SSH session does appear to be started by the client, since the auth log on the destination server records the following:

Jul 30 08:21:37 backup-ds2 sshd[4364]: Accepted publickey for root from 10.200.1.6 port 38865 ssh2
Jul 30 08:21:37 backup-ds2 sshd[4364]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jul 30 08:21:51 backup-ds2 sshd[4364]: Received disconnect from 10.200.1.6: 11: disconnected by user
Jul 30 08:21:51 backup-ds2 sshd[4364]: pam_unix(sshd:session): session closed for user root

I start the geo-replication with the following command:

gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start

Checking the status shows "starting..." for about 7 seconds, and then it goes "faulty". The debug gluster.log file on the brick I run the command from shows:

[2013-07-30 08:21:37.224934] I [monitor(monitor):21:set_state] Monitor: new state: starting...
[2013-07-30 08:21:37.235110] I [monitor(monitor):80:monitor] Monitor: ------------------------------------------------------------
[2013-07-30 08:21:37.235295] I [monitor(monitor):81:monitor] Monitor: starting gsyncd worker
[2013-07-30 08:21:37.298254] I [gsyncd:354:main_i] <top>: syncing: gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
[2013-07-30 08:21:37.302464] D [repce:175:push] RepceClient: call 21246:139871057643264:1375186897.3 __repce_version__() ...
[2013-07-30 08:21:39.376665] D [repce:190:__call__] RepceClient: call 21246:139871057643264:1375186897.3 __repce_version__ -> 1.0
[2013-07-30 08:21:39.376894] D [repce:175:push] RepceClient: call 21246:139871057643264:1375186899.38 version() ...
[2013-07-30 08:21:39.378207] D [repce:190:__call__] RepceClient: call 21246:139871057643264:1375186899.38 version -> 1.0
[2013-07-30 08:21:39.393198] D [resource:701:inhibit] DirectMounter: auxiliary glusterfs mount in place
[2013-07-30 08:21:43.408195] D [resource:747:inhibit] DirectMounter: auxiliary glusterfs mount prepared
[2013-07-30 08:21:43.408740] D [monitor(monitor):96:monitor] Monitor: worker seems to be connected (?? racy check)
[2013-07-30 08:21:43.410413] D [repce:175:push] RepceClient: call 21246:139870643156736:1375186903.41 keep_alive(None,) ...
[2013-07-30 08:21:43.411798] D [repce:190:__call__] RepceClient: call 21246:139870643156736:1375186903.41 keep_alive -> 1
[2013-07-30 08:21:44.449774] D [master:220:volinfo_state_machine] <top>: (None, None) << (None, 24f8c92d) -> (None, 24f8c92d)
[2013-07-30 08:21:44.450082] I [master:284:crawl] GMaster: new master is 24f8c92d-723e-4513-9593-40ef4b7e766a
[2013-07-30 08:21:44.450254] I [master:288:crawl] GMaster: primary master with volume id 24f8c92d-723e-4513-9593-40ef4b7e766a ...
[2013-07-30 08:21:44.450398] D [master:302:crawl] GMaster: entering .
[2013-07-30 08:21:44.451534] E [syncdutils:178:log_raise_exception] <top>: glusterfs session went down [ENOTCONN]
[2013-07-30 08:21:44.451721] E [syncdutils:184:log_raise_exception] <top>: FULL EXCEPTION TRACE:
Traceback (most recent call last):
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 115, in main
    main_i()
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 365, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 827, in service_loop
    GMaster(self, args[0]).crawl_loop()
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 143, in crawl_loop
    self.crawl()
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 304, in crawl
    xtl = self.xtime(path)
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 74, in xtime
    xt = rsc.server.xtime(path, self.uuid)
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 270, in ff
    return f(*a)
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 365, in xtime
    return struct.unpack('!II', Xattr.lgetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'xtime']), 8))
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 43, in lgetxattr
    return cls._query_xattr( path, siz, 'lgetxattr', attr)
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 35, in _query_xattr
    cls.raise_oserr()
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 107] Transport endpoint is not connected
[2013-07-30 08:21:44.453290] I [syncdutils:142:finalize] <top>: exiting.
[2013-07-30 08:21:45.411412] D [monitor(monitor):100:monitor] Monitor: worker died in startup phase
[2013-07-30 08:21:45.411653] I [monitor(monitor):21:set_state] Monitor: new state: faulty
[2013-07-30 08:21:51.165136] I [syncdutils(monitor):142:finalize] <top>: exiting.
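In case it helps anyone else poke at this, the failing call can be reproduced outside gsyncd with a few lines of Python - roughly the same ctypes lgetxattr() that libcxattr.py does. This is just a rough sketch: the mount path is a placeholder (point it at the auxiliary glusterfs mount or any client mount of the master volume), I'm assuming the privileged trusted.glusterfs xattr namespace, and the UUID is the volume id from the log above:

# Rough standalone version of the lgetxattr() call from the traceback above.
# MOUNT is a placeholder; the UUID is the volume id from my log.
import ctypes, os, struct
from ctypes.util import find_library

libc = ctypes.CDLL(find_library("c"), use_errno=True)

MOUNT = b"/mnt/docstore1"   # point this at a glusterfs client mount of the master volume
UUID = b"24f8c92d-723e-4513-9593-40ef4b7e766a"
KEY = b"trusted.glusterfs." + UUID + b".xtime"

buf = ctypes.create_string_buffer(8)
ret = libc.lgetxattr(MOUNT, KEY, buf, 8)
if ret == -1:
    errn = ctypes.get_errno()
    # errno 107 (ENOTCONN) here would mean the mount itself is dead,
    # not that the xattr is merely missing
    raise OSError(errn, os.strerror(errn))
print(struct.unpack("!II", buf.raw))

Against a healthy mount this should either print the packed xtime or fail with a missing-attribute error; getting ENOTCONN the way gsyncd does would point at the glusterfs mount/brick side rather than at gsyncd itself.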
On Fri, Jul 26, 2013 at 10:42 AM, Tony Maro <tonym@evrichart.com> wrote:

> Correction: Manually running the command after creating the temp directory
> actually doesn't work either, but it doesn't error out - it just hangs and never
> connects to the remote server. Dunno if this is something within gsyncd or
> what...
>
>
> On Fri, Jul 26, 2013 at 10:38 AM, Tony Maro <tonym@evrichart.com> wrote:
>
>> Setting up geo-replication with an existing 3 TB of data is turning out
>> to be a huge pain.
>>
>> It was working for a bit but would go faulty by the time it hit 1 TB
>> synced. Multiple attempts resulted in the same thing.
>>
>> Now, I don't know what's changed, but it never actually tries to log into
>> the remote server anymore. Checking "last" logs on the destination shows
>> that it never actually attempts to make the SSH connection. The
>> geo-replication command is as such:
>>
>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start
>>
>> From the log:
>>
>> [2013-07-26 10:26:04.317667] I [gsyncd:354:main_i] <top>: syncing: gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
>> [2013-07-26 10:26:08.258853] I [syncdutils(monitor):142:finalize] <top>: exiting.
>> [2013-07-26 10:26:08.259452] E [syncdutils:173:log_raise_exception] <top>: connection to peer is broken
>> [2013-07-26 10:26:08.260386] E [resource:191:errlog] Popen: command "ssh -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-WlTfNb/gsycnd-ssh-%r@%h:%p root@backup-ds2.gluster /usr/lib/glusterfs/glusterfs/gsyncd --session-owner 24f8c92d-723e-4513-9593-40ef4b7e766a -N --listen --timeout 120 file:///data/docstore1" returned with 143
>>
>> When I attempt to run the SSH command from the logs directly in the
>> console, ssh replies with:
>>
>> muxserver_listen bind(): No such file or directory
>>
>> And there's no gsyncd temp directory where specified. If I manually
>> create that directory and re-run the same command, it works. The problem,
>> of course, is that the tmp directory is randomly named, and starting
>> Gluster geo-rep again will result in a new directory it tries to use.
>>
>> Running Gluster 3.3.1-ubuntu1~precise9.
>>
>> Any ideas why this would be happening? I did find that my Ubuntu
>> packages were trying to access gsyncd at the wrong path, so I corrected
>> that. I've also got automatic SSH login as root working, so I changed my
>> ssh command (and my global ssh config) to make sure the options would
>> work. Here are the important geo-rep configs:
>>
>> ssh_command: ssh
>> remote_gsyncd: /usr/lib/glusterfs/glusterfs/gsyncd
>> gluster_command_dir: /usr/sbin/
>> gluster_params: xlator-option=*-dht.assert-no-child-down=true
>>
>> Thanks,
>> Tony
>>
>
>
> --
> Thanks,
>
> Tony Maro
> Chief Information Officer
> EvriChart | www.evrichart.com
> Advanced Records Management
> Office | 888.801.2020 | 304.536.1290
>

--
Thanks,

Tony Maro
Chief Information Officer
EvriChart | www.evrichart.com
Advanced Records Management
Office | 888.801.2020 | 304.536.1290
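P.S. For anyone who wants to reproduce the muxserver_listen error from the quoted messages above, my manual re-run boils down to roughly the sketch below. The temp directory name here is just an example (gsyncd picks a random one each run), and the only point being illustrated is that the directory for the -S control path has to exist before ssh can bind its mux socket:

# Pre-create the ControlMaster control-path directory, then run the same
# ssh command gsyncd logged. Directory name below is an example only.
import os
import subprocess

ctl_dir = "/tmp/gsyncd-aux-ssh-test"                 # stand-in for gsyncd's random temp dir
ctl_path = os.path.join(ctl_dir, "gsycnd-ssh-%r@%h:%p")

if not os.path.isdir(ctl_dir):
    # Without this, ssh fails with:
    # "muxserver_listen bind(): No such file or directory"
    os.makedirs(ctl_dir)

subprocess.check_call([
    "ssh", "-oControlMaster=auto", "-S", ctl_path,
    "root@backup-ds2.gluster",
    "/usr/lib/glusterfs/glusterfs/gsyncd",
    "--session-owner", "24f8c92d-723e-4513-9593-40ef4b7e766a",
    "-N", "--listen", "--timeout", "120",
    "file:///data/docstore1",
])

As noted in my correction above, this gets past the bind error but then just hangs for me, so it only demonstrates the missing-directory part of the problem.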