Are you using a ZFS implementation that doesn't allow setting extended attributes on symlinks?
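If you're not sure, you can test it directly on either brick's filesystem. A rough sketch -- the brick path and xattr name below are made-up examples:

  # as root, somewhere on the brick's ZFS filesystem (path is hypothetical)
  cd /data/brick-docstore1
  touch testfile
  ln -s testfile testlink
  setfattr -h -n trusted.testattr -v 1 testlink   # -h acts on the link itself
  getfattr -h -n trusted.testattr testlink

If the setfattr fails with "Operation not supported", the backend can't store the trusted.glusterfs.<volume-id>.xtime markers gsyncd keeps on the entries it crawls, symlinks included.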
Tony Maro <tonym at evrichart.com> wrote:

>Well, I guess I'm carrying on a conversation with myself here, but I've
>turned on debug logging and gsyncd appears to be crashing in _query_xattr,
>which is odd because, as mentioned before, I was previously able to get
>this volume to sync the first 1 TB of data before this started. Now it
>won't even do that.
>
>To recap: I'm trying to set up geo-replication over SSH. The Gluster volume
>is a two-brick mirror, and the underlying filesystem is ZFS on both source
>and destination. The SSH session does appear to be started by the client,
>as the auth log on the destination server shows:
>
>Jul 30 08:21:37 backup-ds2 sshd[4364]: Accepted publickey for root from 10.200.1.6 port 38865 ssh2
>Jul 30 08:21:37 backup-ds2 sshd[4364]: pam_unix(sshd:session): session opened for user root by (uid=0)
>Jul 30 08:21:51 backup-ds2 sshd[4364]: Received disconnect from 10.200.1.6: 11: disconnected by user
>Jul 30 08:21:51 backup-ds2 sshd[4364]: pam_unix(sshd:session): session closed for user root
>
>I start the geo-replication with the following command:
>
>gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start
>
>Checking the status shows "starting..." for about 7 seconds and then it
>goes "faulty".
>
>The debug gluster.log file on the brick I run the command from shows:
>
>[2013-07-30 08:21:37.224934] I [monitor(monitor):21:set_state] Monitor: new state: starting...
>[2013-07-30 08:21:37.235110] I [monitor(monitor):80:monitor] Monitor: ------------------------------------------------------------
>[2013-07-30 08:21:37.235295] I [monitor(monitor):81:monitor] Monitor: starting gsyncd worker
>[2013-07-30 08:21:37.298254] I [gsyncd:354:main_i] <top>: syncing: gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
>[2013-07-30 08:21:37.302464] D [repce:175:push] RepceClient: call 21246:139871057643264:1375186897.3 __repce_version__() ...
>[2013-07-30 08:21:39.376665] D [repce:190:__call__] RepceClient: call 21246:139871057643264:1375186897.3 __repce_version__ -> 1.0
>[2013-07-30 08:21:39.376894] D [repce:175:push] RepceClient: call 21246:139871057643264:1375186899.38 version() ...
>[2013-07-30 08:21:39.378207] D [repce:190:__call__] RepceClient: call 21246:139871057643264:1375186899.38 version -> 1.0
>[2013-07-30 08:21:39.393198] D [resource:701:inhibit] DirectMounter: auxiliary glusterfs mount in place
>[2013-07-30 08:21:43.408195] D [resource:747:inhibit] DirectMounter: auxiliary glusterfs mount prepared
>[2013-07-30 08:21:43.408740] D [monitor(monitor):96:monitor] Monitor: worker seems to be connected (?? racy check)
>[2013-07-30 08:21:43.410413] D [repce:175:push] RepceClient: call 21246:139870643156736:1375186903.41 keep_alive(None,) ...
>[2013-07-30 08:21:43.411798] D [repce:190:__call__] RepceClient: call 21246:139870643156736:1375186903.41 keep_alive -> 1
>[2013-07-30 08:21:44.449774] D [master:220:volinfo_state_machine] <top>: (None, None) << (None, 24f8c92d) -> (None, 24f8c92d)
>[2013-07-30 08:21:44.450082] I [master:284:crawl] GMaster: new master is 24f8c92d-723e-4513-9593-40ef4b7e766a
>[2013-07-30 08:21:44.450254] I [master:288:crawl] GMaster: primary master with volume id 24f8c92d-723e-4513-9593-40ef4b7e766a ...
>[2013-07-30 08:21:44.450398] D [master:302:crawl] GMaster: entering .
>[2013-07-30 08:21:44.451534] E [syncdutils:178:log_raise_exception] <top>: glusterfs session went down [ENOTCONN]
>[2013-07-30 08:21:44.451721] E [syncdutils:184:log_raise_exception] <top>: FULL EXCEPTION TRACE:
>Traceback (most recent call last):
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 115, in main
>    main_i()
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 365, in main_i
>    local.service_loop(*[r for r in [remote] if r])
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 827, in service_loop
>    GMaster(self, args[0]).crawl_loop()
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 143, in crawl_loop
>    self.crawl()
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 304, in crawl
>    xtl = self.xtime(path)
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 74, in xtime
>    xt = rsc.server.xtime(path, self.uuid)
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 270, in ff
>    return f(*a)
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 365, in xtime
>    return struct.unpack('!II', Xattr.lgetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'xtime']), 8))
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 43, in lgetxattr
>    return cls._query_xattr( path, siz, 'lgetxattr', attr)
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 35, in _query_xattr
>    cls.raise_oserr()
>  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
>    raise OSError(errn, os.strerror(errn))
>OSError: [Errno 107] Transport endpoint is not connected
>[2013-07-30 08:21:44.453290] I [syncdutils:142:finalize] <top>: exiting.
>[2013-07-30 08:21:45.411412] D [monitor(monitor):100:monitor] Monitor: worker died in startup phase
>[2013-07-30 08:21:45.411653] I [monitor(monitor):21:set_state] Monitor: new state: faulty
>[2013-07-30 08:21:51.165136] I [syncdutils(monitor):142:finalize] <top>: exiting.
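The traceback above shows where it dies: the crawl is reading the trusted.glusterfs.<volume-id>.xtime xattr through gsyncd's auxiliary glusterfs mount, and lgetxattr comes back with ENOTCONN, i.e. that client mount has lost its connection to the bricks. The glusterfs client log for the aux mount should say why. You can also replay the query gsyncd makes by hand; a sketch, with a hypothetical mountpoint and the volume id taken from your log:

  # mount the master volume and read the same xtime marker gsyncd reads
  mkdir -p /mnt/docstore1
  mount -t glusterfs localhost:/docstore1 /mnt/docstore1
  getfattr -h -e hex -n trusted.glusterfs.24f8c92d-723e-4513-9593-40ef4b7e766a.xtime /mnt/docstore1

If that also fails with "Transport endpoint is not connected", the problem is between the client and the bricks (or in the ZFS backend refusing the xattr), not in geo-replication itself.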
>On Fri, Jul 26, 2013 at 10:42 AM, Tony Maro <tonym at evrichart.com> wrote:
>
>> Correction: Manually running the command after creating the temp
>> directory actually doesn't work. It doesn't error out; it just hangs and
>> never connects to the remote server. I don't know if this is something
>> within gsyncd or not...
>>
>> On Fri, Jul 26, 2013 at 10:38 AM, Tony Maro <tonym at evrichart.com> wrote:
>>
>>> Setting up geo-replication with an existing 3 TB of data is turning out
>>> to be a huge pain.
>>>
>>> It was working for a bit but would go faulty by the time it hit 1 TB
>>> synced. Multiple attempts resulted in the same thing.
>>>
>>> Now I don't know what's changed, but it never actually tries to log into
>>> the remote server anymore. Checking the "last" logs on the destination
>>> shows that it never attempts to make the SSH connection. The
>>> geo-replication command is:
>>>
>>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start
>>>
>>> From the log:
>>>
>>> [2013-07-26 10:26:04.317667] I [gsyncd:354:main_i] <top>: syncing: gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
>>> [2013-07-26 10:26:08.258853] I [syncdutils(monitor):142:finalize] <top>: exiting.
>>> [2013-07-26 10:26:08.259452] E [syncdutils:173:log_raise_exception] <top>: connection to peer is broken
>>> [2013-07-26 10:26:08.260386] E [resource:191:errlog] Popen: command "ssh -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-WlTfNb/gsycnd-ssh-%r@%h:%p root@backup-ds2.gluster /usr/lib/glusterfs/glusterfs/gsyncd --session-owner 24f8c92d-723e-4513-9593-40ef4b7e766a -N --listen --timeout 120 file:///data/docstore1" returned with 143
>>>
>>> When I attempt to run the SSH command from the logs directly in the
>>> console, ssh replies with:
>>>
>>> muxserver_listen bind(): No such file or directory
>>>
>>> And there's no gsyncd temp directory where specified. If I manually
>>> create that directory and re-run the same command, it works. The
>>> problem, of course, is that the tmp directory is randomly named, and
>>> starting Gluster geo-rep again will result in a new directory it tries
>>> to use.
>>>
>>> Running Gluster 3.3.1-ubuntu1~precise9.
>>>
>>> Any ideas why this would be happening? I did find that my Ubuntu
>>> packages were trying to access gsyncd in the wrong path, so I corrected
>>> that. I've also set up automatic SSH login as root, so I changed my ssh
>>> command (and my global ssh config) to make sure the options would work.
>>> Here are the important geo-rep configs:
>>>
>>> ssh_command: ssh
>>> remote_gsyncd: /usr/lib/glusterfs/glusterfs/gsyncd
>>> gluster_command_dir: /usr/sbin/
>>> gluster_params: xlator-option=*-dht.assert-no-child-down=true
>>>
>>> Thanks,
>>> Tony
>>>
>>
>> --
>> Thanks,
>>
>> Tony Maro
>> Chief Information Officer
>> EvriChart | www.evrichart.com
>> Advanced Records Management
>> Office | 888.801.2020 | 304.536.1290
>
>
>--
>Thanks,
>
>Tony Maro
>Chief Information Officer
>EvriChart | www.evrichart.com
>Advanced Records Management
>Office | 888.801.2020 | 304.536.1290
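On the "muxserver_listen bind(): No such file or directory" part: ssh creates the -S control socket itself, but not the socket's parent directory, and it bails out when that directory is missing. As far as I can tell from the behavior you describe, gsyncd sets up that /tmp/gsyncd-aux-ssh-* directory itself for the lifetime of the session and cleans it up on exit, which is why replaying the command from the log later fails. A minimal sketch of the same failure, with made-up paths:

  # parent directory of the control socket is missing -> bind() error
  ssh -oControlMaster=auto -S /tmp/no-such-dir/ctl-%r@%h:%p root@backup-ds2.gluster true
  # muxserver_listen bind(): No such file or directory

  mkdir /tmp/no-such-dir   # pre-create the directory and the same command works
  ssh -oControlMaster=auto -S /tmp/no-such-dir/ctl-%r@%h:%p root@backup-ds2.gluster true

So the missing directory is a symptom, not the cause: the worker tore the session down first (exit status 143 is 128+15, i.e. SIGTERM), taking its temp directory with it. The interesting question is why the worker is being killed, which loops back to the ENOTCONN crash above.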