Geo-replication fails to even try to start; doesn't create /tmp/gsyncd-* dir

I'm using the Ubuntu repositories for Precise (ppa:zfs-native/stable), so I'm
not sure, but I can guarantee there are no symlinks anywhere within the
volume.  The data is all created and maintained by one app that I wrote, and
symlinks aren't ever used.
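
For what it's worth, here's the sanity check I'd run to see whether this
ZFS build allows xattrs on symlinks at all.  A rough sketch, run as root
on the brick filesystem (the paths are just examples):

# create a throwaway symlink on the brick
ln -s /etc/hostname /data/docstore1/xattr-test-link
# set and read back a trusted.* xattr on the link itself (-h = don't
# follow the symlink); gluster uses the trusted namespace, and user.*
# xattrs aren't allowed on symlinks on Linux regardless of filesystem
setfattr -h -n trusted.test -v 1 /data/docstore1/xattr-test-link
getfattr -h -d -m trusted /data/docstore1/xattr-test-link
rm /data/docstore1/xattr-test-link

If the setfattr fails with "Operation not supported", that would point at
the symlink-xattr limitation Joe mentioned.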


On Tue, Jul 30, 2013 at 10:03 AM, Joe Julian <joe@julianfamily.org> wrote:

> Are you using a ZFS build that doesn't allow setting extended attributes on
> symlinks?
>
> Tony Maro <tonym@evrichart.com> wrote:
>>
>> Well, I guess I'm carrying on a conversation with myself here, but I've
>> turned on debug logging and gsyncd appears to be crashing in _query_xattr -
>> which is odd because, as mentioned before, I was previously able to get
>> this volume to sync the first 1 TB of data before this started, but now it
>> won't even do that.
>>
>> To recap, I'm trying to set up geo-rep over SSH.  The Gluster volume is a
>> two-brick mirror, and the underlying filesystem is ZFS on both source and
>> destination.  The SSH session does appear to be started by the client,
>> since the auth log on the destination server records the following:
>>
>> Jul 30 08:21:37 backup-ds2 sshd[4364]: Accepted publickey for root from
>> 10.200.1.6 port 38865 ssh2
>> Jul 30 08:21:37 backup-ds2 sshd[4364]: pam_unix(sshd:session): session
>> opened for user root by (uid=0)
>> Jul 30 08:21:51 backup-ds2 sshd[4364]: Received disconnect from
>> 10.200.1.6: 11: disconnected by user
>> Jul 30 08:21:51 backup-ds2 sshd[4364]: pam_unix(sshd:session): session
>> closed for user root
>>
>> I begin the geo-rep with the following command:
>>
>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start
>>
>> Checking the status shows "starting..." for about 7 seconds, and then it
>> goes to "faulty".
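>>
>> (For reference, I'm checking with the standard status and config
>> commands against the same session:
>>
>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 status
>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 config
>>
>> The second one dumps the session options, which is where the config
>> values quoted further down come from.)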
>>
>> The debug gluster.log file on the brick I run the command from shows:
>>
>> [2013-07-30 08:21:37.224934] I [monitor(monitor):21:set_state] Monitor:
>> new state: starting...
>> [2013-07-30 08:21:37.235110] I [monitor(monitor):80:monitor] Monitor:
>> ------------------------------------------------------------
>> [2013-07-30 08:21:37.235295] I [monitor(monitor):81:monitor] Monitor:
>> starting gsyncd worker
>> [2013-07-30 08:21:37.298254] I [gsyncd:354:main_i] <top>: syncing:
>> gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
>> [2013-07-30 08:21:37.302464] D [repce:175:push] RepceClient: call
>> 21246:139871057643264:1375186897.3 __repce_version__() ...
>> [2013-07-30 08:21:39.376665] D [repce:190:__call__] RepceClient: call
>> 21246:139871057643264:1375186897.3 __repce_version__ -> 1.0
>> [2013-07-30 08:21:39.376894] D [repce:175:push] RepceClient: call
>> 21246:139871057643264:1375186899.38 version() ...
>> [2013-07-30 08:21:39.378207] D [repce:190:__call__] RepceClient: call
>> 21246:139871057643264:1375186899.38 version -> 1.0
>> [2013-07-30 08:21:39.393198] D [resource:701:inhibit] DirectMounter:
>> auxiliary glusterfs mount in place
>> [2013-07-30 08:21:43.408195] D [resource:747:inhibit] DirectMounter:
>> auxiliary glusterfs mount prepared
>> [2013-07-30 08:21:43.408740] D [monitor(monitor):96:monitor] Monitor:
>> worker seems to be connected (?? racy check)
>> [2013-07-30 08:21:43.410413] D [repce:175:push] RepceClient: call
>> 21246:139870643156736:1375186903.41 keep_alive(None,) ...
>> [2013-07-30 08:21:43.411798] D [repce:190:__call__] RepceClient: call
>> 21246:139870643156736:1375186903.41 keep_alive -> 1
>> [2013-07-30 08:21:44.449774] D [master:220:volinfo_state_machine] <top>:
>> (None, None) << (None, 24f8c92d) -> (None, 24f8c92d)
>> [2013-07-30 08:21:44.450082] I [master:284:crawl] GMaster: new master is
>> 24f8c92d-723e-4513-9593-40ef4b7e766a
>> [2013-07-30 08:21:44.450254] I [master:288:crawl] GMaster: primary master
>> with volume id 24f8c92d-723e-4513-9593-40ef4b7e766a ...
>> [2013-07-30 08:21:44.450398] D [master:302:crawl] GMaster: entering .
>> [2013-07-30 08:21:44.451534] E [syncdutils:178:log_raise_exception]
>> <top>: glusterfs session went down [ENOTCONN]
>> [2013-07-30 08:21:44.451721] E [syncdutils:184:log_raise_exception]
>> <top>: FULL EXCEPTION TRACE:
>> Traceback (most recent call last):
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line
>> 115, in main
>>     main_i()
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line
>> 365, in main_i
>>     local.service_loop(*[r for r in [remote] if r])
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line
>> 827, in service_loop
>>     GMaster(self, args[0]).crawl_loop()
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line
>> 143, in crawl_loop
>>     self.crawl()
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line
>> 304, in crawl
>>     xtl = self.xtime(path)
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line
>> 74, in xtime
>>     xt = rsc.server.xtime(path, self.uuid)
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line
>> 270, in ff
>>     return f(*a)
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line
>> 365, in xtime
>>     return struct.unpack('!II', Xattr.lgetxattr(path,
>> '.'.join([cls.GX_NSPACE, uuid, 'xtime']), 8))
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py",
>> line 43, in lgetxattr
>>     return cls._query_xattr( path, siz, 'lgetxattr', attr)
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py",
>> line 35, in _query_xattr
>>     cls.raise_oserr()
>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py",
>> line 25, in raise_oserr
>>     raise OSError(errn, os.strerror(errn))
>> OSError: [Errno 107] Transport endpoint is not connected
>> [2013-07-30 08:21:44.453290] I [syncdutils:142:finalize] <top>: exiting.
>> [2013-07-30 08:21:45.411412] D [monitor(monitor):100:monitor] Monitor:
>> worker died in startup phase
>> [2013-07-30 08:21:45.411653] I [monitor(monitor):21:set_state] Monitor:
>> new state: faulty
>> [2013-07-30 08:21:51.165136] I [syncdutils(monitor):142:finalize] <top>:
>> exiting.
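>>
>> Since the crash is in the xtime xattr lookup, the same query can be made
>> by hand to see whether it's the auxiliary glusterfs mount that's dying.
>> Rough sketch - the mount point is made up, and the xattr name is pieced
>> together from the traceback (trusted.glusterfs.<master-uuid>.xtime) plus
>> the volume id in the log above:
>>
>> # throwaway fuse mount of the master volume
>> mkdir -p /mnt/gsync-test
>> glusterfs --volfile-server=localhost --volfile-id=docstore1 /mnt/gsync-test
>> # query the geo-rep xtime xattr the same way gsyncd's lgetxattr does
>> getfattr -n trusted.glusterfs.24f8c92d-723e-4513-9593-40ef4b7e766a.xtime -e hex /mnt/gsync-test
>> umount /mnt/gsync-test
>>
>> If the mount is going away underneath gsyncd, the getfattr should fail
>> with the same "Transport endpoint is not connected".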
>>
>>
>>
>> On Fri, Jul 26, 2013 at 10:42 AM, Tony Maro <tonym@evrichart.com> wrote:
>>
>>> Correction: manually running the command after creating the temp
>>> directory actually doesn't work - it doesn't error out, it just hangs and
>>> never connects to the remote server.  I don't know if this is something
>>> within gsyncd or what...
>>>
>>>
>>> On Fri, Jul 26, 2013 at 10:38 AM, Tony Maro <tonym@evrichart.com> wrote:
>>>
>>>> Setting up Geo-replication with an existing 3 TB of data is turning out
>>>> to be a huge pain.
>>>>
>>>> It was working for a bit, but it would go faulty by the time it had
>>>> synced about 1 TB.  Multiple attempts ended the same way.
>>>>
>>>> Now, I don't know what's changed, but it no longer even tries to log
>>>> into the remote server.  Checking "last" on the destination shows that
>>>> no SSH connection is ever attempted.  The geo-replication command is as
>>>> follows:
>>>>
>>>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start
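>>>>
>>>> (Key-based SSH as root to the slave does still work when tested by
>>>> hand - e.g. something like the following, using the remote_gsyncd path
>>>> from my config below, succeeds:
>>>>
>>>> ssh root@backup-ds2.gluster 'test -x /usr/lib/glusterfs/glusterfs/gsyncd && echo gsyncd ok'
>>>>
>>>> so the key auth itself isn't the problem.)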
>>>>
>>>> From the log:
>>>>
>>>> [2013-07-26 10:26:04.317667] I [gsyncd:354:main_i] <top>: syncing:
>>>> gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
>>>> [2013-07-26 10:26:08.258853] I [syncdutils(monitor):142:finalize]
>>>> <top>: exiting.
>>>> [2013-07-26 10:26:08.259452] E [syncdutils:173:log_raise_exception]
>>>> <top>: connection to peer is broken
>>>> [2013-07-26 10:26:08.260386] E [resource:191:errlog] Popen: command
>>>> "ssh -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-WlTfNb/gsycnd-ssh-%r@%h:%p
>>>> root@backup-ds2.gluster /usr/lib/glusterfs/glusterfs/gsyncd
>>>> --session-owner 24f8c92d-723e-4513-9593-40ef4b7e766a -N --listen --timeout
>>>> 120 file:///data/docstore1" returned with 143
>>>>
>>>> When I attempt to run the SSH command from the logs directly in the
>>>> console, ssh replies with:
>>>>
>>>> muxserver_listen bind(): No such file or directory
>>>>
>>>> And there's no gsyncd temp directory where specified.  If I manually
>>>> create that directory and re-run the same command, it works.  The
>>>> problem, of course, is that the tmp directory is randomly named, so
>>>> starting Gluster geo-rep again will try to use a new one.
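>>>>
>>>> The failure is reproducible with plain ssh: the parent directory of
>>>> the -S control socket has to exist before ControlMaster can bind it.
>>>> A quick sketch (the socket path is made up):
>>>>
>>>> # prints the same muxserver_listen bind() error while the dir is missing
>>>> ssh -oControlMaster=auto -S /tmp/gsyncd-missing/sock root@backup-ds2.gluster true
>>>> # works once the directory exists
>>>> mkdir -p /tmp/gsyncd-missing
>>>> ssh -oControlMaster=auto -S /tmp/gsyncd-missing/sock root@backup-ds2.gluster true
>>>>
>>>> As I understand it, gsyncd is supposed to create that temp directory
>>>> itself before spawning ssh, which is why its absence here is so odd.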
>>>>
>>>> Running Gluster 3.3.1-ubuntu1~precise9
>>>>
>>>> Any ideas why this would be happening?  I did find that my Ubuntu
>>>> packages were trying to access gsyncd at the wrong path, so I corrected
>>>> that.  I've also got passwordless SSH login as root set up, so I changed
>>>> my ssh command (and my global ssh config) to make sure the options would
>>>> work.  Here are the important geo-rep configs:
>>>>
>>>> ssh_command: ssh
>>>> remote_gsyncd: /usr/lib/glusterfs/glusterfs/gsyncd
>>>> gluster_command_dir: /usr/sbin/
>>>> gluster_params: xlator-option=*-dht.assert-no-child-down=true
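>>>>
>>>> (Those were set with the per-session config command - I believe this
>>>> is the right syntax for 3.3, using the same option names as in the
>>>> list above:
>>>>
>>>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 config remote_gsyncd /usr/lib/glusterfs/glusterfs/gsyncd
>>>>
>>>> and likewise for the other options.)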
>>>>
>>>> Thanks,
>>>> Tony
>>>>
>>>
>>>
>>>


-- 
Thanks,

*Tony Maro*
Chief Information Officer
EvriChart • www.evrichart.com
Advanced Records Management
Office | 888.801.2020 • 304.536.1290