The only other thing I can add is the following log entries from the SSH destination:

[2013-07-30 08:51:15.41] I [gsyncd(slave):289:main_i] <top>: syncing: file:///data/docstore1
[2013-07-30 08:51:15.1106] I [resource(slave):200:service_loop] FILE: slave listening
[2013-07-30 08:51:20.81000] I [repce(slave):60:service_loop] RepceServer: terminating on reaching EOF.
[2013-07-30 08:55:15.154587] I [resource(slave):206:service_loop] FILE: connection inactive for 120 seconds, stopping
[2013-07-30 08:55:15.154911] I [gsyncd(slave):301:main_i] <top>: exiting.

Which makes sense: the slave shows itself connected and listening, then gets an EOF from the source at the same moment the source server crashes out with the xattr issue.

And I must apologize to everyone - I hadn't realized Google was adding my signature to the bottom every time I hit reply. I'll have to see how to turn that off.
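
For what it's worth, one way to double-check Joe's question below about xattrs on symlinks is to walk the brick and try listing xattrs on anything that turns out to be a symlink. Something like this rough Python 3 sketch should do it, run as root on each brick - BRICK is only a placeholder, not my real brick path:

    #!/usr/bin/env python3
    # Rough check of the "ZFS won't do xattrs on symlinks" theory:
    # walk the brick, find any symlinks, and try to list their xattrs
    # without following the link (llistxattr), the same call family
    # gsyncd's libcxattr uses.
    import os

    BRICK = "/path/to/brick"   # placeholder - substitute the real brick directory

    found = 0
    for root, dirs, files in os.walk(BRICK):
        for name in dirs + files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                continue
            found += 1
            try:
                print(path, os.listxattr(path, follow_symlinks=False))
            except OSError as e:
                print(path, "FAILED:", e)
    print("symlinks found:", found)

If that reports zero symlinks, the symlink theory is off the table for this volume.
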
On Tue, Jul 30, 2013 at 11:09 AM, Tony Maro <tonym at evrichart.com> wrote:

> I'm using the Ubuntu repositories for Precise ( ppa:zfs-native/stable ), so not sure, but I can guarantee there are no symlinks anywhere within the volume. The data is all created and maintained by one app that I wrote, and symlinks aren't ever used.
>
>
> On Tue, Jul 30, 2013 at 10:03 AM, Joe Julian <joe at julianfamily.org> wrote:
>
>> Are you using the zfs that doesn't allow setting extended attributes on symlinks?
>>
>> Tony Maro <tonym at evrichart.com> wrote:
>>>
>>> Well I guess I'm carrying on a conversation with myself here, but I've turned on Debug and gsyncd appears to be crashing in _query_xattr - which is odd because as mentioned before I was previously able to get this volume to sync the first 1TB of data before this started, but now it won't even do that.
>>>
>>> To recap, I'm trying to set up geo-rep over SSH. The Gluster volume is a mirror setup with two bricks. The underlying filesystem is ZFS on both source and destination. The SSH session appears to be started by the client, as the auth log on the destination server does log the following:
>>>
>>> Jul 30 08:21:37 backup-ds2 sshd[4364]: Accepted publickey for root from 10.200.1.6 port 38865 ssh2
>>> Jul 30 08:21:37 backup-ds2 sshd[4364]: pam_unix(sshd:session): session opened for user root by (uid=0)
>>> Jul 30 08:21:51 backup-ds2 sshd[4364]: Received disconnect from 10.200.1.6: 11: disconnected by user
>>> Jul 30 08:21:51 backup-ds2 sshd[4364]: pam_unix(sshd:session): session closed for user root
>>>
>>> I begin the geo-rep with the following command:
>>>
>>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start
>>>
>>> Checking the status will show "starting..." for about 7 seconds and then it goes "faulty".
>>>
>>> The debug gluster.log file on the brick I run the command from shows:
>>>
>>> [2013-07-30 08:21:37.224934] I [monitor(monitor):21:set_state] Monitor: new state: starting...
>>> [2013-07-30 08:21:37.235110] I [monitor(monitor):80:monitor] Monitor: ------------------------------------------------------------
>>> [2013-07-30 08:21:37.235295] I [monitor(monitor):81:monitor] Monitor: starting gsyncd worker
>>> [2013-07-30 08:21:37.298254] I [gsyncd:354:main_i] <top>: syncing: gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
>>> [2013-07-30 08:21:37.302464] D [repce:175:push] RepceClient: call 21246:139871057643264:1375186897.3 __repce_version__() ...
>>> [2013-07-30 08:21:39.376665] D [repce:190:__call__] RepceClient: call 21246:139871057643264:1375186897.3 __repce_version__ -> 1.0
>>> [2013-07-30 08:21:39.376894] D [repce:175:push] RepceClient: call 21246:139871057643264:1375186899.38 version() ...
>>> [2013-07-30 08:21:39.378207] D [repce:190:__call__] RepceClient: call 21246:139871057643264:1375186899.38 version -> 1.0
>>> [2013-07-30 08:21:39.393198] D [resource:701:inhibit] DirectMounter: auxiliary glusterfs mount in place
>>> [2013-07-30 08:21:43.408195] D [resource:747:inhibit] DirectMounter: auxiliary glusterfs mount prepared
>>> [2013-07-30 08:21:43.408740] D [monitor(monitor):96:monitor] Monitor: worker seems to be connected (?? racy check)
>>> [2013-07-30 08:21:43.410413] D [repce:175:push] RepceClient: call 21246:139870643156736:1375186903.41 keep_alive(None,) ...
>>> [2013-07-30 08:21:43.411798] D [repce:190:__call__] RepceClient: call 21246:139870643156736:1375186903.41 keep_alive -> 1
>>> [2013-07-30 08:21:44.449774] D [master:220:volinfo_state_machine] <top>: (None, None) << (None, 24f8c92d) -> (None, 24f8c92d)
>>> [2013-07-30 08:21:44.450082] I [master:284:crawl] GMaster: new master is 24f8c92d-723e-4513-9593-40ef4b7e766a
>>> [2013-07-30 08:21:44.450254] I [master:288:crawl] GMaster: primary master with volume id 24f8c92d-723e-4513-9593-40ef4b7e766a ...
>>> [2013-07-30 08:21:44.450398] D [master:302:crawl] GMaster: entering .
>>> [2013-07-30 08:21:44.451534] E [syncdutils:178:log_raise_exception] <top>: glusterfs session went down [ENOTCONN]
>>> [2013-07-30 08:21:44.451721] E [syncdutils:184:log_raise_exception] <top>: FULL EXCEPTION TRACE:
>>> Traceback (most recent call last):
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 115, in main
>>>     main_i()
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 365, in main_i
>>>     local.service_loop(*[r for r in [remote] if r])
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 827, in service_loop
>>>     GMaster(self, args[0]).crawl_loop()
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 143, in crawl_loop
>>>     self.crawl()
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 304, in crawl
>>>     xtl = self.xtime(path)
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 74, in xtime
>>>     xt = rsc.server.xtime(path, self.uuid)
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 270, in ff
>>>     return f(*a)
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 365, in xtime
>>>     return struct.unpack('!II', Xattr.lgetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'xtime']), 8))
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 43, in lgetxattr
>>>     return cls._query_xattr( path, siz, 'lgetxattr', attr)
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 35, in _query_xattr
>>>     cls.raise_oserr()
>>>   File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
>>>     raise OSError(errn, os.strerror(errn))
>>> OSError: [Errno 107] Transport endpoint is not connected
>>> [2013-07-30 08:21:44.453290] I [syncdutils:142:finalize] <top>: exiting.
>>> [2013-07-30 08:21:45.411412] D [monitor(monitor):100:monitor] Monitor: worker died in startup phase
>>> [2013-07-30 08:21:45.411653] I [monitor(monitor):21:set_state] Monitor: new state: faulty
>>> [2013-07-30 08:21:51.165136] I [syncdutils(monitor):142:finalize] <top>: exiting.
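
What strikes me about the trace above is that the error is ENOTCONN ("Transport endpoint is not connected") coming out of lgetxattr on the auxiliary glusterfs mount - that usually means the fuse mount itself has gone away, not that the xattr is missing. If I'm reading resource.py right, GX_NSPACE ends up as "trusted.glusterfs" when running as root, so the attribute being asked for should be trusted.glusterfs.<volume-id>.xtime. A quick way to poke at that by hand, outside of gsyncd, is something like this (untested sketch - the mount point is a placeholder, and I'm not certain a plain client mount exposes the marker xtime exactly the way the aux mount does):

    #!/usr/bin/env python3
    # Try the same xtime xattr read the worker dies on, against a
    # hand-made fuse mount of the master volume, to see whether we get
    # the data, a different error, or the same ENOTCONN (the mount dying).
    import errno
    import os

    MOUNT = "/mnt/docstore1-test"                     # placeholder mount point
    VOLID = "24f8c92d-723e-4513-9593-40ef4b7e766a"    # volume id from the GMaster log line
    ATTR = "trusted.glusterfs.%s.xtime" % VOLID       # assuming GX_NSPACE == "trusted.glusterfs"

    try:
        val = os.getxattr(MOUNT, ATTR)
        print("xtime readable, %d bytes" % len(val))
    except OSError as e:
        if e.errno == errno.ENOTCONN:
            print("same ENOTCONN - the mount itself is dropping, check the client log")
        else:
            print("different failure:", e)

If that reproduces the ENOTCONN, the next place to look is probably the aux mount's glusterfs client log rather than the xattr layer.
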
>>>
>>> On Fri, Jul 26, 2013 at 10:42 AM, Tony Maro <tonym at evrichart.com> wrote:
>>>
>>>> Correction: Manually running the command after creating the temp directory actually doesn't work - it doesn't error out, it just hangs and never connects to the remote server. Dunno if this is something within gsyncd or what...
>>>>
>>>>
>>>> On Fri, Jul 26, 2013 at 10:38 AM, Tony Maro <tonym at evrichart.com> wrote:
>>>>
>>>>> Setting up Geo-replication with an existing 3 TB of data is turning out to be a huge pain.
>>>>>
>>>>> It was working for a bit but would go faulty by the time it hit 1TB synced. Multiple attempts resulted in the same thing.
>>>>>
>>>>> Now, I don't know what's changed, but it never actually tries to log into the remote server anymore. Checking "last" logs on the destination shows that it never attempts to make the SSH connection. The geo-replication command is as follows:
>>>>>
>>>>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start
>>>>>
>>>>> From the log:
>>>>>
>>>>> [2013-07-26 10:26:04.317667] I [gsyncd:354:main_i] <top>: syncing: gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
>>>>> [2013-07-26 10:26:08.258853] I [syncdutils(monitor):142:finalize] <top>: exiting.
>>>>> [2013-07-26 10:26:08.259452] E [syncdutils:173:log_raise_exception] <top>: connection to peer is broken
>>>>> [2013-07-26 10:26:08.260386] E [resource:191:errlog] Popen: command "ssh -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-WlTfNb/gsycnd-ssh-%r@%h:%p root@backup-ds2.gluster /usr/lib/glusterfs/glusterfs/gsyncd --session-owner 24f8c92d-723e-4513-9593-40ef4b7e766a -N --listen --timeout 120 file:///data/docstore1" returned with 143
>>>>>
>>>>> When I attempt to run the SSH command from the logs directly in the console, ssh replies with:
>>>>>
>>>>> muxserver_listen bind(): No such file or directory
>>>>>
>>>>> And there's no gsyncd temp directory where specified. If I manually create that directory and re-run the same command, it works. The problem, of course, is that the tmp directory is randomly named, and starting Gluster geo-rep again will result in a new directory it tries to use.
>>>>>
>>>>> Running Gluster 3.3.1-ubuntu1~precise9.
>>>>>
>>>>> Any ideas why this would be happening? I did find that my Ubuntu packages were trying to access gsyncd in the wrong path, so I corrected that. I've also got automatic SSH login as root working, so I changed my ssh command (and my global ssh config) to make sure the options would work. Here are the important geo-rep configs:
>>>>>
>>>>> ssh_command: ssh
>>>>> remote_gsyncd: /usr/lib/glusterfs/glusterfs/gsyncd
>>>>> gluster_command_dir: /usr/sbin/
>>>>> gluster_params: xlator-option=*-dht.assert-no-child-down=true
>>>>>
>>>>> Thanks,
>>>>> Tony
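
One more note on the muxserver_listen error quoted above: as far as I can tell, the -S control socket lives inside a temp directory that gsyncd creates for itself when it builds that ssh command, so replaying the command by hand fails simply because the directory is already gone. If anyone wants to replay that leg manually, something along these lines should be roughly equivalent (sketch only - it just recreates a private control-path directory and reruns the exact command from my log; the hang I mentioned in the correction is, I suspect, just gsyncd sitting in --listen mode waiting for the master to talk to it over stdin, so no output is expected):

    #!/usr/bin/env python3
    # Recreate the missing ControlMaster socket directory, then rerun the
    # same ssh command gsyncd logged, so the -S path actually exists.
    # The remote host, gsyncd path, session owner and slave path are the
    # ones from the Popen error above.
    import subprocess
    import tempfile

    ctldir = tempfile.mkdtemp(prefix="gsyncd-aux-ssh-")
    cmd = [
        "ssh", "-oControlMaster=auto",
        "-S", "%s/gsycnd-ssh-%%r@%%h:%%p" % ctldir,   # same socket name as in the log
        "root@backup-ds2.gluster",
        "/usr/lib/glusterfs/glusterfs/gsyncd",
        "--session-owner", "24f8c92d-723e-4513-9593-40ef4b7e766a",
        "-N", "--listen", "--timeout", "120",
        "file:///data/docstore1",
    ]
    print("control dir:", ctldir)
    # Expect it to sit there silently in --listen mode; Ctrl-C to stop.
    subprocess.call(cmd)
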
--
Thanks,

*Tony Maro*
Chief Information Officer
EvriChart • www.evrichart.com
Advanced Records Management
Office | 888.801.2020 • 304.536.1290