Well, I guess I'm carrying on a conversation with myself here, but I've turned on debug logging and gsyncd appears to be crashing in _query_xattr - which is odd because, as mentioned before, I was previously able to get this volume to sync the first 1 TB of data before this started; now it won't even do that.

To recap: I'm trying to set up geo-replication over SSH. The Gluster volume is a mirror setup with two bricks, and the underlying filesystem is ZFS on both source and destination. The SSH session does appear to be started by the client, since the auth log on the destination server records the following:

Jul 30 08:21:37 backup-ds2 sshd[4364]: Accepted publickey for root from 10.200.1.6 port 38865 ssh2
Jul 30 08:21:37 backup-ds2 sshd[4364]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jul 30 08:21:51 backup-ds2 sshd[4364]: Received disconnect from 10.200.1.6: 11: disconnected by user
Jul 30 08:21:51 backup-ds2 sshd[4364]: pam_unix(sshd:session): session closed for user root

I start the geo-replication with the following command:

gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start

Checking the status shows "starting..." for about 7 seconds, and then it goes "faulty". The debug gluster.log file on the brick I run the command from shows:

[2013-07-30 08:21:37.224934] I [monitor(monitor):21:set_state] Monitor: new state: starting...
[2013-07-30 08:21:37.235110] I [monitor(monitor):80:monitor] Monitor: ------------------------------------------------------------
[2013-07-30 08:21:37.235295] I [monitor(monitor):81:monitor] Monitor: starting gsyncd worker
[2013-07-30 08:21:37.298254] I [gsyncd:354:main_i] <top>: syncing: gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
[2013-07-30 08:21:37.302464] D [repce:175:push] RepceClient: call 21246:139871057643264:1375186897.3 __repce_version__() ...
[2013-07-30 08:21:39.376665] D [repce:190:__call__] RepceClient: call 21246:139871057643264:1375186897.3 __repce_version__ -> 1.0
[2013-07-30 08:21:39.376894] D [repce:175:push] RepceClient: call 21246:139871057643264:1375186899.38 version() ...
[2013-07-30 08:21:39.378207] D [repce:190:__call__] RepceClient: call 21246:139871057643264:1375186899.38 version -> 1.0
[2013-07-30 08:21:39.393198] D [resource:701:inhibit] DirectMounter: auxiliary glusterfs mount in place
[2013-07-30 08:21:43.408195] D [resource:747:inhibit] DirectMounter: auxiliary glusterfs mount prepared
[2013-07-30 08:21:43.408740] D [monitor(monitor):96:monitor] Monitor: worker seems to be connected (?? racy check)
[2013-07-30 08:21:43.410413] D [repce:175:push] RepceClient: call 21246:139870643156736:1375186903.41 keep_alive(None,) ...
[2013-07-30 08:21:43.411798] D [repce:190:__call__] RepceClient: call 21246:139870643156736:1375186903.41 keep_alive -> 1
[2013-07-30 08:21:44.449774] D [master:220:volinfo_state_machine] <top>: (None, None) << (None, 24f8c92d) -> (None, 24f8c92d)
[2013-07-30 08:21:44.450082] I [master:284:crawl] GMaster: new master is 24f8c92d-723e-4513-9593-40ef4b7e766a
[2013-07-30 08:21:44.450254] I [master:288:crawl] GMaster: primary master with volume id 24f8c92d-723e-4513-9593-40ef4b7e766a ...
[2013-07-30 08:21:44.450398] D [master:302:crawl] GMaster: entering .
[2013-07-30 08:21:44.451534] E [syncdutils:178:log_raise_exception] <top>: glusterfs session went down [ENOTCONN]
[2013-07-30 08:21:44.451721] E [syncdutils:184:log_raise_exception] <top>: FULL EXCEPTION TRACE:
Traceback (most recent call last):
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 115, in main
    main_i()
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/gsyncd.py", line 365, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 827, in service_loop
    GMaster(self, args[0]).crawl_loop()
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 143, in crawl_loop
    self.crawl()
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 304, in crawl
    xtl = self.xtime(path)
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/master.py", line 74, in xtime
    xt = rsc.server.xtime(path, self.uuid)
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 270, in ff
    return f(*a)
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/resource.py", line 365, in xtime
    return struct.unpack('!II', Xattr.lgetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'xtime']), 8))
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 43, in lgetxattr
    return cls._query_xattr( path, siz, 'lgetxattr', attr)
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 35, in _query_xattr
    cls.raise_oserr()
  File "/usr/lib/glusterfs/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 107] Transport endpoint is not connected
[2013-07-30 08:21:44.453290] I [syncdutils:142:finalize] <top>: exiting.
[2013-07-30 08:21:45.411412] D [monitor(monitor):100:monitor] Monitor: worker died in startup phase
[2013-07-30 08:21:45.411653] I [monitor(monitor):21:set_state] Monitor: new state: faulty
[2013-07-30 08:21:51.165136] I [syncdutils(monitor):142:finalize] <top>: exiting.
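In case it helps anyone else poke at this, the failing call can be reproduced outside gsyncd with a few lines of Python - roughly the same ctypes lgetxattr() that libcxattr.py does. This is just a rough sketch: the mount path is a placeholder (point it at the auxiliary glusterfs mount or any client mount of the master volume), I'm assuming the privileged trusted.glusterfs xattr namespace, and the UUID is the volume id from the log above:

# Rough standalone version of the lgetxattr() call from the traceback above.
# MOUNT is a placeholder; the UUID is the volume id from my log.
import ctypes, os, struct
from ctypes.util import find_library

libc = ctypes.CDLL(find_library("c"), use_errno=True)

MOUNT = b"/mnt/docstore1"   # point this at a glusterfs client mount of the master volume
UUID = b"24f8c92d-723e-4513-9593-40ef4b7e766a"
KEY = b"trusted.glusterfs." + UUID + b".xtime"

buf = ctypes.create_string_buffer(8)
ret = libc.lgetxattr(MOUNT, KEY, buf, 8)
if ret == -1:
    errn = ctypes.get_errno()
    # errno 107 (ENOTCONN) here would mean the mount itself is dead,
    # not that the xattr is merely missing
    raise OSError(errn, os.strerror(errn))
print(struct.unpack("!II", buf.raw))

Against a healthy mount this should either print the packed xtime or fail with a missing-attribute error; getting ENOTCONN the way gsyncd does would point at the glusterfs mount/brick side rather than at gsyncd itself.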
On Fri, Jul 26, 2013 at 10:42 AM, Tony Maro <tonym@evrichart.com> wrote:

> Correction: Manually running the command after creating the temp directory
> actually doesn't work either, but it doesn't error out - it just hangs and never
> connects to the remote server. Dunno if this is something within gsyncd or
> what...
>
>
> On Fri, Jul 26, 2013 at 10:38 AM, Tony Maro <tonym@evrichart.com> wrote:
>
>> Setting up geo-replication with an existing 3 TB of data is turning out
>> to be a huge pain.
>>
>> It was working for a bit but would go faulty by the time it hit 1 TB
>> synced. Multiple attempts resulted in the same thing.
>>
>> Now, I don't know what's changed, but it never actually tries to log into
>> the remote server anymore. Checking "last" logs on the destination shows
>> that it never actually attempts to make the SSH connection. The
>> geo-replication command is as such:
>>
>> gluster volume geo-replication docstore1 root@backup-ds2.gluster:/data/docstore1 start
>>
>> From the log:
>>
>> [2013-07-26 10:26:04.317667] I [gsyncd:354:main_i] <top>: syncing: gluster://localhost:docstore1 -> ssh://root@backup-ds2.gluster:/data/docstore1
>> [2013-07-26 10:26:08.258853] I [syncdutils(monitor):142:finalize] <top>: exiting.
>> [2013-07-26 10:26:08.259452] E [syncdutils:173:log_raise_exception] <top>: connection to peer is broken
>> [2013-07-26 10:26:08.260386] E [resource:191:errlog] Popen: command "ssh -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-WlTfNb/gsycnd-ssh-%r@%h:%p root@backup-ds2.gluster /usr/lib/glusterfs/glusterfs/gsyncd --session-owner 24f8c92d-723e-4513-9593-40ef4b7e766a -N --listen --timeout 120 file:///data/docstore1" returned with 143
>>
>> When I attempt to run the SSH command from the logs directly in the
>> console, ssh replies with:
>>
>> muxserver_listen bind(): No such file or directory
>>
>> And there's no gsyncd temp directory where specified. If I manually
>> create that directory and re-run the same command, it works. The problem,
>> of course, is that the tmp directory is randomly named, and starting
>> Gluster geo-rep again will result in a new directory it tries to use.
>>
>> Running Gluster 3.3.1-ubuntu1~precise9.
>>
>> Any ideas why this would be happening? I did find that my Ubuntu
>> packages were trying to access gsyncd at the wrong path, so I corrected
>> that. I've also got automatic SSH login as root working, so I changed my
>> ssh command (and my global ssh config) to make sure the options would
>> work. Here are the important geo-rep configs:
>>
>> ssh_command: ssh
>> remote_gsyncd: /usr/lib/glusterfs/glusterfs/gsyncd
>> gluster_command_dir: /usr/sbin/
>> gluster_params: xlator-option=*-dht.assert-no-child-down=true
>>
>> Thanks,
>> Tony
>>
>
>
> --
> Thanks,
>
> Tony Maro
> Chief Information Officer
> EvriChart | www.evrichart.com
> Advanced Records Management
> Office | 888.801.2020 | 304.536.1290
>

--
Thanks,

Tony Maro
Chief Information Officer
EvriChart | www.evrichart.com
Advanced Records Management
Office | 888.801.2020 | 304.536.1290
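P.S. For anyone who wants to reproduce the muxserver_listen error from the quoted messages above, my manual re-run boils down to roughly the sketch below. The temp directory name here is just an example (gsyncd picks a random one each run), and the only point being illustrated is that the directory for the -S control path has to exist before ssh can bind its mux socket:

# Pre-create the ControlMaster control-path directory, then run the same
# ssh command gsyncd logged. Directory name below is an example only.
import os
import subprocess

ctl_dir = "/tmp/gsyncd-aux-ssh-test"                 # stand-in for gsyncd's random temp dir
ctl_path = os.path.join(ctl_dir, "gsycnd-ssh-%r@%h:%p")

if not os.path.isdir(ctl_dir):
    # Without this, ssh fails with:
    # "muxserver_listen bind(): No such file or directory"
    os.makedirs(ctl_dir)

subprocess.check_call([
    "ssh", "-oControlMaster=auto", "-S", ctl_path,
    "root@backup-ds2.gluster",
    "/usr/lib/glusterfs/glusterfs/gsyncd",
    "--session-owner", "24f8c92d-723e-4513-9593-40ef4b7e766a",
    "-N", "--listen", "--timeout", "120",
    "file:///data/docstore1",
])

As noted in my correction above, this gets past the bind error but then just hangs for me, so it only demonstrates the missing-directory part of the problem.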