Faulty geo-replication

Thibault Godouet <tibo92@xxxxxxxxxxx> · Mon, 24 Aug 2015 16:39:51 +0100

I have had multiple issues with geo-replication.  It seems to work OK initially, the replica gets up to date, and not long after (e.g. a couple of days), the replication goes into a faulty state and won't get out of it.

I have tried a few times now, and last attempt I re-created the slave volume and setup the replication again.  Same symptoms again.

I use Gluster 3.7.3, and you will find my setup and log messages at the bottom of the email.

Any idea what could cause this and how to fix it?
Thanks,

Thibault.
ps: my setup and log messages:

Master:

Volume Name: home

Type: Replicate

Volume ID: 2299a204-a1dc-449d-8556-bc65197373c7

Status: Started

Number of Bricks: 1 x 2 = 2

Transport-type: tcp

Bricks:

Brick1: server4.uberit.net:/gluster/home-brick-1

Brick2: server5.uberit.net:/gluster/home-brick-1

Options Reconfigured:

performance.readdir-ahead: on

geo-replication.indexing: on

geo-replication.ignore-pid-check: on

changelog.changelog: on

Slave:
Volume Name: homegs

Type: Distribute

Volume ID: 746dfdc3-650d-4468-9fdd-d621dd215b94

Status: Started

Number of Bricks: 1

Transport-type: tcp

Bricks:

Brick1: remoteserver1.uberit.net:/gluster/homegs-brick-1/brick

Options Reconfigured:

performance.readdir-ahead: on

The geo-replication status and config (I think I ended up with only defaults values) are:

# gluster volume geo-replication home ssh://remoteserver1::homegs status
MASTER NODE       MASTER VOL    MASTER BRICK             SLAVE USER    SLAVE                           SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED

---------------------------------------------------------------------------------------------------------------------------------------------------------

server5          home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs      N/A           Faulty    N/A             N/A

server4          home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs      N/A           Faulty    N/A             N/A

# gluster volume geo-replication home ssh://remoteserver1::homegs config

special_sync_mode: partial

state_socket_unencoded: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.socket

gluster_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.gluster.log

ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem

ignore_deletes: false

change_detector: changelog

gluster_command_dir: /usr/sbin/

georep_session_working_dir: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/

state_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.status

remote_gsyncd: /nonexistent/gsyncd

session_owner: 2299a204-a1dc-449d-8556-bc65197373c7

changelog_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-changes.log

socketdir: /var/run/gluster

working_dir: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs

state_detail_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-detail.status

ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem

pid_file: /var/lib/glusterd/geo-replication/home_remoteserver_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.pid

log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.log

gluster_params: aux-gfid-mount acl

volume_id: 2299a204-a1dc-449d-8556-bc65197373c7

The logs look like on the master on server1:

[2015-08-24 15:21:07.955600] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------

[2015-08-24 15:21:07.955883] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker

[2015-08-24 15:21:08.69528] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://root@xxxxxxxxxxxxxxxxxxxxxxxxx:gluster://localhost:homegs

[2015-08-24 15:21:08.70938] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...

[2015-08-24 15:21:11.255237] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode

[2015-08-24 15:21:11.255532] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine

[2015-08-24 15:21:11.256570] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode

[2015-08-24 15:21:11.256726] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine

[2015-08-24 15:21:11.257345] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode

[2015-08-24 15:21:11.257534] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine

[2015-08-24 15:21:13.333628] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40172.18.0.169%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync

[2015-08-24 15:21:13.333870] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073

[2015-08-24 15:21:13.401132] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...

[2015-08-24 15:21:13.412795] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds

[2015-08-24 15:21:13.427340] I [master(/gluster/home-brick-1):1127:crawl] _GMaster: starting history crawl... turns: 1, stime: (1440411353, 0)

[2015-08-24 15:21:14.432327] I [master(/gluster/home-brick-1):1156:crawl] _GMaster: slave's time: (1440411353, 0)

[2015-08-24 15:21:14.890889] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 20960:140215190427392:1440426074.56 (entry_ops) failed on peer with OSError

[2015-08-24 15:21:14.891124] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:

Traceback (most recent call last):

  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main

    main_i()

  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i

    local.service_loop(*[r for r in [remote] if r])

  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop

    g3.crawlwrap(_oneshot_=True)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 584, in crawlwrap

    self.crawl()

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1165, in crawl

    self.changelogs_batch_process(changes)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1074, in changelogs_batch_process

    self.process(batch)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 952, in process

    self.process_change(change, done, retry)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 907, in process_change

    failures = self.slave.server.entry_ops(entries)

  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__

    return self.ins(self.meth, *a)

  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__

    raise res

OSError: [Errno 5] Input/output error

[2015-08-24 15:21:14.892291] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.

[2015-08-24 15:21:14.893665] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.

[2015-08-24 15:21:14.893879] I [syncdutils(agent):220:finalize] <top>: exiting.

[2015-08-24 15:21:15.259360] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase

and on master server2:

[2015-08-24 15:21:07.650707] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------

[2015-08-24 15:21:07.651144] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker

[2015-08-24 15:21:07.764817] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://root@xxxxxxxxxxxxxxxxxxxxxxxx:gluster://localhost:homegs

[2015-08-24 15:21:07.768552] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...

[2015-08-24 15:21:11.9820] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode

[2015-08-24 15:21:11.10199] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine

[2015-08-24 15:21:11.10946] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode

[2015-08-24 15:21:11.11115] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine

[2015-08-24 15:21:11.11744] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode

[2015-08-24 15:21:11.11933] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine

[2015-08-24 15:21:13.59192] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync

[2015-08-24 15:21:13.59454] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073

[2015-08-24 15:21:13.113203] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...

[2015-08-24 15:21:13.122018] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds

[2015-08-24 15:21:23.209912] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 1561:140164806457088:1440426083.11 (keep_alive) failed on peer with OSError

[2015-08-24 15:21:23.210119] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:

Traceback (most recent call last):

  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap

    tf(*aa)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 438, in keep_alive

    cls.slave.server.keep_alive(vi)

  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__

    return self.ins(self.meth, *a)

  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__

    raise res

OSError: [Errno 22] Invalid argument

[2015-08-24 15:21:23.210975] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.

[2015-08-24 15:21:23.212455] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.

[2015-08-24 15:21:23.212707] I [syncdutils(agent):220:finalize] <top>: exiting.

[2015-08-24 15:21:24.23336] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase

and on the slave (in a different timezone, one hour behind):

[2015-08-24 14:22:02.923098] I [resource(slave):844:service_loop] GLUSTER: slave listening

[2015-08-24 14:22:07.606774] E [repce(slave):117:worker] <top>: call failed:

Traceback (most recent call last):

  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker

    res = getattr(self.obj, rmeth)(*in_data[2:])

  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 731, in entry_ops

    [ESTALE, EINVAL])

  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap

    return call(*arg)

  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr

    cls.raise_oserr()

  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr

    raise OSError(errn, os.strerror(errn))

OSError: [Errno 5] Input/output error

[2015-08-24 14:22:07.652092] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.

[2015-08-24 14:22:07.652364] I [syncdutils(slave):220:finalize] <top>: exiting.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users