Re: glusterfs geo-replication

Hi,

I am still getting the same issue. Looking further into it, I have modified syncdutils.py slightly on the slave node to print more information about the nature of this OSError:

[2015-09-30 12:19:25.8531] I [gsyncd(slave):649:main_i] <top>: syncing: gluster://localhost:homegs
[2015-09-30 12:19:26.79925] I [resource(slave):844:service_loop] GLUSTER: slave listening
[2015-09-30 12:19:30.723542] W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for <built-in function rename>(['.gfid/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt.tmp', '.gfid/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt']) from /...[Errno 16] Device or resource busy
[2015-09-30 12:19:30.723845] I [syncdutils(slave):483:errno_wrap] <top>: we could try deleting '.glusterfs/a7/80/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt.tmp'
[2015-09-30 12:19:30.723974] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 725, in entry_ops
    [ENOENT, EEXIST], [ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 16] Device or resource busy
[2015-09-30 12:19:30.770679] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-09-30 12:19:30.770944] I [syncdutils(slave):220:finalize] <top>: exiting.
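For reference, the change on the slave is nothing clever: just a couple of extra log calls in errno_wrap() in /usr/libexec/glusterfs/python/syncdaemon/syncdutils.py, right before the exception is re-raised. It is paraphrased below from memory rather than as a proper diff; everything except the lines marked 'added' is essentially the stock 3.7.3 function, and backend_hint() is just a little helper of my own to guess the .glusterfs/ path to go and look at:

import sys
import time
import logging

GF_OP_RETRIES = 20  # stock constant in syncdutils.py (value from memory)

def backend_hint(aux_path):
    # added helper: map '.gfid/<parent-gfid>/<name>' (as seen on the
    # aux gfid mount) to the corresponding .glusterfs/ entry on the brick
    parts = aux_path.split('/')
    gfid, name = parts[1], parts[-1]
    return '.glusterfs/%s/%s/%s/%s' % (gfid[0:2], gfid[2:4], gfid, name)

def errno_wrap(call, arg=[], errnos=[], retry_errnos=[]):
    # wrapper around calls resilient to certain errnos (stock logic)
    nr_tries = 0
    while True:
        try:
            return call(*arg)
        except OSError:
            ex = sys.exc_info()[1]
            if ex.errno in errnos:
                return ex.errno
            if ex.errno not in retry_errnos:
                # added: log exactly which call and arguments blew up
                logging.warn('about to raise exception for %s(%s) from %s'
                             % (call, repr(arg), ex))
                # added: suggest the backend entry I then go and inspect
                if arg and str(arg[0]).startswith('.gfid/'):
                    logging.info("we could try deleting '%s'"
                                 % backend_hint(arg[0]))
                raise
            nr_tries += 1
            if nr_tries == GF_OP_RETRIES:
                logging.warn('reached maximum retries (%s)...%s'
                             % (repr(arg), ex))
                return
            time.sleep(0.250)  # retry after a short pause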

Below is a list of the rename errors I have hit so far:

W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for <built-in function rename>(['.gfid/93b6282a-a47f-40fb-b83e-efec5b9c1a4e/.viminfo.tmp', '.gfid/93b6282a-a47f-40fb-b83e-efec5b9c1a4e/.viminfo']) from /...[Errno 16] Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for <built-in function rename>(['.gfid/9c4f77d0-0855-4607-8d65-1a52b3cdbbce/gr_rsync_homes_new.sh', '.gfid/9c4f77d0-0855-4607-8d65-1a52b3cdbbce/gr_rsync_homes_new.sh~']) from /...[Errno 16] Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for <built-in function rename>(['.gfid/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt.tmp', '.gfid/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt']) from /...[Errno 16] Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for <built-in function rename>(['.gfid/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708/depend.internal.tmp', '.gfid/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708/depend.internal']) from /...[Errno 16] Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for <built-in function rename>(['.gfid/f5491794-32e2-46f4-8c50-7ca31e34111f/gluster-volume-info.debug.pl', '.gfid/f5491794-32e2-46f4-8c50-7ca31e34111f/gluster-volume-info.debug.pl~'])...[Errno 16] Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for <built-in function rename>(['.gfid/f5491794-32e2-46f4-8c50-7ca31e34111f/gluster-volume-info.debug.pl', '.gfid/f5491794-32e2-46f4-8c50-7ca31e34111f/gluster-volume-info.debug.pl~']) from /...[Errno 16] Device or resource busy

I am not entirely clear why rename() would return 'Device or resource busy' here. Also, I'm not sure what these '.gfid' paths are exactly: are they something transient? Where do they live exactly?

I tried deleting the affected files (e.g. a78e4abc-1de4-4f0d-a5e2-1244a0e3b708/depend.internal.tmp) in .glusterfs/, and so far that allowed the geo-replication to resume… until it hits another one of these.
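For what it's worth, this is roughly how I map a gfid from the log to the entry I look at (and sometimes delete) on the brick. It is a quick helper of my own, not anything shipped with gluster; BRICK is just our brick root, and the .glusterfs/<aa>/<bb>/<gfid> layout is the standard backend one:

# quick helper (my own, not part of gluster) to locate and inspect the
# backend entry for a gfid on the brick; usage: python gfid_peek.py <gfid>
import os
import sys

BRICK = '/gluster/homegs-brick-1/brick'   # our slave brick root

def gfid_backend(gfid, brick=BRICK):
    # backend entries live under .glusterfs/<first 2 hex chars>/<next 2>/<gfid>
    return os.path.join(brick, '.glusterfs', gfid[0:2], gfid[2:4], gfid)

if __name__ == '__main__':
    path = gfid_backend(sys.argv[1])
    if os.path.islink(path):
        # directories: the gfid entry is a symlink to the real directory
        state = 'ok' if os.path.exists(path) else 'BROKEN'
        print('%s -> %s (%s)' % (path, os.readlink(path), state))
    elif os.path.exists(path):
        # files: the gfid entry is a hard link to the real file
        print('%s (%d links)' % (path, os.lstat(path).st_nlink))
    else:
        print('%s does not exist' % path)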

In some cases I found that this gfid points to a directory that doesn’t exist, e.g.:

ls -alh /gluster/homegs-brick-1/brick/.glusterfs/a7/8e/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708

lrwxrwxrwx 1 root root 71 Sep 30 11:26 /gluster/homegs-brick-1/brick/.glusterfs/a7/8e/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708 -> ../../0d/be/0dbe0730-5ba9-45d5-91e5-aeb069ab08bf/watchdog_unit_test.dir

but the link is broken.

In that case, deleting the /gluster/homegs-brick-1/brick/.glusterfs/a7/8e/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708 symlink allowed the geo-replication to carry on.

Obviously this isn't workable if I have to do it 50 times a day…
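If it keeps happening I'll probably end up scripting the sweep rather than chasing them one by one; something along these lines (an untested sketch of my own, report-only unless --delete is passed, BRICK being the slave brick path):

# untested sketch: walk .glusterfs/ on the slave brick and report
# (optionally remove) gfid symlinks whose target no longer exists,
# i.e. broken directory links like the one above
import os
import sys

BRICK = '/gluster/homegs-brick-1/brick'

def broken_gfid_links(brick):
    root = os.path.join(brick, '.glusterfs')
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # a broken symlink: lstat sees a link but the target is gone
            if os.path.islink(path) and not os.path.exists(path):
                yield path

if __name__ == '__main__':
    delete = '--delete' in sys.argv[1:]
    for link in broken_gfid_links(BRICK):
        print('%s -> %s' % (link, os.readlink(link)))
        if delete:
            os.unlink(link)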

Any idea what could cause this and how to fix it properly?

Thx,
Thib.

On 26 Aug 2015 12:58 pm, "Thibault Godouet" <tibo92@xxxxxxxxxxx> wrote:

Hi,

Given that this is an 'OSError', I should probably have mentioned that we use Red Hat 6.6 64-bit and XFS as the brick filesystem.

Does anyone have any input on how to troubleshoot this?

Thanks,
Thibault.

On 24 Aug 2015 4:03 pm, "Thibault Godouet" <Thibault.Godouet@xxxxxxxxxxxxxxx> wrote:

I have had multiple issues with geo-replication. It seems to work OK initially: the replica gets up to date, but not long after (e.g. a couple of days) the replication goes into a Faulty state and won't get out of it.


I have tried a few times now; on the last attempt I re-created the slave volume and set up the replication again. Same symptoms.

 

I use Gluster 3.7.3, and you will find my setup and log messages at the bottom of the email.

 

Any idea what could cause this and how to fix it?

Thanks,
Thibault.

ps: my setup and log messages:

 

Master:

 

Volume Name: home
Type: Replicate
Volume ID: 2299a204-a1dc-449d-8556-bc65197373c7
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: server4.uberit.net:/gluster/home-brick-1
Brick2: server5.uberit.net:/gluster/home-brick-1
Options Reconfigured:
performance.readdir-ahead: on
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

 

Slave:

Volume Name: homegs
Type: Distribute
Volume ID: 746dfdc3-650d-4468-9fdd-d621dd215b94
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: remoteserver1.uberit.net:/gluster/homegs-brick-1/brick
Options Reconfigured:
performance.readdir-ahead: on

 

The geo-replication status and config (I think I ended up with only default values) are:

 

# gluster volume geo-replication home ssh://remoteserver1::homegs status

MASTER NODE       MASTER VOL    MASTER BRICK             SLAVE USER    SLAVE                           SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------
server5          home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs      N/A           Faulty    N/A             N/A
server4          home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs      N/A           Faulty    N/A             N/A

 

# gluster volume geo-replication home ssh://remoteserver1::homegs config
special_sync_mode: partial
state_socket_unencoded: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.socket
gluster_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ignore_deletes: false
change_detector: changelog
gluster_command_dir: /usr/sbin/
georep_session_working_dir: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/
state_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.status
remote_gsyncd: /nonexistent/gsyncd
session_owner: 2299a204-a1dc-449d-8556-bc65197373c7
changelog_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-changes.log
socketdir: /var/run/gluster
working_dir: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs
state_detail_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-detail.status
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
pid_file: /var/lib/glusterd/geo-replication/home_remoteserver_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.pid
log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.log
gluster_params: aux-gfid-mount acl
volume_id: 2299a204-a1dc-449d-8556-bc65197373c7

 

The logs on the master server1 look like this:

 

[2015-08-24 15:21:07.955600] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
[2015-08-24 15:21:07.955883] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-08-24 15:21:08.69528] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://root@xxxxxxxxxxxxxxxxxxxxxxxxx:gluster://localhost:homegs
[2015-08-24 15:21:08.70938] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-08-24 15:21:11.255237] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-08-24 15:21:11.255532] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.256570] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-08-24 15:21:11.256726] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.257345] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-08-24 15:21:11.257534] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:13.333628] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40172.18.0.169%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
[2015-08-24 15:21:13.333870] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073
[2015-08-24 15:21:13.401132] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
[2015-08-24 15:21:13.412795] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-08-24 15:21:13.427340] I [master(/gluster/home-brick-1):1127:crawl] _GMaster: starting history crawl... turns: 1, stime: (1440411353, 0)
[2015-08-24 15:21:14.432327] I [master(/gluster/home-brick-1):1156:crawl] _GMaster: slave's time: (1440411353, 0)
[2015-08-24 15:21:14.890889] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 20960:140215190427392:1440426074.56 (entry_ops) failed on peer with OSError
[2015-08-24 15:21:14.891124] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop
    g3.crawlwrap(_oneshot_=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 584, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1165, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1074, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 952, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 907, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error
[2015-08-24 15:21:14.892291] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
[2015-08-24 15:21:14.893665] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-24 15:21:14.893879] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-08-24 15:21:15.259360] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase

 

and on master server2:

 

[2015-08-24 15:21:07.650707] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
[2015-08-24 15:21:07.651144] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-08-24 15:21:07.764817] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://root@xxxxxxxxxxxxxxxxxxxxxxxx:gluster://localhost:homegs
[2015-08-24 15:21:07.768552] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-08-24 15:21:11.9820] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-08-24 15:21:11.10199] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.10946] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-08-24 15:21:11.11115] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.11744] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-08-24 15:21:11.11933] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:13.59192] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
[2015-08-24 15:21:13.59454] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073
[2015-08-24 15:21:13.113203] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
[2015-08-24 15:21:13.122018] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-08-24 15:21:23.209912] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 1561:140164806457088:1440426083.11 (keep_alive) failed on peer with OSError
[2015-08-24 15:21:23.210119] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 438, in keep_alive
    cls.slave.server.keep_alive(vi)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 22] Invalid argument
[2015-08-24 15:21:23.210975] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
[2015-08-24 15:21:23.212455] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-24 15:21:23.212707] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-08-24 15:21:24.23336] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase

 

and on the slave (in a different timezone, one hour behind):

 

[2015-08-24 14:22:02.923098] I [resource(slave):844:service_loop] GLUSTER: slave listening
[2015-08-24 14:22:07.606774] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 731, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error
[2015-08-24 14:22:07.652092] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-24 14:22:07.652364] I [syncdutils(slave):220:finalize] <top>: exiting.


 

