Geo-replication gets faulty after first sync of files. GlusterFS 3.7.4

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I have a long time war with geo-replications. First two battles were won: 
1. geo-replication on a cluster with glusterfs-3.4.1 (Centos 6.6, xfs: server1-server2 => geo-replica at east-server1 )
2. One volume named infra-geo geo-replicated well on a 2nd cluster with glusterfs-3.7.4 (Centos 7.1, ext4: server3-server4 => geo-replica at east-server2)

The last battle lasts for two months, and I need your help to win it :)

Georeplication of second volume, named storage-2 getting faulty after initial sync of data. It could sync all files (620GB last time), and became falty. Restart of every and each gluster service won't affect the result, status is the same - faulty. 
The second volume named storage-2 geo-replicated also on a 2nd cluster with glusterfs-3.7.4 (Centos 7.1, ext4: server3-server4 => geo-replica at east-server2)

The noticeble things:
(date of creating geo-replica: '2015-10-16 16:03')
1. /var/log/glusterfs/geo-replication/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2.%2Fmnt%2Fsda4%2Fbrick-storage-3-changes.log-20151018 is ordianry and has length of 8.8 MB

The current changelog started at '2015-10-18 17:43:17.117287' has 9.7GB, 34909550 lines of: 
E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f885735cbb2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f885733f55f] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f8857371751] ) 0-mem-pool: invalid argument [Invalid argument]

Only 22MB, has something more uniq:
[2015-10-19 16:42:42.652110] I [gf-changelog.c:542:gf_changelog_register_generic] 0-gfchangelog: Registering brick: /mnt/sda4/brick-storage-3 [notify filter: 1]
[2015-10-19 16:42:44.106591] T [rpcsvc.c:2298:rpcsvc_init] 0-rpc-service: rx pool: 8
[2015-10-19 16:42:44.106643] T [rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS
[2015-10-19 16:42:44.106648] T [rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS-v2
[2015-10-19 16:42:44.106656] T [rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_UNIX
[2015-10-19 16:42:44.106661] T [rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_NULL
[2015-10-19 16:42:44.106664] D [rpcsvc.c:2317:rpcsvc_init] 0-rpc-service: RPC service inited.
[2015-10-19 16:42:44.106673] D [rpcsvc.c:1874:rpcsvc_program_register] 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 1, Port: 0
[2015-10-19 16:42:44.106687] D [rpc-transport.c:288:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/3.7.4/rpc-transport/socket.so
[2015-10-19 16:42:44.107068] D [socket.c:3794:socket_init] 0-socket.gfchangelog: disabling nodelay
[2015-10-19 16:42:44.107076] D [socket.c:3845:socket_init] 0-socket.gfchangelog: Configued transport.tcp-user-timeout=0
[2015-10-19 16:42:44.107081] D [socket.c:3928:socket_init] 0-socket.gfchangelog: SSL support on the I/O path is NOT enabled
[2015-10-19 16:42:44.107084] D [socket.c:3931:socket_init] 0-socket.gfchangelog: SSL support for glusterd is NOT enabled
[2015-10-19 16:42:44.107093] D [socket.c:3948:socket_init] 0-socket.gfchangelog: using system polling thread
[2015-10-19 16:42:44.107159] D [rpcsvc.c:1874:rpcsvc_program_register] 0-rpc-service: New program registered: LIBGFCHANGELOG REBORP, Num: 1886350951, Ver: 1, Port: 0
[2015-10-19 16:42:44.107182] D [rpc-clnt.c:989:rpc_clnt_connection_init] 0-gfchangelog: defaulting frame-timeout to 30mins
[2015-10-19 16:42:44.107192] D [rpc-clnt.c:1003:rpc_clnt_connection_init] 0-gfchangelog: disable ping-timeout
[2015-10-19 16:42:44.107200] D [rpc-transport.c:288:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/3.7.4/rpc-transport/socket.so
[2015-10-19 16:42:44.107233] D [socket.c:3794:socket_init] 0-gfchangelog: disabling nodelay
[2015-10-19 16:42:44.107238] D [socket.c:3845:socket_init] 0-gfchangelog: Configued transport.tcp-user-timeout=0
[2015-10-19 16:42:44.107242] D [socket.c:3928:socket_init] 0-gfchangelog: SSL support on the I/O path is NOT enabled
[2015-10-19 16:42:44.107248] D [socket.c:3931:socket_init] 0-gfchangelog: SSL support for glusterd is NOT enabled
[2015-10-19 16:42:44.107255] D [socket.c:3948:socket_init] 0-gfchangelog: using system polling thread
[2015-10-19 16:42:44.107263] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gfchangelog: attempting reconnect
[2015-10-19 16:42:44.107267] T [socket.c:2887:socket_connect] 0-gfchangelog: connecting 0x7f2dbc08c7e0, state=0 gen=0 sock=-1
[2015-10-19 16:42:44.107272] T [name.c:295:af_unix_client_get_remote_sockaddr] 0-gfchangelog: using connect-path /var/run/gluster/changelog-535626c1c4cea4957a54c920f955b1ac.sock
[2015-10-19 16:42:44.107284] T [name.c:111:af_unix_client_bind] 0-gfchangelog: bind-path not specified for unix socket, letting connect to assign default value
[2015-10-19 16:42:46.107494] T [rpc-clnt.c:1406:rpc_clnt_record] 0-gfchangelog: Auth Info: pid: 0, uid: 0, gid: 0, owner:
[2015-10-19 16:42:46.107543] T [rpc-clnt.c:1263:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 500, payload: 436, rpc hdr: 64
[2015-10-19 16:42:46.107589] T [rpc-clnt.c:1600:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x1 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
[2015-10-19 16:42:46.107618] D [rpc-clnt-ping.c:281:rpc_clnt_start_ping] 0-gfchangelog: ping timeout is 0, returning
[2015-10-19 16:42:46.109489] T [rpc-clnt.c:663:rpc_clnt_reply_init] 0-gfchangelog: received rpc message (RPC XID: 0x1 Program: LIBGFCHANGELOG, ProgVers: 1, Proc: 1) from rpc-transport (gfchangelog)
[2015-10-19 16:42:46.146891] I [gf-history-changelog.c:760:gf_changelog_extract_min_max] 0-gfchangelog: MIN: 1439390689, MAX: 1445272962, TOTAL CHANGELOGS: 391739
[2015-10-19 16:42:46.147039] I [gf-history-changelog.c:907:gf_history_changelog] 0-gfchangelog: FINAL: from: 1445023075, to: 1445272962, changes: 16643
[2015-10-19 16:42:46.148604] D [gf-history-changelog.c:298:gf_history_changelog_scan] 0-gfchangelog: hist_done 1, is_last_scan: 0


 

2.  /var/log/glusterfs/geo-replication/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2.log 
[2015-10-18 10:42:44.890366] W [master(/mnt/sda4/brick-storage-3):1041:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1445022009 CHANGELOG.1445022024 CHANGELOG.1445022039 CHANGELOG.1445022054 CHANGELOG.1445022069 CHAN
GELOG.1445022084 CHANGELOG.1445022099 CHANGELOG.1445022114 CHANGELOG.1445022129 CHANGELOG.1445022144 CHANGELOG.1445022159 CHANGELOG.1445022174 CHANGELOG.1445022189 CHANGELOG.1445022204 CHANGELOG.1445022219 CHANGELOG.1445022234 CHANGELOG.
1445022249 CHANGELOG.1445022264 CHANGELOG.1445022279 CHANGELOG.1445022294 CHANGELOG.1445022309 CHANGELOG.1445022324 CHANGELOG.1445022339 CHANGELOG.1445022354 CHANGELOG.1445022369 CHANGELOG.1445022384 CHANGELOG.1445022399 CHANGELOG.144502
2414 CHANGELOG.1445022429 CHANGELOG.1445022444 CHANGELOG.1445022459 CHANGELOG.1445022474 CHANGELOG.1445022489 CHANGELOG.1445022504 CHANGELOG.1445022519 CHANGELOG.1445022534 CHANGELOG.1445022549 CHANGELOG.1445022564 CHANGELOG.1445022579 C
HANGELOG.1445022594 CHANGELOG.1445022609 CHANGELOG.1445022624 CHANGELOG.1445022639 CHANGELOG.1445022654 CHANGELOG.1445022669 CHANGELOG.1445022684 CHANGELOG.1445022699 CHANGELOG.1445022714 CHANGELOG.1445022729 CHANGELOG.1445022744 CHANGEL
OG.1445022759 CHANGELOG.1445022774 CHANGELOG.1445022789 CHANGELOG.1445022805 CHANGELOG.1445022820 CHANGELOG.1445022835 CHANGELOG.1445022850 CHANGELOG.1445022865 CHANGELOG.1445022880 CHANGELOG.1445022895 CHANGELOG.1445022910 CHANGELOG.144
5022925 CHANGELOG.1445022940 CHANGELOG.1445022955 CHANGELOG.1445022970 CHANGELOG.1445022985 CHANGELOG.1445023000 CHANGELOG.1445023015 CHANGELOG.1445023030 CHANGELOG.1445023045 CHANGELOG.1445023060 CHANGELOG.1445023075 CHANGELOG.144502309
0 CHANGELOG.1445023105 CHANGELOG.1445023120 CHANGELOG.1445023135 CHANGELOG.1445023150 CHANGELOG.1445023165 CHANGELOG.1445023180 CHANGELOG.1445023195 CHANGELOG.1445023210 CHANGELOG.1445023225 CHANGELOG.1445023240 CHANGELOG.1445023255 CHAN
GELOG.1445023270 CHANGELOG.1445023285 CHANGELOG.1445023300 CHANGELOG.1445023315 CHANGELOG.1445023330 CHANGELOG.1445023345 CHANGELOG.1445023360 CHANGELOG.1445023375 CHANGELOG.1445023390 CHANGELOG.1445023405 CHANGELOG.1445023420 CHANGELOG.
1445023435 CHANGELOG.1445023450 CHANGELOG.1445023465 CHANGELOG.1445023480 CHANGELOG.1445023495 CHANGELOG.1445023510 CHANGELOG.1445023525 CHANGELOG.1445023540 CHANGELOG.1445023555 CHANGELOG.1445023570 CHANGELOG.1445023585 CHANGELOG.144502
3600 CHANGELOG.1445023615 CHANGELOG.1445023630 CHANGELOG.1445023645 CHANGELOG.1445023660 CHANGELOG.1445023675 CHANGELOG.1445023690 CHANGELOG.1445023705 CHANGELOG.1445023720 CHANGELOG.1445023735 CHANGELOG.1445023750 CHANGELOG.1445023765 C
HANGELOG.1445023780 CHANGELOG.1445023796 CHANGELOG.1445023811 CHANGELOG.1445023826 CHANGELOG.1445023841 CHANGELOG.1445023856 CHANGELOG.1445023871 CHANGELOG.1445023886 CHANGELOG.1445023901 CHANGELOG.1445023916 CHANGELOG.1445023931 CHANGEL
OG.1445023946 CHANGELOG.1445023961 CHANGELOG.1445023976 CHANGELOG.1445023991 CHANGELOG.1445024006 CHANGELOG.1445024021 CHANGELOG.1445024036 CHANGELOG.1445024051 CHANGELOG.1445024066 CHANGELOG.1445024081 CHANGELOG.1445024096 CHANGELOG.144
5024111 CHANGELOG.1445024126 CHANGELOG.1445024141 CHANGELOG.1445024156 CHANGELOG.1445024171 CHANGELOG.1445024186 CHANGELOG.1445024201 CHANGELOG.1445024216 CHANGELOG.1445024231 CHANGELOG.1445024246 CHANGELOG.1445024261 CHANGELOG.144502427
6 CHANGELOG.1445024291 CHANGELOG.1445024306 CHANGELOG.1445024321 CHANGELOG.1445024336 CHANGELOG.1445024351 CHANGELOG.1445024366 CHANGELOG.1445024381 CHANGELOG.1445024396 CHANGELOG.1445024411
[2015-10-18 10:43:02.808736] E [repce(/mnt/sda4/brick-storage-3):207:__call__] RepceClient: call 24010:140490734180160:1445190182.72 (entry_ops) failed on peer with OSError
[2015-10-18 10:43:02.809122] E [syncdutils(/mnt/sda4/brick-storage-3):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1451, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 591, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1106, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1081, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 959, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 914, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 16] Device or resource busy
[2015-10-18 10:43:02.811510] I [syncdutils(/mnt/sda4/brick-storage-3):220:finalize] <top>: exiting.
[2015-10-18 10:43:02.819149] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-10-18 10:43:02.819616] I [syncdutils(agent):220:finalize] <top>: exiting.

Since that I get the repeating error:
[2015-10-18 10:43:13.73946] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
[2015-10-18 10:43:13.74423] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-10-18 10:43:13.177963] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-10-18 10:43:13.178881] I [gsyncd(/mnt/sda4/brick-storage-3):649:main_i] <top>: syncing: gluster://localhost:storage-2 -> ssh://root@east-storage2:gluster://localhost:storage-2
[2015-10-18 10:43:17.93168] I [master(/mnt/sda4/brick-storage-3):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-10-18 10:43:17.93669] I [master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-10-18 10:43:17.95079] I [master(/mnt/sda4/brick-storage-3):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-10-18 10:43:17.95390] I [master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-10-18 10:43:17.96730] I [master(/mnt/sda4/brick-storage-3):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-10-18 10:43:17.97048] I [master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-10-18 10:43:19.915657] I [master(/mnt/sda4/brick-storage-3):1220:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2/535626c1c4cea4957a
54c920f955b1ac/xsync
[2015-10-18 10:43:19.916011] I [resource(/mnt/sda4/brick-storage-3):1432:service_loop] GLUSTER: Register time: 1445190199
[2015-10-18 10:43:19.935817] I [master(/mnt/sda4/brick-storage-3):530:crawlwrap] _GMaster: primary master with volume id 054d6225-39fc-40f5-9604-5ee4b6dcd8b4 ...
[2015-10-18 10:43:19.938358] I [master(/mnt/sda4/brick-storage-3):539:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-10-18 10:43:19.942023] I [master(/mnt/sda4/brick-storage-3):1135:crawl] _GMaster: starting history crawl... turns: 1, stime: (1445021993, 0), etime: 1445190199
[2015-10-18 10:43:20.975270] I [master(/mnt/sda4/brick-storage-3):1164:crawl] _GMaster: slave's time: (1445021993, 0)
[2015-10-18 10:43:38.50978] E [repce(/mnt/sda4/brick-storage-3):207:__call__] RepceClient: call 13364:140337327839040:1445190217.96 (entry_ops) failed on peer with OSError
[2015-10-18 10:43:38.51417] E [syncdutils(/mnt/sda4/brick-storage-3):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop
    g3.crawlwrap(_oneshot_=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 591, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1173, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1081, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 959, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 914, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 16] Device or resource busy
[2015-10-18 10:43:38.53669] I [syncdutils(/mnt/sda4/brick-storage-3):220:finalize] <top>: exiting.
[2015-10-18 10:43:38.56446] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-10-18 10:43:38.56844] I [syncdutils(agent):220:finalize] <top>: exiting.

Any suggestions would be very appreciated, thanks

With regards to all community,
--
Rodion


_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users

[Index of Archives]     [Gluster Development]     [Linux Filesytems Development]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux