I have a long time war with geo-replications. First two
battles were won:
1. geo-replication on a cluster with glusterfs-3.4.1
(Centos 6.6, xfs: server1-server2 => geo-replica at
east-server1 )
2. One volume named infra-geo geo-replicated well on a 2nd
cluster with glusterfs-3.7.4 (Centos 7.1, ext4:
server3-server4 => geo-replica at east-server2)
The last battle lasts for two months, and I need your help
to win it :)
Georeplication of second volume, named storage-2 getting
faulty after initial sync of data. It could sync all files
(620GB last time), and became falty. Restart of every and each
gluster service won't affect the result, status is the same -
faulty.
The second volume named storage-2 geo-replicated also on a
2nd cluster with glusterfs-3.7.4 (Centos 7.1, ext4:
server3-server4 => geo-replica at east-server2)
The noticeble things:
(date of creating geo-replica: '2015-10-16 16:03')
1.
/var/log/glusterfs/geo-replication/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2.%2Fmnt%2Fsda4%2Fbrick-storage-3-changes.log-20151018
is ordianry and has length of 8.8 MB
The current changelog started at '2015-10-18
17:43:17.117287' has 9.7GB, 34909550 lines of:
E [mem-pool.c:417:mem_get0]
(-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192)
[0x7f885735cbb2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f)
[0x7f885733f55f] -->/lib64/libglusterfs.so.0(mem_get0+0x81)
[0x7f8857371751] ) 0-mem-pool: invalid argument [Invalid
argument]
Only 22MB, has something more uniq:
[2015-10-19 16:42:42.652110] I
[gf-changelog.c:542:gf_changelog_register_generic]
0-gfchangelog: Registering brick: /mnt/sda4/brick-storage-3
[notify filter: 1]
[2015-10-19 16:42:44.106591] T [rpcsvc.c:2298:rpcsvc_init]
0-rpc-service: rx pool: 8
[2015-10-19 16:42:44.106643] T
[rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service:
Authentication enabled: AUTH_GLUSTERFS
[2015-10-19 16:42:44.106648] T
[rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service:
Authentication enabled: AUTH_GLUSTERFS-v2
[2015-10-19 16:42:44.106656] T
[rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service:
Authentication enabled: AUTH_UNIX
[2015-10-19 16:42:44.106661] T
[rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service:
Authentication enabled: AUTH_NULL
[2015-10-19 16:42:44.106664] D [rpcsvc.c:2317:rpcsvc_init]
0-rpc-service: RPC service inited.
[2015-10-19 16:42:44.106673] D
[rpcsvc.c:1874:rpcsvc_program_register] 0-rpc-service: New
program registered: GF-DUMP, Num: 123451501, Ver: 1, Port: 0
[2015-10-19 16:42:44.106687] D
[rpc-transport.c:288:rpc_transport_load] 0-rpc-transport:
attempt to load file
/usr/lib64/glusterfs/3.7.4/rpc-transport/socket.so
[2015-10-19 16:42:44.107068] D [socket.c:3794:socket_init]
0-socket.gfchangelog: disabling nodelay
[2015-10-19 16:42:44.107076] D [socket.c:3845:socket_init]
0-socket.gfchangelog: Configued transport.tcp-user-timeout=0
[2015-10-19 16:42:44.107081] D [socket.c:3928:socket_init]
0-socket.gfchangelog: SSL support on the I/O path is NOT
enabled
[2015-10-19 16:42:44.107084] D [socket.c:3931:socket_init]
0-socket.gfchangelog: SSL support for glusterd is NOT enabled
[2015-10-19 16:42:44.107093] D [socket.c:3948:socket_init]
0-socket.gfchangelog: using system polling thread
[2015-10-19 16:42:44.107159] D
[rpcsvc.c:1874:rpcsvc_program_register] 0-rpc-service: New
program registered: LIBGFCHANGELOG REBORP, Num: 1886350951,
Ver: 1, Port: 0
[2015-10-19 16:42:44.107182] D
[rpc-clnt.c:989:rpc_clnt_connection_init] 0-gfchangelog:
defaulting frame-timeout to 30mins
[2015-10-19 16:42:44.107192] D
[rpc-clnt.c:1003:rpc_clnt_connection_init] 0-gfchangelog:
disable ping-timeout
[2015-10-19 16:42:44.107200] D
[rpc-transport.c:288:rpc_transport_load] 0-rpc-transport:
attempt to load file
/usr/lib64/glusterfs/3.7.4/rpc-transport/socket.so
[2015-10-19 16:42:44.107233] D [socket.c:3794:socket_init]
0-gfchangelog: disabling nodelay
[2015-10-19 16:42:44.107238] D [socket.c:3845:socket_init]
0-gfchangelog: Configued transport.tcp-user-timeout=0
[2015-10-19 16:42:44.107242] D [socket.c:3928:socket_init]
0-gfchangelog: SSL support on the I/O path is NOT enabled
[2015-10-19 16:42:44.107248] D [socket.c:3931:socket_init]
0-gfchangelog: SSL support for glusterd is NOT enabled
[2015-10-19 16:42:44.107255] D [socket.c:3948:socket_init]
0-gfchangelog: using system polling thread
[2015-10-19 16:42:44.107263] T
[rpc-clnt.c:418:rpc_clnt_reconnect] 0-gfchangelog: attempting
reconnect
[2015-10-19 16:42:44.107267] T
[socket.c:2887:socket_connect] 0-gfchangelog: connecting
0x7f2dbc08c7e0, state=0 gen=0 sock=-1
[2015-10-19 16:42:44.107272] T
[name.c:295:af_unix_client_get_remote_sockaddr] 0-gfchangelog:
using connect-path
/var/run/gluster/changelog-535626c1c4cea4957a54c920f955b1ac.sock
[2015-10-19 16:42:44.107284] T
[name.c:111:af_unix_client_bind] 0-gfchangelog: bind-path not
specified for unix socket, letting connect to assign default
value
[2015-10-19 16:42:46.107494] T
[rpc-clnt.c:1406:rpc_clnt_record] 0-gfchangelog: Auth Info:
pid: 0, uid: 0, gid: 0, owner:
[2015-10-19 16:42:46.107543] T
[rpc-clnt.c:1263:rpc_clnt_record_build_header] 0-rpc-clnt:
Request fraglen 500, payload: 436, rpc hdr: 64
[2015-10-19 16:42:46.107589] T
[rpc-clnt.c:1600:rpc_clnt_submit] 0-rpc-clnt: submitted
request (XID: 0x1 Program: LIBGFCHANGELOG, ProgVers: 1, Proc:
1) to rpc-transport (gfchangelog)
[2015-10-19 16:42:46.107618] D
[rpc-clnt-ping.c:281:rpc_clnt_start_ping] 0-gfchangelog: ping
timeout is 0, returning
[2015-10-19 16:42:46.109489] T
[rpc-clnt.c:663:rpc_clnt_reply_init] 0-gfchangelog: received
rpc message (RPC XID: 0x1 Program: LIBGFCHANGELOG, ProgVers:
1, Proc: 1) from rpc-transport (gfchangelog)
[2015-10-19 16:42:46.146891] I
[gf-history-changelog.c:760:gf_changelog_extract_min_max]
0-gfchangelog: MIN: 1439390689, MAX: 1445272962, TOTAL
CHANGELOGS: 391739
[2015-10-19 16:42:46.147039] I
[gf-history-changelog.c:907:gf_history_changelog]
0-gfchangelog: FINAL: from: 1445023075, to: 1445272962,
changes: 16643
[2015-10-19 16:42:46.148604] D
[gf-history-changelog.c:298:gf_history_changelog_scan]
0-gfchangelog: hist_done 1, is_last_scan: 0
2.
/var/log/glusterfs/geo-replication/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2.log
[2015-10-18 10:42:44.890366] W
[master(/mnt/sda4/brick-storage-3):1041:process] _GMaster:
incomplete sync, retrying changelogs: CHANGELOG.1445022009
CHANGELOG.1445022024 CHANGELOG.1445022039 CHANGELOG.1445022054
CHANGELOG.1445022069 CHAN
GELOG.1445022084 CHANGELOG.1445022099 CHANGELOG.1445022114
CHANGELOG.1445022129 CHANGELOG.1445022144 CHANGELOG.1445022159
CHANGELOG.1445022174 CHANGELOG.1445022189 CHANGELOG.1445022204
CHANGELOG.1445022219 CHANGELOG.1445022234 CHANGELOG.
1445022249 CHANGELOG.1445022264 CHANGELOG.1445022279
CHANGELOG.1445022294 CHANGELOG.1445022309 CHANGELOG.1445022324
CHANGELOG.1445022339 CHANGELOG.1445022354 CHANGELOG.1445022369
CHANGELOG.1445022384 CHANGELOG.1445022399 CHANGELOG.144502
2414 CHANGELOG.1445022429 CHANGELOG.1445022444
CHANGELOG.1445022459 CHANGELOG.1445022474 CHANGELOG.1445022489
CHANGELOG.1445022504 CHANGELOG.1445022519 CHANGELOG.1445022534
CHANGELOG.1445022549 CHANGELOG.1445022564 CHANGELOG.1445022579
C
HANGELOG.1445022594 CHANGELOG.1445022609
CHANGELOG.1445022624 CHANGELOG.1445022639 CHANGELOG.1445022654
CHANGELOG.1445022669 CHANGELOG.1445022684 CHANGELOG.1445022699
CHANGELOG.1445022714 CHANGELOG.1445022729 CHANGELOG.1445022744
CHANGEL
OG.1445022759 CHANGELOG.1445022774 CHANGELOG.1445022789
CHANGELOG.1445022805 CHANGELOG.1445022820 CHANGELOG.1445022835
CHANGELOG.1445022850 CHANGELOG.1445022865 CHANGELOG.1445022880
CHANGELOG.1445022895 CHANGELOG.1445022910 CHANGELOG.144
5022925 CHANGELOG.1445022940 CHANGELOG.1445022955
CHANGELOG.1445022970 CHANGELOG.1445022985 CHANGELOG.1445023000
CHANGELOG.1445023015 CHANGELOG.1445023030 CHANGELOG.1445023045
CHANGELOG.1445023060 CHANGELOG.1445023075 CHANGELOG.144502309
0 CHANGELOG.1445023105 CHANGELOG.1445023120
CHANGELOG.1445023135 CHANGELOG.1445023150 CHANGELOG.1445023165
CHANGELOG.1445023180 CHANGELOG.1445023195 CHANGELOG.1445023210
CHANGELOG.1445023225 CHANGELOG.1445023240 CHANGELOG.1445023255
CHAN
GELOG.1445023270 CHANGELOG.1445023285 CHANGELOG.1445023300
CHANGELOG.1445023315 CHANGELOG.1445023330 CHANGELOG.1445023345
CHANGELOG.1445023360 CHANGELOG.1445023375 CHANGELOG.1445023390
CHANGELOG.1445023405 CHANGELOG.1445023420 CHANGELOG.
1445023435 CHANGELOG.1445023450 CHANGELOG.1445023465
CHANGELOG.1445023480 CHANGELOG.1445023495 CHANGELOG.1445023510
CHANGELOG.1445023525 CHANGELOG.1445023540 CHANGELOG.1445023555
CHANGELOG.1445023570 CHANGELOG.1445023585 CHANGELOG.144502
3600 CHANGELOG.1445023615 CHANGELOG.1445023630
CHANGELOG.1445023645 CHANGELOG.1445023660 CHANGELOG.1445023675
CHANGELOG.1445023690 CHANGELOG.1445023705 CHANGELOG.1445023720
CHANGELOG.1445023735 CHANGELOG.1445023750 CHANGELOG.1445023765
C
HANGELOG.1445023780 CHANGELOG.1445023796
CHANGELOG.1445023811 CHANGELOG.1445023826 CHANGELOG.1445023841
CHANGELOG.1445023856 CHANGELOG.1445023871 CHANGELOG.1445023886
CHANGELOG.1445023901 CHANGELOG.1445023916 CHANGELOG.1445023931
CHANGEL
OG.1445023946 CHANGELOG.1445023961 CHANGELOG.1445023976
CHANGELOG.1445023991 CHANGELOG.1445024006 CHANGELOG.1445024021
CHANGELOG.1445024036 CHANGELOG.1445024051 CHANGELOG.1445024066
CHANGELOG.1445024081 CHANGELOG.1445024096 CHANGELOG.144
5024111 CHANGELOG.1445024126 CHANGELOG.1445024141
CHANGELOG.1445024156 CHANGELOG.1445024171 CHANGELOG.1445024186
CHANGELOG.1445024201 CHANGELOG.1445024216 CHANGELOG.1445024231
CHANGELOG.1445024246 CHANGELOG.1445024261 CHANGELOG.144502427
6 CHANGELOG.1445024291 CHANGELOG.1445024306
CHANGELOG.1445024321 CHANGELOG.1445024336 CHANGELOG.1445024351
CHANGELOG.1445024366 CHANGELOG.1445024381 CHANGELOG.1445024396
CHANGELOG.1445024411
[2015-10-18 10:43:02.808736] E
[repce(/mnt/sda4/brick-storage-3):207:__call__] RepceClient:
call 24010:140490734180160:1445190182.72 (entry_ops) failed on
peer with OSError
[2015-10-18 10:43:02.809122] E
[syncdutils(/mnt/sda4/brick-storage-3):276:log_raise_exception]
<top>: FAIL:
Traceback (most recent call last):
File
"/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line
165, in main
main_i()
File
"/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line
659, in main_i
local.service_loop(*[r for r in [remote] if r])
File
"/usr/libexec/glusterfs/python/syncdaemon/resource.py", line
1451, in service_loop
g2.crawlwrap()
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
591, in crawlwrap
self.crawl()
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
1106, in crawl
self.changelogs_batch_process(changes)
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
1081, in changelogs_batch_process
self.process(batch)
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
959, in process
self.process_change(change, done, retry)
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
914, in process_change
failures = self.slave.server.entry_ops(entries)
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py",
line 226, in __call__
return self.ins(self.meth, *a)
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py",
line 208, in __call__
raise res
OSError: [Errno 16] Device or resource busy
[2015-10-18 10:43:02.811510] I
[syncdutils(/mnt/sda4/brick-storage-3):220:finalize]
<top>: exiting.
[2015-10-18 10:43:02.819149] I
[repce(agent):92:service_loop] RepceServer: terminating on
reaching EOF.
[2015-10-18 10:43:02.819616] I
[syncdutils(agent):220:finalize] <top>: exiting.
Since that I get the repeating error:
[2015-10-18 10:43:13.73946] I
[monitor(monitor):221:monitor] Monitor:
------------------------------------------------------------
[2015-10-18 10:43:13.74423] I
[monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-10-18 10:43:13.177963] I
[changelogagent(agent):75:__init__] ChangelogAgent: Agent
listining...
[2015-10-18 10:43:17.93168] I
[master(/mnt/sda4/brick-storage-3):83:gmaster_builder]
<top>: setting up xsync change detection mode
[2015-10-18 10:43:17.93669] I
[master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster:
using 'rsync' as the sync engine
[2015-10-18 10:43:17.95079] I
[master(/mnt/sda4/brick-storage-3):83:gmaster_builder]
<top>: setting up changelog change detection mode
[2015-10-18 10:43:17.95390] I
[master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster:
using 'rsync' as the sync engine
[2015-10-18 10:43:17.96730] I
[master(/mnt/sda4/brick-storage-3):83:gmaster_builder]
<top>: setting up changeloghistory change detection mode
[2015-10-18 10:43:17.97048] I
[master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster:
using 'rsync' as the sync engine
[2015-10-18 10:43:19.915657] I
[master(/mnt/sda4/brick-storage-3):1220:register] _GMaster:
xsync temp directory:
/var/lib/misc/glusterfsd/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2/535626c1c4cea4957a
54c920f955b1ac/xsync
[2015-10-18 10:43:19.916011] I
[resource(/mnt/sda4/brick-storage-3):1432:service_loop]
GLUSTER: Register time: 1445190199
[2015-10-18 10:43:19.935817] I
[master(/mnt/sda4/brick-storage-3):530:crawlwrap] _GMaster:
primary master with volume id
054d6225-39fc-40f5-9604-5ee4b6dcd8b4 ...
[2015-10-18 10:43:19.938358] I
[master(/mnt/sda4/brick-storage-3):539:crawlwrap] _GMaster:
crawl interval: 1 seconds
[2015-10-18 10:43:19.942023] I
[master(/mnt/sda4/brick-storage-3):1135:crawl] _GMaster:
starting history crawl... turns: 1, stime: (1445021993, 0),
etime: 1445190199
[2015-10-18 10:43:20.975270] I
[master(/mnt/sda4/brick-storage-3):1164:crawl] _GMaster:
slave's time: (1445021993, 0)
[2015-10-18 10:43:38.50978] E
[repce(/mnt/sda4/brick-storage-3):207:__call__] RepceClient:
call 13364:140337327839040:1445190217.96 (entry_ops) failed on
peer with OSError
[2015-10-18 10:43:38.51417] E
[syncdutils(/mnt/sda4/brick-storage-3):276:log_raise_exception]
<top>: FAIL:
Traceback (most recent call last):
File
"/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line
165, in main
main_i()
File
"/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line
659, in main_i
local.service_loop(*[r for r in [remote] if r])
File
"/usr/libexec/glusterfs/python/syncdaemon/resource.py", line
1438, in service_loop
g3.crawlwrap(_oneshot_=True)
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
591, in crawlwrap
self.crawl()
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
1173, in crawl
self.changelogs_batch_process(changes)
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
1081, in changelogs_batch_process
self.process(batch)
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
959, in process
self.process_change(change, done, retry)
File
"/usr/libexec/glusterfs/python/syncdaemon/master.py", line
914, in process_change
failures = self.slave.server.entry_ops(entries)
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py",
line 226, in __call__
return self.ins(self.meth, *a)
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py",
line 208, in __call__
raise res
OSError: [Errno 16] Device or resource busy
[2015-10-18 10:43:38.53669] I
[syncdutils(/mnt/sda4/brick-storage-3):220:finalize]
<top>: exiting.
[2015-10-18 10:43:38.56446] I
[repce(agent):92:service_loop] RepceServer: terminating on
reaching EOF.
[2015-10-18 10:43:38.56844] I
[syncdutils(agent):220:finalize] <top>: exiting.
Any suggestions would be very appreciated, thanks
With regards to all community,
--
Rodion