Geo-Replication just won't go past 1TB - is half of the replication process crashing?

I'm trying to create a new Geo-Rep of about 3 TB of data currently stored
in a two-brick mirror (replica 2) config. The geo-rep destination is, of
course, a third server.
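
For reference, I set the session up with more or less the following commands
(from memory, so the exact invocation may be slightly off):

root@gfs6:~# gluster volume geo-replication docstore1 ssh://root@backup-ds2.gluster:/data/docstore1 start
root@gfs6:~# gluster volume geo-replication docstore1 ssh://root@backup-ds2.gluster:/data/docstore1 config log-level DEBUG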

This is my 150th attempt.  Okay, maybe not that far, but it's pretty darn
bad.

Replication works fine until I hit around 1 TB of data sync'd, and then it
just stalls.  For the past two days it hasn't gone past 1050156672 bytes
sync'd to the destination server.

I did some digging in the logs and it looks like the brick that's running
the geo-rep process thinks it's syncing:

[2013-09-05 09:45:37.354831] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/00000863.enc ...
[2013-09-05 09:45:37.358669] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/0000083b.enc ...
[2013-09-05 09:45:37.362251] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/0000087b.enc ...
[2013-09-05 09:45:37.366027] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/00000834.enc ...
[2013-09-05 09:45:37.369752] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/00000845.enc ...
[2013-09-05 09:45:37.373528] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/00000864.enc ...
[2013-09-05 09:45:37.377037] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/0000087f.enc ...
[2013-09-05 09:45:37.391432] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/00000897.enc ...
[2013-09-05 09:45:37.395059] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/00000829.enc ...
[2013-09-05 09:45:37.398725] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/00000816.enc ...
[2013-09-05 09:45:37.402559] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/000008cc.enc ...
[2013-09-05 09:45:37.406450] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/000008d2.enc ...
[2013-09-05 09:45:37.410310] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/000008df.enc ...
[2013-09-05 09:45:37.414344] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/00/00/08/000008bd.enc ...
[2013-09-05 09:45:37.438173] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/volume.info ...
[2013-09-05 09:45:37.441675] D [master:386:crawl] GMaster: syncing
./evds3/Sky_Main_66/volume.enc ...

But *those files never appear on the destination server*; the containing
folders are created on the slave, but they're empty.
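
To double-check that, I've been running something along these lines against
the slave brick, and the listing comes back with nothing but the empty
directory:

root@gfs6:~# ssh -i /var/lib/glusterd/geo-replication/secret.pem root@backup-ds2.gluster 'ls -la /data/docstore1/evds3/Sky_Main_66/00/00/08/'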

Also, the other log file (...gluster.log) in the geo-replication log folder,
the one that matches the destination, apparently stopped updating when the
syncing stopped.  Its last timestamp is from the 2nd, which is the last time
data transferred.
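
A quick listing of the log directory shows the same picture: the main geo-rep
log is still being written to, but the .gluster.log file's mtime is stuck at
the 2nd.  Something like:

root@gfs6:~# ls -l /var/log/glusterfs/geo-replication/docstore1/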

The last bit from that log file is as follows:

+------------------------------------------------------------------------------+
[2013-09-02 06:37:50.109730] I [rpc-clnt.c:1654:rpc_clnt_reconfig]
0-docstore1-client-1: changing port to 24009 (from 0)
[2013-09-02 06:37:50.109857] I [rpc-clnt.c:1654:rpc_clnt_reconfig]
0-docstore1-client-0: changing port to 24009 (from 0)
[2013-09-02 06:37:54.097468] I
[client-handshake.c:1614:select_server_supported_programs]
0-docstore1-client-1: Using Program GlusterFS 3.3.2, Num (1298437), Version
(330)
[2013-09-02 06:37:54.097973] I
[client-handshake.c:1411:client_setvolume_cbk] 0-docstore1-client-1:
Connected to 10.200.1.6:24009, attached to remote volume '/data/docstore1'.
[2013-09-02 06:37:54.098005] I
[client-handshake.c:1423:client_setvolume_cbk] 0-docstore1-client-1: Server
and Client lk-version numbers are not same, reopening the fds
[2013-09-02 06:37:54.098094] I [afr-common.c:3685:afr_notify]
0-docstore1-replicate-0: Subvolume 'docstore1-client-1' came back up; going
online.
[2013-09-02 06:37:54.098274] I
[client-handshake.c:453:client_set_lk_version_cbk] 0-docstore1-client-1:
Server lk version = 1
[2013-09-02 06:37:54.098619] I
[client-handshake.c:1614:select_server_supported_programs]
0-docstore1-client-0: Using Program GlusterFS 3.3.2, Num (1298437), Version
(330)
[2013-09-02 06:37:54.099191] I
[client-handshake.c:1411:client_setvolume_cbk] 0-docstore1-client-0:
Connected to 10.200.1.5:24009, attached to remote volume '/data/docstore1'.
[2013-09-02 06:37:54.099222] I
[client-handshake.c:1423:client_setvolume_cbk] 0-docstore1-client-0: Server
and Client lk-version numbers are not same, reopening the fds
[2013-09-02 06:37:54.105891] I [fuse-bridge.c:4191:fuse_graph_setup]
0-fuse: switched to graph 0
[2013-09-02 06:37:54.106039] I
[client-handshake.c:453:client_set_lk_version_cbk] 0-docstore1-client-0:
Server lk version = 1
[2013-09-02 06:37:54.106179] I [fuse-bridge.c:3376:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel
7.17
[2013-09-02 06:37:54.108766] I
[afr-common.c:2022:afr_set_root_inode_on_first_lookup]
0-docstore1-replicate-0: added root inode


This is driving me nuts - I've been working on getting Geo-Replication
working for over 2 months now without any success.

Status on the geo-rep shows OK:

root@gfs6:~# gluster volume geo-replication docstore1 ssh://root@backup-ds2.gluster:/data/docstore1 status
MASTER               SLAVE                                              STATUS
--------------------------------------------------------------------------------
docstore1            ssh://root@backup-ds2.gluster:/data/docstore1      OK

root@gfs6:~#
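
Status says OK, but I'm not convinced the worker is actually doing anything,
so I've also been poking at the processes on the master, roughly like this
(the strace is just to see whether the worker or its rsync child is blocked
on something; <worker-pid> being whatever pid the ps shows):

root@gfs6:~# ps -ef | grep -E 'gsyncd|rsync' | grep -v grep
root@gfs6:~# strace -f -p <worker-pid>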

Here's the config:

root@gfs6:~# gluster volume geo-replication docstore1 ssh://root@backup-ds2.gluster:/data/docstore1 config
log_level: DEBUG
gluster_log_file:
/var/log/glusterfs/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
/var/lib/glusterd/geo-replication/secret.pem
session_owner: 24f8c92d-723e-4513-9593-40ef4b7e766a
remote_gsyncd: /usr/lib/glusterfs/glusterfs/gsyncd
state_file:
/var/lib/glusterd/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.status
gluster_command_dir: /usr/sbin/
pid_file:
/var/lib/glusterd/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.pid
log_file:
/var/log/glusterfs/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.log
gluster_params: xlator-option=*-dht.assert-no-child-down=true
root@gfs6:~#
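
For what it's worth, pasting the ssh_command and remote_gsyncd values above
together and running them by hand seems like a reasonable sanity check of the
slave side, something like this (assuming gsyncd accepts --version, which I
believe it does):

root@gfs6:~# ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem root@backup-ds2.gluster /usr/lib/glusterfs/glusterfs/gsyncd --version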

I'm running the Ubuntu packages 3.3.2-ubuntu1-precise2 from the PPA.  Any
ideas why it's stalling?