Geo-Replication just won't go past 1TB - half the replication process crashing?

Could you also provide the slave logs? (log location on the
slave: /var/log/glusterfs/geo-replication-slaves)
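
If it helps, something like the following run on the slave should bundle
everything under that directory (path as above; the session-specific
filenames vary, so this just grabs the whole tree -- a sketch, not an
official tool):

```shell
#!/bin/sh
# Run on the slave: bundle the geo-replication slave logs for sharing.
# LOGDIR is the default location mentioned above; adjust if yours differs.
LOGDIR=/var/log/glusterfs/geo-replication-slaves
if [ -d "$LOGDIR" ]; then
    tar czf /tmp/georep-slave-logs.tar.gz -C "$LOGDIR" .
    echo "bundled logs from $LOGDIR"
else
    echo "no slave log directory at $LOGDIR"
fi
```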

Thanks,
-venky


On Thu, Sep 5, 2013 at 7:29 PM, Tony Maro <tonym at evrichart.com> wrote:

> I'm trying to create a new Geo-Rep of about 3 TB of data currently stored
> in a 2 brick mirror config. Obviously the geo-rep destination is a third
> server.
>
> This is my 150th attempt.  Okay, maybe not that far, but it's pretty darn
> bad.
>
> Replication works fine until I hit around 1TB of data sync'd, then it just
> stalls.  For the past two days it hasn't gone past 1050156672 bytes sync'd
> to the destination server.
>
> I did some digging in the logs and it looks like the brick that's running
> the geo-rep process thinks it's syncing:
>
> [2013-09-05 09:45:37.354831] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/00000863.enc ...
> [2013-09-05 09:45:37.358669] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/0000083b.enc ...
> [2013-09-05 09:45:37.362251] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/0000087b.enc ...
> [2013-09-05 09:45:37.366027] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/00000834.enc ...
> [2013-09-05 09:45:37.369752] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/00000845.enc ...
> [2013-09-05 09:45:37.373528] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/00000864.enc ...
> [2013-09-05 09:45:37.377037] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/0000087f.enc ...
> [2013-09-05 09:45:37.391432] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/00000897.enc ...
> [2013-09-05 09:45:37.395059] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/00000829.enc ...
> [2013-09-05 09:45:37.398725] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/00000816.enc ...
> [2013-09-05 09:45:37.402559] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/000008cc.enc ...
> [2013-09-05 09:45:37.406450] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/000008d2.enc ...
> [2013-09-05 09:45:37.410310] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/000008df.enc ...
> [2013-09-05 09:45:37.414344] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/00/00/08/000008bd.enc ...
> [2013-09-05 09:45:37.438173] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/volume.info ...
> [2013-09-05 09:45:37.441675] D [master:386:crawl] GMaster: syncing
> ./evds3/Sky_Main_66/volume.enc ...
>
> But *those files never appear on the destination server*; the containing
> folders are there, just empty.
>
> Also, the other log file (...gluster.log) in the geo-replication log
> folder that matches the destination apparently stopped updating when the
> syncing stopped.  Its last timestamp is from the 2nd, which is the last
> time data transferred.
>
> The last bit from that log file is as such:
>
>
> +------------------------------------------------------------------------------+
> [2013-09-02 06:37:50.109730] I [rpc-clnt.c:1654:rpc_clnt_reconfig]
> 0-docstore1-client-1: changing port to 24009 (from 0)
> [2013-09-02 06:37:50.109857] I [rpc-clnt.c:1654:rpc_clnt_reconfig]
> 0-docstore1-client-0: changing port to 24009 (from 0)
> [2013-09-02 06:37:54.097468] I
> [client-handshake.c:1614:select_server_supported_programs]
> 0-docstore1-client-1: Using Program GlusterFS 3.3.2, Num (1298437), Version
> (330)
> [2013-09-02 06:37:54.097973] I
> [client-handshake.c:1411:client_setvolume_cbk] 0-docstore1-client-1:
> Connected to 10.200.1.6:24009, attached to remote volume
> '/data/docstore1'.
> [2013-09-02 06:37:54.098005] I
> [client-handshake.c:1423:client_setvolume_cbk] 0-docstore1-client-1: Server
> and Client lk-version numbers are not same, reopening the fds
> [2013-09-02 06:37:54.098094] I [afr-common.c:3685:afr_notify]
> 0-docstore1-replicate-0: Subvolume 'docstore1-client-1' came back up; going
> online.
> [2013-09-02 06:37:54.098274] I
> [client-handshake.c:453:client_set_lk_version_cbk] 0-docstore1-client-1:
> Server lk version = 1
> [2013-09-02 06:37:54.098619] I
> [client-handshake.c:1614:select_server_supported_programs]
> 0-docstore1-client-0: Using Program GlusterFS 3.3.2, Num (1298437), Version
> (330)
> [2013-09-02 06:37:54.099191] I
> [client-handshake.c:1411:client_setvolume_cbk] 0-docstore1-client-0:
> Connected to 10.200.1.5:24009, attached to remote volume
> '/data/docstore1'.
> [2013-09-02 06:37:54.099222] I
> [client-handshake.c:1423:client_setvolume_cbk] 0-docstore1-client-0: Server
> and Client lk-version numbers are not same, reopening the fds
> [2013-09-02 06:37:54.105891] I [fuse-bridge.c:4191:fuse_graph_setup]
> 0-fuse: switched to graph 0
> [2013-09-02 06:37:54.106039] I
> [client-handshake.c:453:client_set_lk_version_cbk] 0-docstore1-client-0:
> Server lk version = 1
> [2013-09-02 06:37:54.106179] I [fuse-bridge.c:3376:fuse_init]
> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel
> 7.17
> [2013-09-02 06:37:54.108766] I
> [afr-common.c:2022:afr_set_root_inode_on_first_lookup]
> 0-docstore1-replicate-0: added root inode
>
>
> This is driving me nuts - I've been trying to get Geo-Replication working
> for over 2 months now without any success.
>
> Status on the geo-rep shows OK:
>
> root at gfs6:~# gluster volume geo-replication docstore1
> ssh://root at backup-ds2.gluster:/data/docstore1 status
> MASTER               SLAVE
>  STATUS
>
> --------------------------------------------------------------------------------
> docstore1            ssh://root at backup-ds2.gluster:/data/docstore1
>  OK
> root at gfs6:~#
>
> Here's the config:
>
> root at gfs6:~# gluster volume geo-replication docstore1
> ssh://root at backup-ds2.gluster:/data/docstore1 config
> log_level: DEBUG
> gluster_log_file:
> /var/log/glusterfs/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.gluster.log
> ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
> /var/lib/glusterd/geo-replication/secret.pem
> session_owner: 24f8c92d-723e-4513-9593-40ef4b7e766a
> remote_gsyncd: /usr/lib/glusterfs/glusterfs/gsyncd
> state_file:
> /var/lib/glusterd/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.status
> gluster_command_dir: /usr/sbin/
> pid_file:
> /var/lib/glusterd/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.pid
> log_file:
> /var/log/glusterfs/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.log
> gluster_params: xlator-option=*-dht.assert-no-child-down=true
> root at gfs6:~#
>
> I'm running Ubuntu packages 3.3.2-ubuntu1-precise2 from the PPA.  Any
> ideas why it's stalling?
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>
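
One thing worth checking on the master while it is in this stalled state
(a quick sketch, assuming the usual setup where the gsyncd worker drives
rsync over ssh): see whether the worker is still alive and whether an
rsync/ssh pair is hung underneath it.

```shell
#!/bin/sh
# On the master brick that runs the geo-rep worker: list any gsyncd or
# rsync processes with their state and elapsed time.  A stalled sync often
# shows an rsync (or its ssh transport) sitting there with a huge etime.
# The bracketed patterns keep grep from matching its own command line.
OUT=$(ps ax -o pid,stat,etime,command | grep -E '[g]syncd|[r]sync' || true)
echo "${OUT:-no gsyncd/rsync processes found}"
```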

