Correction: The difference between the two local bricks is more likely a result of the ZFS snapshot process I've run on the larger of them.

On Fri, Sep 6, 2013 at 11:39 AM, Tony Maro <tonym at evrichart.com> wrote:

> More stats in case you need them:
>
> Grand total of the /data/docstore1 tree that's being sync'd is about 8.5 million files.
> About 1 KB of new data has appeared on the destination server since we started this discussion yesterday...
>
> tony at backup-ds2:~$ df
> Filesystem      1K-blocks       Used  Available Use% Mounted on
> /dev/sda1       768989240    1841364  728085452   1% /
> udev             16450080          4   16450076   1% /dev
> tmpfs             6583556        348    6583208   1% /run
> none                 5120          0       5120   0% /run/lock
> none             16458888          0   16458888   0% /run/shm
> data           6601579264          0 6601579264   0% /data
> data/docstore1 7651735936 1050156672 6601579264  14% /data/docstore1
> tony at backup-ds2:~$ date
> Thu Sep 5 09:36:24 EDT 2013
>
> tony at backup-ds2:~$ df
> Filesystem      1K-blocks       Used  Available Use% Mounted on
> /dev/sda1       768989240    1842524  728084292   1% /
> udev             16450080          4   16450076   1% /dev
> tmpfs             6583556        348    6583208   1% /run
> none                 5120          0       5120   0% /run/lock
> none             16458888          0   16458888   0% /run/shm
> data           6601577984          0 6601577984   0% /data
> data/docstore1 7651735808 1050157824 6601577984  14% /data/docstore1
> tony at backup-ds2:~$ date
> Fri Sep 6 11:34:07 EDT 2013
> tony at backup-ds2:~$
>
> The source data I'm attempting to geo-replicate is 3.3 TB. Mirroring between the local bricks seems to be working fine. They are within 20 KB of drive usage of each other for the /data/docstore1 partition. I attribute that to the new geo-rep info on gfs6 making it larger by a bit? Also the data is constantly changing, so maybe it's a bit of replication lag.
>
>
> On Fri, Sep 6, 2013 at 11:29 AM, Tony Maro <tonym at evrichart.com> wrote:
>
>> ./evds3 contains 1,262,970 files. No more than 255 files or folders per subdirectory.
>>
>> Network latency is almost nonexistent - the servers are currently jacked into the same switch on a separate network segment for the initial sync:
>>
>> tony at gfs6:~$ ping backup-ds2.gluster
>> PING backup-ds2.gluster (10.200.1.12) 56(84) bytes of data.
>> 64 bytes from backup-ds2.gluster (10.200.1.12): icmp_req=1 ttl=64 time=0.245 ms
>> 64 bytes from backup-ds2.gluster (10.200.1.12): icmp_req=2 ttl=64 time=0.197 ms
>> 64 bytes from backup-ds2.gluster (10.200.1.12): icmp_req=3 ttl=64 time=0.174 ms
>> 64 bytes from backup-ds2.gluster (10.200.1.12): icmp_req=4 ttl=64 time=0.221 ms
>> 64 bytes from backup-ds2.gluster (10.200.1.12): icmp_req=5 ttl=64 time=0.163 ms
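To make the "has anything new arrived on the slave?" check above less manual, the slave brick can be sampled on a loop. A rough sketch only, assuming the brick path shown in the df output; the script name, log file and interval are arbitrary:

    #!/bin/bash
    # sample-slave.sh - crude progress probe for the geo-rep slave brick.
    # BRICK is the path from the df output above; LOG and the sleep interval
    # are arbitrary choices. Note: find over a large tree takes a while.
    BRICK=/data/docstore1
    LOG=/tmp/slave-progress.log

    while true; do
        used=$(df -P "$BRICK" | awk 'NR==2 {print $3}')        # used 1K-blocks
        files=$(find "$BRICK" -type f 2>/dev/null | wc -l)     # regular files so far
        echo "$(date '+%F %T') used_kb=$used files=$files" >> "$LOG"
        sleep 600
    done

If used_kb and the file count stay flat across samples while the master keeps logging "syncing ...", the worker is crawling without actually transferring anything.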
>>
>>
>> On Fri, Sep 6, 2013 at 5:13 AM, Venky Shankar <yknev.shankar at gmail.com> wrote:
>>
>>> From the trace logs it looks like it's still crawling (the lgetxattr() on the master and lgetxattr() on the slave). How many files are under './evds3'?
>>>
>>> Furthermore, what's the latency b/w the two sites?
>>>
>>>
>>> On Thu, Sep 5, 2013 at 11:31 PM, Tony Maro <tonym at evrichart.com> wrote:
>>>
>>>> No rsync process is running. I tested for it several times over 10 seconds.
>>>>
>>>> Here's the trace on feedback. I put it on TinyPaste instead of in everyone's inbox out of courtesy: http://tny.cz/505b09c8
>>>>
>>>> Thanks,
>>>> Tony
>>>>
>>>>
>>>> On Thu, Sep 5, 2013 at 12:24 PM, Venky Shankar <yknev.shankar at gmail.com> wrote:
>>>>
>>>>> So, nothing fishy in the slave logs either.
>>>>>
>>>>> Debugging this would need more information. Let's start with:
>>>>>
>>>>> 1. Is there any rsync process running? (ps auxww | grep rsync)
>>>>>
>>>>> 2. I would need strace logs for the worker process (the process which scans the filesystem and invokes rsync to transfer data). Something like this would do:
>>>>>
>>>>> - get the pid of the worker process
>>>>>     # ps auxww | grep feedback   (a gsyncd.py python process)
>>>>>
>>>>> - trace the pid above
>>>>>     # strace -s 500 -o /tmp/worker.log -f -p <pid>
>>>>>
>>>>> Let the trace run for about 20 seconds. Let's see what's in /tmp/worker.log.
>>>>>
>>>>> Thanks,
>>>>> -venky
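For what it's worth, those two steps can be wrapped in one small script. A rough sketch, not a supported tool: the default match pattern, the python filter, and the 20-second cutoff are assumptions, so adjust them to the actual worker command line (e.g. grep for "feedback" as above):

    #!/bin/bash
    # trace-worker.sh - strace the gsyncd worker for ~20 seconds, following the
    # steps above. Assumes a single matching gsyncd.py worker; pass a more
    # specific pattern as $1 if several gsyncd processes are running.
    PATTERN=${1:-gsyncd}

    pid=$(ps auxww | grep "$PATTERN" | grep python | grep -v grep | awk '{print $2}' | head -n 1)
    if [ -z "$pid" ]; then
        echo "no gsyncd worker matching '$PATTERN' found" >&2
        exit 1
    fi

    echo "tracing pid $pid ..."
    timeout 20 strace -s 500 -o /tmp/worker.log -f -p "$pid"
    echo "done; inspect /tmp/worker.log"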
>>>>>
>>>>>
>>>>> On Thu, Sep 5, 2013 at 7:56 PM, Tony Maro <tonym at evrichart.com> wrote:
>>>>>
>>>>>> Looks like the slave stopped logging things just before the sending brick did:
>>>>>>
>>>>>> [2013-09-01 14:56:45.13972] I [gsyncd(slave):354:main_i] <top>: syncing: file:///data/docstore1
>>>>>> [2013-09-01 14:56:45.15433] I [resource(slave):453:service_loop] FILE: slave listening
>>>>>> [2013-09-01 15:57:34.592938] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
>>>>>> [2013-09-01 15:57:34.593383] I [syncdutils(slave):142:finalize] <top>: exiting.
>>>>>> [2013-09-01 15:57:45.374301] I [gsyncd(slave):354:main_i] <top>: syncing: file:///data/docstore1
>>>>>> [2013-09-01 15:57:45.375871] I [resource(slave):453:service_loop] FILE: slave listening
>>>>>> [2013-09-01 16:02:09.115976] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
>>>>>> [2013-09-01 16:02:09.116446] I [syncdutils(slave):142:finalize] <top>: exiting.
>>>>>> [2013-09-01 16:02:19.869340] I [gsyncd(slave):354:main_i] <top>: syncing: file:///data/docstore1
>>>>>> [2013-09-01 16:02:19.870598] I [resource(slave):453:service_loop] FILE: slave listening
>>>>>> [2013-09-01 16:32:58.701617] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
>>>>>> [2013-09-01 16:32:58.702126] I [syncdutils(slave):142:finalize] <top>: exiting.
>>>>>> [2013-09-01 16:33:09.456556] I [gsyncd(slave):354:main_i] <top>: syncing: file:///data/docstore1
>>>>>> [2013-09-01 16:33:09.458009] I [resource(slave):453:service_loop] FILE: slave listening
>>>>>> [2013-09-01 21:47:13.442514] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
>>>>>> [2013-09-01 21:47:13.442899] I [syncdutils(slave):142:finalize] <top>: exiting.
>>>>>> [2013-09-01 21:47:24.240978] I [gsyncd(slave):354:main_i] <top>: syncing: file:///data/docstore1
>>>>>> [2013-09-01 21:47:24.242424] I [resource(slave):453:service_loop] FILE: slave listening
>>>>>> [2013-09-02 02:12:14.26339] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
>>>>>> [2013-09-02 02:12:14.26809] I [syncdutils(slave):142:finalize] <top>: exiting.
>>>>>> [2013-09-02 02:12:24.818355] I [gsyncd(slave):354:main_i] <top>: syncing: file:///data/docstore1
>>>>>> [2013-09-02 02:12:24.820008] I [resource(slave):453:service_loop] FILE: slave listening
>>>>>> [2013-09-02 02:16:14.525187] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
>>>>>> [2013-09-02 02:16:14.525675] I [syncdutils(slave):142:finalize] <top>: exiting.
>>>>>> [2013-09-02 02:16:25.263712] I [gsyncd(slave):354:main_i] <top>: syncing: file:///data/docstore1
>>>>>> [2013-09-02 02:16:25.265168] I [resource(slave):453:service_loop] FILE: slave listening
>>>>>> [2013-09-02 02:37:39.315608] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
>>>>>> [2013-09-02 02:37:39.316071] I [syncdutils(slave):142:finalize] <top>: exiting.
>>>>>> [2013-09-02 02:37:50.78136] I [gsyncd(slave):354:main_i] <top>: syncing: file:///data/docstore1
>>>>>> [2013-09-02 02:37:50.79577] I [resource(slave):453:service_loop] FILE: slave listening
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 5, 2013 at 10:09 AM, Venky Shankar <yknev.shankar at gmail.com> wrote:
>>>>>>
>>>>>>> Could you also provide the slave logs? (log location on the slave: /var/log/glusterfs/geo-replication-slaves)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -venky
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 5, 2013 at 7:29 PM, Tony Maro <tonym at evrichart.com> wrote:
>>>>>>>
>>>>>>>> I'm trying to create a new Geo-Rep of about 3 TB of data currently stored in a 2 brick mirror config. Obviously the geo-rep destination is a third server.
>>>>>>>>
>>>>>>>> This is my 150th attempt. Okay, maybe not that far, but it's pretty darn bad.
>>>>>>>>
>>>>>>>> Replication works fine until I hit around 1 TB of data sync'd, then it just stalls. For the past two days it hasn't gone past 1050156672 bytes sync'd to the destination server.
>>>>>>>>
>>>>>>>> I did some digging in the logs and it looks like the brick that's running the geo-rep process thinks it's syncing:
>>>>>>>>
>>>>>>>> [2013-09-05 09:45:37.354831] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/00000863.enc ...
>>>>>>>> [2013-09-05 09:45:37.358669] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/0000083b.enc ...
>>>>>>>> [2013-09-05 09:45:37.362251] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/0000087b.enc ...
>>>>>>>> [2013-09-05 09:45:37.366027] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/00000834.enc ...
>>>>>>>> [2013-09-05 09:45:37.369752] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/00000845.enc ...
>>>>>>>> [2013-09-05 09:45:37.373528] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/00000864.enc ...
>>>>>>>> [2013-09-05 09:45:37.377037] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/0000087f.enc ...
>>>>>>>> [2013-09-05 09:45:37.391432] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/00000897.enc ...
>>>>>>>> [2013-09-05 09:45:37.395059] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/00000829.enc ...
>>>>>>>> [2013-09-05 09:45:37.398725] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/00000816.enc ...
>>>>>>>> [2013-09-05 09:45:37.402559] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/000008cc.enc ...
>>>>>>>> [2013-09-05 09:45:37.406450] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/000008d2.enc ...
>>>>>>>> [2013-09-05 09:45:37.410310] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/000008df.enc ...
>>>>>>>> [2013-09-05 09:45:37.414344] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/00/00/08/000008bd.enc ...
>>>>>>>> [2013-09-05 09:45:37.438173] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/volume.info ...
>>>>>>>> [2013-09-05 09:45:37.441675] D [master:386:crawl] GMaster: syncing ./evds3/Sky_Main_66/volume.enc ...
>>>>>>>>
>>>>>>>> But *those files never appear on the destination server*; the containing folders are there, just empty.
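A quick sanity check at this point is to take a path straight out of that "syncing ..." log and look for it on the slave brick. A rough sketch only, using the host and brick path mentioned elsewhere in this thread and two sample paths from the log excerpt above:

    #!/bin/bash
    # check-synced.sh - spot-check whether files the master claims to be "syncing"
    # actually exist on the slave brick. SLAVE and BRICK are the values used in
    # this thread; if plain root ssh is not set up, the geo-rep key could be
    # passed with: -i /var/lib/glusterd/geo-replication/secret.pem
    SLAVE=backup-ds2.gluster
    BRICK=/data/docstore1

    for f in ./evds3/Sky_Main_66/00/00/08/00000863.enc \
             ./evds3/Sky_Main_66/volume.info; do
        # print name and size if present, otherwise flag it as missing
        ssh root@"$SLAVE" "stat -c '%n %s bytes' '$BRICK/${f#./}'" 2>/dev/null \
            || echo "missing on slave: $f"
    done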
>>>>>>>>
>>>>>>>> Also, the other log file (...gluster.log) in the geo-replication log folder that matches the destination apparently stopped updating when the syncing stopped. Its last timestamp is from the 2nd, which is the last time data transferred.
>>>>>>>>
>>>>>>>> The last bit from that log file is as such:
>>>>>>>>
>>>>>>>> +------------------------------------------------------------------------------+
>>>>>>>> [2013-09-02 06:37:50.109730] I [rpc-clnt.c:1654:rpc_clnt_reconfig] 0-docstore1-client-1: changing port to 24009 (from 0)
>>>>>>>> [2013-09-02 06:37:50.109857] I [rpc-clnt.c:1654:rpc_clnt_reconfig] 0-docstore1-client-0: changing port to 24009 (from 0)
>>>>>>>> [2013-09-02 06:37:54.097468] I [client-handshake.c:1614:select_server_supported_programs] 0-docstore1-client-1: Using Program GlusterFS 3.3.2, Num (1298437), Version (330)
>>>>>>>> [2013-09-02 06:37:54.097973] I [client-handshake.c:1411:client_setvolume_cbk] 0-docstore1-client-1: Connected to 10.200.1.6:24009, attached to remote volume '/data/docstore1'.
>>>>>>>> [2013-09-02 06:37:54.098005] I [client-handshake.c:1423:client_setvolume_cbk] 0-docstore1-client-1: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>> [2013-09-02 06:37:54.098094] I [afr-common.c:3685:afr_notify] 0-docstore1-replicate-0: Subvolume 'docstore1-client-1' came back up; going online.
>>>>>>>> [2013-09-02 06:37:54.098274] I [client-handshake.c:453:client_set_lk_version_cbk] 0-docstore1-client-1: Server lk version = 1
>>>>>>>> [2013-09-02 06:37:54.098619] I [client-handshake.c:1614:select_server_supported_programs] 0-docstore1-client-0: Using Program GlusterFS 3.3.2, Num (1298437), Version (330)
>>>>>>>> [2013-09-02 06:37:54.099191] I [client-handshake.c:1411:client_setvolume_cbk] 0-docstore1-client-0: Connected to 10.200.1.5:24009, attached to remote volume '/data/docstore1'.
>>>>>>>> [2013-09-02 06:37:54.099222] I [client-handshake.c:1423:client_setvolume_cbk] 0-docstore1-client-0: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>> [2013-09-02 06:37:54.105891] I [fuse-bridge.c:4191:fuse_graph_setup] 0-fuse: switched to graph 0
>>>>>>>> [2013-09-02 06:37:54.106039] I [client-handshake.c:453:client_set_lk_version_cbk] 0-docstore1-client-0: Server lk version = 1
>>>>>>>> [2013-09-02 06:37:54.106179] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.17
>>>>>>>> [2013-09-02 06:37:54.108766] I [afr-common.c:2022:afr_set_root_inode_on_first_lookup] 0-docstore1-replicate-0: added root inode
>>>>>>>>
>>>>>>>> This is driving me nuts - I've been working on getting Geo-Replication working for over 2 months now without any success.
>>>>>>>>
>>>>>>>> Status on the geo-rep shows OK:
>>>>>>>>
>>>>>>>> root at gfs6:~# gluster volume geo-replication docstore1 ssh://root at backup-ds2.gluster:/data/docstore1 status
>>>>>>>> MASTER       SLAVE                                           STATUS
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> docstore1    ssh://root at backup-ds2.gluster:/data/docstore1   OK
>>>>>>>> root at gfs6:~#
>>>>>>>>
>>>>>>>> Here's the config:
>>>>>>>>
>>>>>>>> root at gfs6:~# gluster volume geo-replication docstore1 ssh://root at backup-ds2.gluster:/data/docstore1 config
>>>>>>>> log_level: DEBUG
>>>>>>>> gluster_log_file: /var/log/glusterfs/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.gluster.log
>>>>>>>> ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
>>>>>>>> session_owner: 24f8c92d-723e-4513-9593-40ef4b7e766a
>>>>>>>> remote_gsyncd: /usr/lib/glusterfs/glusterfs/gsyncd
>>>>>>>> state_file: /var/lib/glusterd/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.status
>>>>>>>> gluster_command_dir: /usr/sbin/
>>>>>>>> pid_file: /var/lib/glusterd/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.pid
>>>>>>>> log_file: /var/log/glusterfs/geo-replication/docstore1/ssh%3A%2F%2Froot%4010.200.1.12%3Afile%3A%2F%2F%2Fdata%2Fdocstore1.log
>>>>>>>> gluster_params: xlator-option=*-dht.assert-no-child-down=true
>>>>>>>> root at gfs6:~#
>>>>>>>>
>>>>>>>> I'm running Ubuntu packages 3.3.2-ubuntu1-precise2 from the PPA. Any ideas for why it's stalling?
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Gluster-users mailing list
>>>>>>>> Gluster-users at gluster.org
>>>>>>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>>
>>>>>> *Tony Maro*
>>>>>> Chief Information Officer
>>>>>> EvriChart | www.evrichart.com
>>>>>> Advanced Records Management
>>>>>> Office | 888.801.2020 | 304.536.1290
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gluster-users mailing list
>>>>>> Gluster-users at gluster.org
>>>>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>>>>
>>>>
>>>> --
>>>> Thanks,
>>>>
>>>> *Tony Maro*
>>>> Chief Information Officer
>>>> EvriChart | www.evrichart.com
>>>> Advanced Records Management
>>>> Office | 888.801.2020 | 304.536.1290
>>>
>>
>> --
>> Thanks,
>>
>> *Tony Maro*
>> Chief Information Officer
>> EvriChart | www.evrichart.com
>> Advanced Records Management
>> Office | 888.801.2020 | 304.536.1290
>
> --
> Thanks,
>
> *Tony Maro*
> Chief Information Officer
> EvriChart | www.evrichart.com
> Advanced Records Management
> Office | 888.801.2020 | 304.536.1290

--
Thanks,

*Tony Maro*
Chief Information Officer
EvriChart | www.evrichart.com
Advanced Records Management
Office | 888.801.2020 | 304.536.1290