Re: Geo Replication stops replicating

deepu srinivasan <sdeepugd@xxxxxxxxx> · Tue, 4 Jun 2019 17:23:51 +0530

Hi Kotresh,
Please find below the logs for the above error.
Master log snippet
[2019-06-04 11:52:09.254731] I [resource(worker /home/sas/gluster/data/code-misc):1379:connect_remote] SSH: Initializing SSH connection between master and slave...
[2019-06-04 11:52:09.308923] D [repce(worker /home/sas/gluster/data/code-misc):196:push] RepceClient: call 89724:139652759443264:1559649129.31 __repce_version__() ...
[2019-06-04 11:52:09.602792] E [syncdutils(worker /home/sas/gluster/data/code-misc):311:log_raise_exception] <top>: connection to peer is broken
[2019-06-04 11:52:09.603312] E [syncdutils(worker /home/sas/gluster/data/code-misc):805:errlog] Popen: command returned error   cmd=ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-4aL2tc/d893f66e0addc32f7d0080bb503f5185.sock sas@192.168.185.107 /usr/libexec/glusterfs/gsyncd slave code-misc sas@192.168.185.107::code-misc --master-node 192.168.185.106 --master-node-id 851b64d0-d885-4ae9-9b38-ab5b15db0fec --master-brick /home/sas/gluster/data/code-misc --local-node 192.168.185.122 --local-node-id bcaa7af6-c3a1-4411-8e99-4ebecb32eb6a --slave-timeout 120 --slave-log-level DEBUG --slave-gluster-log-level INFO --slave-gluster-command-dir /usr/sbin   error=1
[2019-06-04 11:52:09.614996] I [repce(agent /home/sas/gluster/data/code-misc):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-06-04 11:52:09.615545] D [monitor(monitor):271:monitor] Monitor: worker(/home/sas/gluster/data/code-misc) connected
[2019-06-04 11:52:09.616528] I [monitor(monitor):278:monitor] Monitor: worker died in startup phase brick=/home/sas/gluster/data/code-misc
[2019-06-04 11:52:09.619391] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty

Slave log snippet
[2019-06-04 11:50:09.782668] E [syncdutils(slave 192.168.185.106/home/sas/gluster/data/code-misc):809:logerr] Popen: /usr/sbin/gluster> 2 : failed with this errno (No such file or directory)
[2019-06-04 11:50:11.188167] W [gsyncd(slave 192.168.185.125/home/sas/gluster/data/code-misc):305:main] <top>: Session config file not exists, using the default config	path=/var/lib/glusterd/geo-replication/code-misc_192.168.185.107_code-misc/gsyncd.conf
[2019-06-04 11:50:11.201070] I [resource(slave 192.168.185.125/home/sas/gluster/data/code-misc):1098:connect] GLUSTER: Mounting gluster volume locally...
[2019-06-04 11:50:11.271231] E [resource(slave 192.168.185.125/home/sas/gluster/data/code-misc):1006:handle_mounter] MountbrokerMounter: glusterd answered	mnt=
[2019-06-04 11:50:11.271998] E [syncdutils(slave 192.168.185.125/home/sas/gluster/data/code-misc):805:errlog] Popen: command returned error	cmd=/usr/sbin/gluster --remote-host=localhost system:: mount sas user-map-root=sas aux-gfid-mount acl log-level=INFO log-file=/var/log/glusterfs/geo-replication-slaves/code-misc_192.168.185.107_code-misc/mnt-192.168.185.125-home-sas-gluster-data-code-misc.log volfile-server=localhost volfile-id=code-misc client-pid=-1	error=1
[2019-06-04 11:50:11.272113] E [syncdutils(slave 192.168.185.125/home/sas/gluster/data/code-misc):809:logerr] Popen: /usr/sbin/gluster> 2 : failed with this errno (No such file or directory)
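
When triaging log snippets like the two above, it can help to filter them down to the error entries. The following is a minimal sketch and not part of gsyncd: the regular expression is inferred only from the log lines quoted in this message, it assumes one log entry per line, and `parse_line` and `errors_only` are hypothetical helper names.

```python
import re

# Layout inferred from the snippets above:
# [timestamp] LEVEL [module(role):lineno:function] Logger: message
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"\[(?P<module>\w+)\((?P<role>[^)]*)\):(?P<lineno>\d+):(?P<func>\w+)\]\s+"
    r"(?P<logger>[^:]+):\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None


def errors_only(lines):
    """Keep only the entries logged at level E."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "E"]
```

Feeding it the lines of a log file (for example `errors_only(open(path))`, with `path` pointing at the worker or slave log) returns the E-level entries with their module, line number and message separated out.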

On Tue, Jun 4, 2019 at 5:10 PM deepu srinivasan <sdeepugd@xxxxxxxxx> wrote:
Hi, as discussed, I have upgraded Gluster from version 4.1 to 6.2, but geo-replication failed to start.
It stays in the Faulty state.

On Fri, May 31, 2019, 5:32 PM deepu srinivasan <sdeepugd@xxxxxxxxx> wrote:
Checked the data. It remains at 2708. No progress.

On Fri, May 31, 2019 at 4:36 PM Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:
That means it could be working, and the defunct process might be an old zombie. Could you check whether the data is progressing?

On Fri, May 31, 2019 at 4:29 PM deepu srinivasan <sdeepugd@xxxxxxxxx> wrote:
Hi, when I change the rsync option, the rsync process doesn't seem to start. Only a defunct process is listed in ps aux. Only when I set the rsync option to " " and restart all the processes is the rsync process listed in ps aux.

On Fri, May 31, 2019 at 4:23 PM Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:
Yes, the rsync config option should have fixed this issue.

Could you share the output of the following?

1.   gluster volume geo-replication <MASTERVOL> <SLAVEHOST>::<SLAVEVOL> config rsync-options
2.   ps -ef | grep rsync 

On Fri, May 31, 2019 at 4:11 PM deepu srinivasan <sdeepugd@xxxxxxxxx> wrote:
Done.
We got the following result:
1559298781.338234 write(2, "rsync: link_stat \"/tmp/gsyncd-aux-mount-EEJ_sY/.gfid/3fa6aed8-802e-4efe-9903-8bc171176d88\" failed: No such file or directory (2)", 128
It seems like a file is missing?
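
To pull the missing path and its GFID out of an strace line like the one above, a small parser is enough. This is only a sketch: `parse_link_stat` is a hypothetical helper, and the pattern assumes the escaped-quote formatting that strace shows in this message.

```python
import re

# Matches rsync's link_stat error as it appears inside an strace string,
# where the inner quotes are escaped as \" (the backslash is optional, so
# the same pattern also matches rsync's own unescaped stderr text).
LINK_STAT_RE = re.compile(
    r'link_stat \\?"(?P<path>[^"\\]+)\\?" failed: (?P<reason>.+?) \((?P<errno>\d+)\)'
)


def parse_link_stat(line):
    """Return path, last path component, reason and errno, or None if absent."""
    m = LINK_STAT_RE.search(line)
    if m is None:
        return None
    path = m.group("path")
    return {
        "path": path,
        "gfid": path.rsplit("/", 1)[-1],  # last component under .gfid/
        "reason": m.group("reason"),
        "errno": int(m.group("errno")),
    }
```

A line whose string argument was truncated by strace does not contain the full message, so the function returns None for it; that is why the larger `-s` value matters.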

On Fri, May 31, 2019 at 3:25 PM Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:
Hi,

Could you take the strace with a larger string size? The argument strings are truncated.

strace -s 500 -ttt -T -p <rsync pid> 

On Fri, May 31, 2019 at 3:17 PM deepu srinivasan <sdeepugd@xxxxxxxxx> wrote:
Hi Kotresh, the above-mentioned workaround did not work properly.

On Fri, May 31, 2019 at 3:16 PM deepu srinivasan <sdeepugd@xxxxxxxxx> wrote:
Hi Kotresh, we have tried the above-mentioned rsync option, and we are planning to upgrade to version 6.0.

On Fri, May 31, 2019 at 11:04 AM Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:
Hi,

This looks like a hang caused by the stderr buffer filling up with error messages while nothing is reading it.
I think this issue is fixed in the latest releases. As a workaround, you can do the following and check whether it works.
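
The general mechanism can be reproduced outside Gluster. The sketch below is not gsyncd's code; it only illustrates that a child process writing more to stderr than the pipe can hold needs a reader on the other end. The child command, the 1 MiB size and the name `run_and_drain` are arbitrary choices for illustration.

```python
import subprocess
import sys

# Child that writes 1 MiB to stderr, more than a typical pipe buffer holds.
CHILD = "import sys; sys.stderr.write('x' * (1 << 20))"


def run_and_drain():
    """Run the child and read stderr to EOF so it never blocks on write."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
    )
    # communicate() keeps reading stderr while waiting for the child to exit.
    # Calling proc.wait() here without reading would risk the hang described
    # above: once the pipe is full, the child blocks in write(2, ...).
    _, err = proc.communicate(timeout=60)
    return proc.returncode, len(err)
```

An rsync process that strace shows sitting in `write(2, ...)` is consistent with this pattern, although the strace alone does not prove it.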

Prerequisite: 
 rsync version should be > 3.1.0

Workaround:
gluster volume geo-replication <MASTERVOL> <SLAVEHOST>::<SLAVEVOL> config rsync-options "--ignore-missing-args"
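
The prerequisite can be checked before setting the option. This is a sketch under assumptions: `parse_rsync_version` and `meets_prerequisite` are hypothetical helpers, the pattern assumes the usual "rsync  version X.Y.Z" first line of `rsync --version`, and the comparison follows the "> 3.1.0" wording above literally.

```python
import re


def parse_rsync_version(version_output):
    """Extract (major, minor, patch) from the output of `rsync --version`."""
    m = re.search(r"rsync\s+version\s+v?(\d+)\.(\d+)\.(\d+)", version_output)
    if not m:
        raise ValueError("could not find an rsync version string")
    return tuple(int(part) for part in m.groups())


def meets_prerequisite(version_output, minimum=(3, 1, 0)):
    """True if the reported version is strictly greater than `minimum`."""
    return parse_rsync_version(version_output) > minimum
```

On a node, the input would come from something like `subprocess.run(["rsync", "--version"], capture_output=True, text=True).stdout`, run on both the master and the slave nodes.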

Thanks,
Kotresh HR

On Thu, May 30, 2019 at 5:39 PM deepu srinivasan <sdeepugd@xxxxxxxxx> wrote:
Hi,
We are evaluating Gluster geo-replication between two DCs, one in US West and one in US East. We ran multiple trials with different file sizes.
Geo-replication tends to stop replicating, yet the status still shows Active, and the slave volume does not grow in size.
So we restarted the geo-replication session and checked the status. It was Active and stayed in History Crawl for a long time. We enabled DEBUG logging and checked for errors.
Around 2000 files appeared as candidates for syncing. The rsync process starts, but nothing is synced to the slave volume. Every time, the rsync process appears in the "ps auxxx" list, but replication does not happen on the slave end. What could be causing this problem? Is there any way to debug it?

We have also checked the strace of the rsync program.
It displays something like this:

"write(2, "rsync: link_stat \"/tmp/gsyncd-au"..., 128"

We are using the following setup:

Gluster version - 4.1.7
Sync mode - rsync
Volume - 1x3 in each end (master and slave)
Intranet Bandwidth - 10 Gig

-- 
Thanks and Regards,
Kotresh H R

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users