Re: Geo-replication stops after 4-5 hours

Marcus Pedersén <marcus.pedersen@xxxxxx> · Sun, 12 Aug 2018 20:18:32 +0000

Hi,
Since geo-replication stopped after 4-5 hours, I added a cron job that, every 6 hours, stops geo-replication, waits 2 minutes and starts it again.
The cron job has been running for 5 days and the changelogs have been catching up.
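
For reference, the restart cycle performed by the cron job can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual script: the names `build_cycle` and `run_cycle` are made up for this example, the master volume and slave URL are taken from the status output quoted further down in the thread, and the `gluster` CLI is assumed to be on the PATH of the user running it.

```python
import subprocess
import time

MASTER_VOL = "urd-gds-volume"                       # master volume from the status output
SLAVE = "geouser@urd-gds-geo-001::urd-gds-volume"   # slave URL from the status output
PAUSE_SECONDS = 120                                 # the 2 minute pause between stop and start


def build_cycle(master_vol, slave):
    """Return the stop and start commands for one geo-replication session."""
    base = ["gluster", "volume", "geo-replication", master_vol, slave]
    return [base + ["stop"], base + ["start"]]


def run_cycle(master_vol=MASTER_VOL, slave=SLAVE, pause=PAUSE_SECONDS,
              runner=subprocess.run, sleep=time.sleep):
    """Stop the session, wait, then start it again (hypothetical helper)."""
    stop_cmd, start_cmd = build_cycle(master_vol, slave)
    runner(stop_cmd, check=False)   # tolerate a failure if the session is already stopped
    sleep(pause)
    runner(start_cmd, check=True)   # raise if the session does not start


# A script run from cron every 6 hours would simply call run_cycle().
```

The `runner` and `sleep` parameters exist only so the cycle can be exercised without touching a real cluster.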

Now a different behavior has shown up.
On one of the active master nodes I get a Python error.
The other active master node has started to toggle between Active and Faulty status.
See the log excerpts below.

The Troubleshooting Geo-replication guide suggests that, when sync is not complete, a full sync of the data can be enforced by erasing the index and restarting GlusterFS geo-replication.
It does not explain how to erase the index.
Should I enforce a full sync?
How do I erase the index?

Thanks a lot!

Best regards
Marcus Pedersén

Node with Python error:
[2018-08-12 16:02:05.304924] I [resource(worker /urd-gds/gluster):1348:connect_remote] SSH: Initializing SSH connection between master and slave...

[2018-08-12 16:02:06.842832] I [resource(worker /urd-gds/gluster):1395:connect_remote] SSH: SSH connection between master and slave established.        duration=1.5376

[2018-08-12 16:02:06.843370] I [resource(worker /urd-gds/gluster):1067:connect] GLUSTER: Mounting gluster volume locally...

[2018-08-12 16:02:07.930706] I [resource(worker /urd-gds/gluster):1090:connect] GLUSTER: Mounted gluster volume duration=1.0869

[2018-08-12 16:02:07.931536] I [subcmds(worker /urd-gds/gluster):70:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor

[2018-08-12 16:02:20.759797] I [master(worker /urd-gds/gluster):1534:register] _GMaster: Working dir    path=/var/lib/misc/gluster/gsyncd/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/urd-gds-gluster

[2018-08-12 16:02:20.760411] I [resource(worker /urd-gds/gluster):1253:service_loop] GLUSTER: Register time     time=1534089740

[2018-08-12 16:02:20.831918] I [gsyncdstatus(worker /urd-gds/gluster):276:set_active] GeorepStatus: Worker Status Change        status=Active

[2018-08-12 16:02:20.835541] I [gsyncdstatus(worker /urd-gds/gluster):248:set_worker_crawl_status] GeorepStatus: Crawl Status Change    status=History Crawl

[2018-08-12 16:02:20.836832] I [master(worker /urd-gds/gluster):1448:crawl] _GMaster: starting history crawl    turns=1 stime=(1523906126, 0)   entry_stime=None        etime=1534089740

[2018-08-12 16:02:21.848570] I [master(worker /urd-gds/gluster):1477:crawl] _GMaster: slave's time      stime=(1523906126, 0)

[2018-08-12 16:02:21.950453] E [syncdutils(worker /urd-gds/gluster):330:log_raise_exception] <top>: FAIL:

Traceback (most recent call last):

  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 360, in twrap

    tf(*aargs)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1880, in syncjob

    po = self.sync_engine(pb, self.log_err)

  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1413, in rsync

    rconf.ssh_ctl_args + \

AttributeError: 'NoneType' object has no attribute 'split'

[2018-08-12 16:02:21.975228] I [repce(agent /urd-gds/gluster):80:service_loop] RepceServer: terminating on reaching EOF.

[2018-08-12 16:02:22.947170] I [monitor(monitor):272:monitor] Monitor: worker died in startup phase     brick=/urd-gds/gluster

[2018-08-12 16:02:22.954096] I [gsyncdstatus(monitor):243:set_worker_status] GeorepStatus: Worker Status Change status=Faulty

[2018-08-12 16:02:32.973948] I [monitor(monitor):158:monitor] Monitor: starting gsyncd worker   brick=/urd-gds/gluster  slave_node=urd-gds-geo-000

[2018-08-12 16:02:33.16155] I [gsyncd(agent /urd-gds/gluster):297:main] <top>: Using session config file        path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf

[2018-08-12 16:02:33.16882] I [changelogagent(agent /urd-gds/gluster):72:__init__] ChangelogAgent: Agent listining...

[2018-08-12 16:02:33.17292] I [gsyncd(worker /urd-gds/gluster):297:main] <top>: Using session config file       path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf

[2018-08-12 16:02:33.26951] I [resource(worker /urd-gds/gluster):1348:connect_remote] SSH: Initializing SSH connection between master and slave...

[2018-08-12 16:02:34.642838] I [resource(worker /urd-gds/gluster):1395:connect_remote] SSH: SSH connection between master and slave established.        duration=1.6156

[2018-08-12 16:02:34.643369] I [resource(worker /urd-gds/gluster):1067:connect] GLUSTER: Mounting gluster volume locally...
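
The `stime` and `etime` values in the "starting history crawl" line above are Unix timestamps. A small sketch for converting them gives a feel for how far behind the slave is. The helper names are made up for this example, and reading `stime` as the point up to which the slave has synced is an assumption based on the "slave's time" log line.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds):
    """Format a Unix timestamp as a UTC date string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def backlog_days(stime, etime):
    """Days between the slave's sync time and the end of the crawl window."""
    return (etime - stime) / 86400.0


stime = 1523906126  # from "stime=(1523906126, 0)"
etime = 1534089740  # from "etime=1534089740"

print(to_utc(stime))                         # 2018-04-16 19:15:26
print(to_utc(etime))                         # 2018-08-12 16:02:20
print(round(backlog_days(stime, etime), 1))  # 117.9
```

The `etime` value matches the log timestamp of the register step, which suggests the log times are in UTC.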

Node that toggles status between active and faulty:
[2018-08-12 19:33:03.475833] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken   duration=0.2757 num_files=27    job=2   return_code=23

[2018-08-12 19:33:04.818854] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken   duration=0.3767 num_files=67    job=1   return_code=23

[2018-08-12 19:33:09.926820] E [repce(worker /urd-gds/gluster):197:__call__] RepceClient: call failed   call=14853:139697829693248:1534102389.64        method=entry_ops        error=GsyncdError

[2018-08-12 19:33:09.927042] E [syncdutils(worker /urd-gds/gluster):298:log_raise_exception] <top>: execution of "gluster" failed with ENOENT (No such file or directory)

[2018-08-12 19:33:09.942267] I [repce(agent /urd-gds/gluster):80:service_loop] RepceServer: terminating on reaching EOF.

[2018-08-12 19:33:10.349848] I [monitor(monitor):272:monitor] Monitor: worker died in startup phase     brick=/urd-gds/gluster

[2018-08-12 19:33:10.363173] I [gsyncdstatus(monitor):243:set_worker_status] GeorepStatus: Worker Status Change status=Faulty

[2018-08-12 19:33:20.386089] I [monitor(monitor):158:monitor] Monitor: starting gsyncd worker   brick=/urd-gds/gluster  slave_node=urd-gds-geo-000

[2018-08-12 19:33:20.456687] I [gsyncd(agent /urd-gds/gluster):297:main] <top>: Using session config file       path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf

[2018-08-12 19:33:20.456686] I [gsyncd(worker /urd-gds/gluster):297:main] <top>: Using session config file      path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf

[2018-08-12 19:33:20.457559] I [changelogagent(agent /urd-gds/gluster):72:__init__] ChangelogAgent: Agent listining...

[2018-08-12 19:33:20.511825] I [resource(worker /urd-gds/gluster):1348:connect_remote] SSH: Initializing SSH connection between master and slave...

[2018-08-12 19:33:22.88713] I [resource(worker /urd-gds/gluster):1395:connect_remote] SSH: SSH connection between master and slave established. duration=1.5766

[2018-08-12 19:33:22.89272] I [resource(worker /urd-gds/gluster):1067:connect] GLUSTER: Mounting gluster volume locally...

[2018-08-12 19:33:23.179249] I [resource(worker /urd-gds/gluster):1090:connect] GLUSTER: Mounted gluster volume duration=1.0896

[2018-08-12 19:33:23.179805] I [subcmds(worker /urd-gds/gluster):70:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor

[2018-08-12 19:33:35.245277] I [master(worker /urd-gds/gluster):1534:register] _GMaster: Working dir    path=/var/lib/misc/gluster/gsyncd/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/urd-gds-gluster

[2018-08-12 19:33:35.246495] I [resource(worker /urd-gds/gluster):1253:service_loop] GLUSTER: Register time     time=1534102415

[2018-08-12 19:33:35.321988] I [gsyncdstatus(worker /urd-gds/gluster):276:set_active] GeorepStatus: Worker Status Change        status=Active

[2018-08-12 19:33:35.324270] I [gsyncdstatus(worker /urd-gds/gluster):248:set_worker_crawl_status] GeorepStatus: Crawl Status Change    status=History Crawl

[2018-08-12 19:33:35.324902] I [master(worker /urd-gds/gluster):1448:crawl] _GMaster: starting history crawl    turns=1 stime=(1525290650, 0)   entry_stime=(1525296245, 0)     etime=1534102415

[2018-08-12 19:33:35.328735] I [master(worker /urd-gds/gluster):1477:crawl] _GMaster: slave's time      stime=(1525290650, 0)

[2018-08-12 19:33:35.574338] I [master(worker /urd-gds/gluster):1301:process] _GMaster: Skipping already processed entry ops    to_changelog=1525290651 num_changelogs=1        from_changelog=1525290651

[2018-08-12 19:33:35.574448] I [master(worker /urd-gds/gluster):1315:process] _GMaster: Entry Time Taken        MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0   CRE=0   duration=0.0000 UNL=0

[2018-08-12 19:33:35.574507] I [master(worker /urd-gds/gluster):1325:process] _GMaster: Data/Metadata Time Taken        SETA=1  SETX=0  meta_duration=0.0249    data_duration=0.2156    DATA="" XATT=0

[2018-08-12 19:33:35.574723] I [master(worker /urd-gds/gluster):1335:process] _GMaster: Batch Completed changelog_end=1525290651        entry_stime=(1525296245, 0)     changelog_start=1525290651      stime=(1525290650, 0)   duration=0.2455 num_changelogs=1        mode=history_changelog

[2018-08-12 19:33:35.582545] I [master(worker /urd-gds/gluster):1477:crawl] _GMaster: slave's time      stime=(1525290650, 0)

[2018-08-12 19:33:35.780823] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken   duration=0.0847 num_files=3     job=2   return_code=23

[2018-08-12 19:33:37.362822] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken   duration=0.0807 num_files=4     job=2   return_code=23

[2018-08-12 19:33:37.818542] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken   duration=0.1098 num_files=11    job=1   return_code=23 
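
Every "Sync Time Taken" line in this excerpt reports `return_code=23`, which is rsync's exit code for a partial transfer due to an error. A minimal sketch for tallying these lines from a gsyncd log follows. The function name is made up for this example, and it assumes the `key=value` layout shown in the excerpt above.

```python
import re
from collections import Counter

FIELD = re.compile(r"(\w+)=(\S+)")  # key=value pairs as they appear in the log lines


def tally_return_codes(lines):
    """Count sync batches and files per rsync return code."""
    batches, files = Counter(), Counter()
    for line in lines:
        if "Sync Time Taken" not in line:
            continue
        fields = dict(FIELD.findall(line))
        code = int(fields["return_code"])
        batches[code] += 1
        files[code] += int(fields["num_files"])
    return batches, files


sample = [
    "[2018-08-12 19:33:03.475833] I [master(worker /urd-gds/gluster):1885:syncjob] "
    "Syncer: Sync Time Taken   duration=0.2757 num_files=27    job=2   return_code=23",
    "[2018-08-12 19:33:04.818854] I [master(worker /urd-gds/gluster):1885:syncjob] "
    "Syncer: Sync Time Taken   duration=0.3767 num_files=67    job=1   return_code=23",
]
batches, files = tally_return_codes(sample)
print(dict(batches), dict(files))  # {23: 2} {23: 94}
```

Passing the lines of the full log file instead of `sample` would show whether any batch finishes with return code 0.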

From: gluster-users-bounces@xxxxxxxxxxx <gluster-users-bounces@xxxxxxxxxxx> on behalf of Marcus Pedersén <marcus.pedersen@xxxxxx>
Sent: 6 August 2018 13:28
To: khiremat@xxxxxxxxxx
Cc: gluster-users@xxxxxxxxxxx
Subject: Re: Geo-replication stops after 4-5 hours

Hi,
Is there a way to resolve the problem with rsync and the hanging processes?
Do I need to kill all the processes and hope that it starts again, or should I stop and start geo-replication?

If I stop and start geo-replication, it does start again; I have tried that before.

Regards
Marcus

From: gluster-users-bounces@xxxxxxxxxxx <gluster-users-bounces@xxxxxxxxxxx> on behalf of Marcus Pedersén <marcus.pedersen@xxxxxx>
Sent: 2 August 2018 10:04
To: Kotresh Hiremath Ravishankar
Cc: gluster-users@xxxxxxxxxxx
Subject: Re: Geo-replication stops after 4-5 hours

Hi Kotresh,

I get the following and then it hangs:

strace: Process 5921 attached
write(2, "rsync: link_stat \"/tmp/gsyncd-au"..., 12811

While sync is running, I can see rsync running as geouser on the slave node.

Regards
Marcus

################

Marcus Pedersén

Systemadministrator 

Interbull Centre

################

Sent from my phone 

################

On 2 Aug 2018 09:31, Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:

Cool, just check whether they are hung by any chance, using the following command.

# strace -f -p 5921

On Thu, Aug 2, 2018 at 12:25 PM, Marcus Pedersén <marcus.pedersen@xxxxxx> wrote:

On both active master nodes there is an rsync process. As in:

root      5921  0.0  0.0 115424  1176 ?        S    Aug01   0:00 rsync -aR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs --xattrs --acls . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-stuphs/bf60c68f1a195dad59573a8dbaa309f2.sock geouser@urd-gds-geo-001:/proc/13077/cwd

There are also SSH tunnels to the slave nodes and gsyncd.py processes.

Regards
Marcus 

On 2 Aug 2018 08:07, Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:

Could you look for any rsync processes that are hung on the master or slave?

On Thu, Aug 2, 2018 at 11:18 AM, Marcus Pedersén <marcus.pedersen@xxxxxx> wrote:

Hi Kotresh,
rsync  version 3.1.2  protocol version 31
All nodes run CentOS 7, updated the last couple of days.

Thanks
Marcus 

On 2 Aug 2018 06:13, Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:

Hi Marcus,

What's the rsync version being used?

Thanks,

Kotresh HR

On Thu, Aug 2, 2018 at 1:48 AM, Marcus Pedersén <marcus.pedersen@xxxxxx> wrote:

Hi all!
I upgraded from 3.12.9 to 4.1.1 and had problems with geo-replication.
With help from the list (some symlinks and so on, handled in another thread), I got geo-replication running.
It ran for 4-5 hours and then stopped. I stopped and started geo-replication, and it ran for another 4-5 hours.
When 4.1.2 was released I updated, hoping this would solve the problem.
I still have the same problem: after a start it runs for 4-5 hours and then it stops.
After that nothing happens. I have waited for days, but still nothing happens.

I have looked through the logs but cannot find anything obvious.

The geo-replication status is Active for the same two nodes all the time:

MASTER NODE    MASTER VOL        MASTER BRICK         SLAVE USER    SLAVE                                      SLAVE NODE         STATUS     CRAWL STATUS     LAST_SYNCED            ENTRY    DATA     META    FAILURES    CHECKPOINT TIME        CHECKPOINT COMPLETED    CHECKPOINT COMPLETION TIME
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
urd-gds-001    urd-gds-volume    /urd-gds/gluster     geouser       geouser@urd-gds-geo-001::urd-gds-volume    urd-gds-geo-000    Active     History Crawl    2018-04-16 20:32:09    0        14205    0       0           2018-07-27 21:12:44    No                      N/A
urd-gds-002    urd-gds-volume    /urd-gds/gluster     geouser       geouser@urd-gds-geo-001::urd-gds-volume    urd-gds-geo-002    Passive    N/A              N/A                    N/A      N/A      N/A     N/A         N/A                    N/A                     N/A
urd-gds-004    urd-gds-volume    /urd-gds/gluster     geouser       geouser@urd-gds-geo-001::urd-gds-volume    urd-gds-geo-002    Passive    N/A              N/A                    N/A      N/A      N/A     N/A         N/A                    N/A                     N/A
urd-gds-003    urd-gds-volume    /urd-gds/gluster     geouser       geouser@urd-gds-geo-001::urd-gds-volume    urd-gds-geo-000    Active     History Crawl    2018-05-01 20:58:14    285      4552     0       0           2018-07-27 21:12:44    No                      N/A
urd-gds-000    urd-gds-volume    /urd-gds/gluster1    geouser       geouser@urd-gds-geo-001::urd-gds-volume    urd-gds-geo-001    Passive    N/A              N/A                    N/A      N/A      N/A     N/A         N/A                    N/A                     N/A
urd-gds-000    urd-gds-volume    /urd-gds/gluster2    geouser       geouser@urd-gds-geo-001::urd-gds-volume    urd-gds-geo-001    Passive    N/A              N/A                    N/A      N/A      N/A     N/A         N/A                    N/A                     N/A

Master cluster is Distribute-Replicate 
2 x (2 + 1) 
Used space 30TB

Slave cluster is Replicate
1 x (2 + 1)
Used space 9TB

Parts of the gsyncd logs are enclosed.

Thanks a lot!

Best regards
Marcus Pedersén

---