Re: [geo-rep] Replication faulty - gsyncd.log OSError: [Errno 13] Permission denied


 



The problem occurred on the slave side, and its error is propagated to the master. Almost any traceback involving repce points to a problem on the slave. Check a few lines above in the log to find the slave node the crashed worker was connected to, then collect the geo-replication logs from that node to debug further.
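For context on why a slave-side error ends up in the master's log: repce serializes the exception raised on the slave and the master re-raises it locally (the `raise res` frame in the repce.py traceback below). This is a simplified, standalone sketch of that pattern, not the actual gluster code:

```python
import pickle

def slave_entry_ops():
    """Pretend slave-side RPC handler that hits a permission error."""
    try:
        raise OSError(13, 'Permission denied',
                      '/bricks/brick1/brick/.glusterfs/29/d1/'
                      '29d1d60d-1ad6-45fc-87e0-93d478f7331e')
    except OSError as exc:
        return pickle.dumps(exc)          # shipped back over the RPC pipe

# Master side: deserialize the result and re-raise/report it, so an
# EACCES whose root cause lives entirely on the slave surfaces in the
# master's gsyncd.log.
res = pickle.loads(slave_entry_ops())
if isinstance(res, Exception):
    print(type(res).__name__, res.errno, res.filename)
```

The takeaway is that the path in the traceback is a path on the slave's brick, so the permission check has to happen there.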





On Fri, 21 Sep 2018, 20:10 Kotte, Christian (Ext), <christian.kotte@xxxxxxxxxxxx> wrote:

Hi,

 

Any idea how to troubleshoot this?

 

New folders and files were created on the master and the replication went faulty. They were created via Samba.

 

Version: GlusterFS 4.1.3

 

[root@master]# gluster volume geo-replication status

 

MASTER NODE                         MASTER VOL     MASTER BRICK            SLAVE USER    SLAVE                                                             SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

master    glustervol1    /bricks/brick1/brick    geoaccount    ssh://geoaccount@slave_1::glustervol1       N/A           Faulty    N/A             N/A

master    glustervol1    /bricks/brick1/brick    geoaccount    ssh://geoaccount@slave_2::glustervol1       N/A           Faulty    N/A             N/A

master    glustervol1    /bricks/brick1/brick    geoaccount    ssh://geoaccount@interimmaster::glustervol1   N/A           Faulty    N/A             N/A

 

The following error is repeatedly logged in the gsyncd.logs:

[2018-09-21 14:26:38.611479] I [repce(agent /bricks/brick1/brick):80:service_loop] RepceServer: terminating on reaching EOF.

[2018-09-21 14:26:39.211527] I [monitor(monitor):279:monitor] Monitor: worker died in startup phase     brick=/bricks/brick1/brick

[2018-09-21 14:26:39.214322] I [gsyncdstatus(monitor):244:set_worker_status] GeorepStatus: Worker Status Change status=Faulty

[2018-09-21 14:26:49.318953] I [monitor(monitor):158:monitor] Monitor: starting gsyncd worker   brick=/bricks/brick1/brick      slave_node=nrchbs-slp2020.nibr.novartis.net

[2018-09-21 14:26:49.471532] I [gsyncd(agent /bricks/brick1/brick):297:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/glustervol1_nrchbs-slp2020.nibr.novartis.net_glustervol1/gsyncd.conf

[2018-09-21 14:26:49.473917] I [changelogagent(agent /bricks/brick1/brick):72:__init__] ChangelogAgent: Agent listining...

[2018-09-21 14:26:49.491359] I [gsyncd(worker /bricks/brick1/brick):297:main] <top>: Using session config file  path=/var/lib/glusterd/geo-replication/glustervol1_nrchbs-slp2020.nibr.novartis.net_glustervol1/gsyncd.conf

[2018-09-21 14:26:49.538049] I [resource(worker /bricks/brick1/brick):1377:connect_remote] SSH: Initializing SSH connection between master and slave...

[2018-09-21 14:26:53.5017] I [resource(worker /bricks/brick1/brick):1424:connect_remote] SSH: SSH connection between master and slave established.      duration=3.4665

[2018-09-21 14:26:53.5419] I [resource(worker /bricks/brick1/brick):1096:connect] GLUSTER: Mounting gluster volume locally...

[2018-09-21 14:26:54.120374] I [resource(worker /bricks/brick1/brick):1119:connect] GLUSTER: Mounted gluster volume     duration=1.1146

[2018-09-21 14:26:54.121012] I [subcmds(worker /bricks/brick1/brick):70:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor

[2018-09-21 14:26:56.144460] I [master(worker /bricks/brick1/brick):1593:register] _GMaster: Working dir        path=/var/lib/misc/gluster/gsyncd/glustervol1_nrchbs-slp2020.nibr.novartis.net_glustervol1/bricks-brick1-brick

[2018-09-21 14:26:56.145145] I [resource(worker /bricks/brick1/brick):1282:service_loop] GLUSTER: Register time time=1537540016

[2018-09-21 14:26:56.160064] I [gsyncdstatus(worker /bricks/brick1/brick):277:set_active] GeorepStatus: Worker Status Change    status=Active

[2018-09-21 14:26:56.161175] I [gsyncdstatus(worker /bricks/brick1/brick):249:set_worker_crawl_status] GeorepStatus: Crawl Status Change        status=History Crawl

[2018-09-21 14:26:56.161536] I [master(worker /bricks/brick1/brick):1507:crawl] _GMaster: starting history crawl        turns=1 stime=(1537522637, 0)   entry_stime=(1537537141, 0)     etime=1537540016

[2018-09-21 14:26:56.164277] I [master(worker /bricks/brick1/brick):1536:crawl] _GMaster: slave's time  stime=(1537522637, 0)

[2018-09-21 14:26:56.197065] I [master(worker /bricks/brick1/brick):1360:process] _GMaster: Skipping already processed entry ops        to_changelog=1537522638 num_changelogs=1        from_changelog=1537522638

[2018-09-21 14:26:56.197402] I [master(worker /bricks/brick1/brick):1374:process] _GMaster: Entry Time Taken    MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0   CRE=0   duration=0.0000 UNL=1

[2018-09-21 14:26:56.197623] I [master(worker /bricks/brick1/brick):1384:process] _GMaster: Data/Metadata Time Taken    SETA=0  SETX=0  meta_duration=0.0000    data_duration=0.0284    DATA="" XATT=0

[2018-09-21 14:26:56.198230] I [master(worker /bricks/brick1/brick):1394:process] _GMaster: Batch Completed     changelog_end=1537522638        entry_stime=(1537537141, 0)     changelog_start=1537522638      stime=(1537522637, 0)   duration=0.0333 num_changelogs=1        mode=history_changelog

[2018-09-21 14:26:57.200436] I [master(worker /bricks/brick1/brick):1536:crawl] _GMaster: slave's time  stime=(1537522637, 0)

[2018-09-21 14:26:57.528625] E [repce(worker /bricks/brick1/brick):197:__call__] RepceClient: call failed       call=17209:140650361157440:1537540017.21        method=entry_ops        error=OSError

[2018-09-21 14:26:57.529371] E [syncdutils(worker /bricks/brick1/brick):332:log_raise_exception] <top>: FAIL:

Traceback (most recent call last):

  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 311, in main

    func(args)

  File "/usr/libexec/glusterfs/python/syncdaemon/subcmds.py", line 72, in subcmd_worker

    local.service_loop(remote)

  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1288, in service_loop

    g3.crawlwrap(_oneshot_=True)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 615, in crawlwrap

    self.crawl()

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1545, in crawl

    self.changelogs_batch_process(changes)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1445, in changelogs_batch_process

    self.process(batch)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1280, in process

    self.process_change(change, done, retry)

  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1179, in process_change

    failures = self.slave.server.entry_ops(entries)

  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 216, in __call__

    return self.ins(self.meth, *a)

  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 198, in __call__

    raise res

OSError: [Errno 13] Permission denied: '/bricks/brick1/brick/.glusterfs/29/d1/29d1d60d-1ad6-45fc-87e0-93d478f7331e'
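The slave node the failing worker was talking to is recorded in the "starting gsyncd worker" line above. A small shell sketch for pulling it out (the sample line is copied from the log above; on a live node you would grep the actual gsyncd.log under /var/log/glusterfs/geo-replication/ instead):

```shell
# Sample "starting gsyncd worker" line, copied from the log above.
log_line='[2018-09-21 14:26:49.318953] I [monitor(monitor):158:monitor] Monitor: starting gsyncd worker   brick=/bricks/brick1/brick      slave_node=nrchbs-slp2020.nibr.novartis.net'

# Strip everything up to (and including) "slave_node=".
slave_node=${log_line##*slave_node=}
echo "Check geo-replication logs on: $slave_node"
```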

 

The permissions look fine. The replication runs as a geo user (geoaccount) instead of root. It should be able to read the path, but I'm not sure whether the syncdaemon actually runs as geoaccount!?

 

[root@master vRealize Operation Manager]# ll /bricks/brick1/brick/.glusterfs/29/d1/29d1d60d-1ad6-45fc-87e0-93d478f7331e

lrwxrwxrwx. 1 root root 75 Sep 21 09:39 /bricks/brick1/brick/.glusterfs/29/d1/29d1d60d-1ad6-45fc-87e0-93d478f7331e -> ../../6b/97/6b97b987-8aef-46c3-af27-20d3aa883016/vRealize Operation Manager

 

[root@master vRealize Operation Manager]# ll /bricks/brick1/brick/.glusterfs/29/d1/29d1d60d-1ad6-45fc-87e0-93d478f7331e/

total 4

drwxrwxr-x. 2 AD+user AD+group  131 Sep 21 10:14 6.7

drwxrwxr-x. 2 AD+user AD+group 4096 Sep 21 09:43 7.0

drwxrwxr-x. 2 AD+user AD+group   57 Sep 21 10:28 7.5

[root@master vRealize Operation Manager]#
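Whether geoaccount can actually reach that file depends on the execute (x) bit on every directory component leading to it, not just on the final entry; a single non-traversable parent is enough for Errno 13. A hypothetical Python sketch to list mode and owner of each component ('/var/log' is just a stand-in for the brick path on the slave):

```python
import os
import stat

def check_traversal(path):
    """Return (path, mode, uid, gid) for each component of *path*.

    A directory missing the execute bit for the user the syncdaemon
    runs as will raise OSError [Errno 13] even when the final file
    itself looks readable.
    """
    entries = []
    cur = os.sep
    for comp in path.strip(os.sep).split(os.sep):
        cur = os.path.join(cur, comp)
        st = os.lstat(cur)
        entries.append((cur, stat.filemode(st.st_mode),
                        st.st_uid, st.st_gid))
    return entries

# Example: walk a path that exists on most systems.
for p, mode, uid, gid in check_traversal('/var/log'):
    print(p, mode, uid, gid)
```

Running this as root on the slave against the full .glusterfs path (or simply `sudo -u geoaccount ls <path>`) would show whether geoaccount is blocked somewhere along the chain.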

 

It's possible that the folder was renamed. I've had three similar issues since migrating to GlusterFS 4.x but couldn't investigate much. I had to completely wipe GlusterFS and the geo-replication setup to get rid of this error…

 

Any help is appreciated.

 

Regards,

 

Christian Kotte

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users