Re: Geo-replication completely broken

Hey Rob,


Same issue here with our third volume. Have a look at the logs from just now (below).

Question: you removed the htime files and the old changelogs. Did you simply rm the files, or is there anything to pay particular attention to before removing the changelog files and the htime file?
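
For context, this is roughly the sequence I would expect (untested sketch; MASTERVOL, SLAVEHOST, SLAVEVOL and the brick path are placeholders, not what you actually ran):

# 1) stop the geo-rep session so nothing is reading the changelogs
gluster volume geo-replication MASTERVOL SLAVEHOST::SLAVEVOL stop

# 2) disable the changelog-related options on the master volume
#    (resetting indexing may require the session to be deleted first)
gluster volume set MASTERVOL changelog.changelog off
gluster volume set MASTERVOL geo-replication.indexing off

# 3) on every master brick, remove the old changelogs and htime files,
#    which live under the brick's .glusterfs directory
rm -f /path/to/brick/.glusterfs/changelogs/CHANGELOG.*
rm -f /path/to/brick/.glusterfs/changelogs/htime/HTIME.*

# 4) re-enable changelog and restart the session
gluster volume set MASTERVOL changelog.changelog on
gluster volume geo-replication MASTERVOL SLAVEHOST::SLAVEVOL start

Is that roughly what you did, or did you do it differently?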

Regards,

Felix

[2020-06-25 07:51:53.795430] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH: SSH connection between master and slave established.    duration=1.2341
[2020-06-25 07:51:53.795639] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER: Mounting gluster volume locally...
[2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase    brick=/gluster/vg01/dispersed_fuse1024/brick
[2020-06-25 07:51:54.535809] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change    status=Faulty
[2020-06-25 07:51:54.882143] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER: Mounted gluster volume    duration=1.0864
[2020-06-25 07:51:54.882388] I [subcmds(worker /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2020-06-25 07:51:56.911412] E [repce(agent /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, in register
    return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, in cl_register
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2020-06-25 07:51:56.912056] E [repce(worker /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient: call failed    call=75086:140098349655872:1593071514.91    method=register    error=ChangelogException
[2020-06-25 07:51:56.912396] E [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop] GLUSTER: Changelog register failed    error=[Errno 2] No such file or directory
[2020-06-25 07:51:56.928031] I [repce(agent /gluster/vg00/dispersed_fuse1024/brick):96:service_loop] RepceServer: terminating on reaching EOF.
[2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase    brick=/gluster/vg00/dispersed_fuse1024/brick
[2020-06-25 07:51:57.895920] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change    status=Faulty
[2020-06-25 07:51:58.607405] I [gsyncdstatus(worker /gluster/vg00/dispersed_fuse1024/brick):287:set_passive] GeorepStatus: Worker Status Change    status=Passive
[2020-06-25 07:51:58.607768] I [gsyncdstatus(worker /gluster/vg01/dispersed_fuse1024/brick):287:set_passive] GeorepStatus: Worker Status Change    status=Passive
[2020-06-25 07:51:58.608004] I [gsyncdstatus(worker /gluster/vg00/dispersed_fuse1024/brick):281:set_active] GeorepStatus: Worker Status Change    status=Active
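
In case it is useful for anyone following along: the ENOENT above is raised by cl_register, so before removing anything I would first verify that the changelog translator's on-disk state exists on the faulty brick and that the option is still enabled. Something along these lines (brick path from our layout; MASTERVOL is a placeholder for the volume name):

# does the brick still have its changelog directory and htime files?
ls -l /gluster/vg00/dispersed_fuse1024/brick/.glusterfs/changelogs
ls -l /gluster/vg00/dispersed_fuse1024/brick/.glusterfs/changelogs/htime

# is the changelog option actually enabled on the volume?
gluster volume get MASTERVOL changelog.changelog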


On 25/06/2020 09:15, Rob.Quagliozzi@xxxxxxxxxxxx wrote:

Hi All,

 

We've got two six-node RHEL 7.8 clusters, and geo-replication appears to be completely broken between them. I've deleted the session, removed and recreated the pem files, removed the old changelogs/htime (after removing the relevant options from the volume), and set up geo-rep again from scratch (commands roughly as in the sketch after the logs below), but the new session comes up as Initializing, then goes Faulty and starts looping. The volume (on both sides) is a 4 x 2 disperse running Gluster v6 (latest RH build). Gsyncd reports:

 

[2020-06-25 07:07:14.701423] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...
[2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor] Monitor: starting gsyncd worker   brick=/rhgs/brick20/brick       slave_node=bxts470194.eu.rabonet.com
[2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor] Monitor: Worker would mount volume privately
[2020-06-25 07:07:14.757181] I [gsyncd(agent /rhgs/brick20/brick):318:main] <top>: Using session config file    path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
[2020-06-25 07:07:14.758126] D [subcmds(agent /rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD      rpc_fd='5,12,11,10'
[2020-06-25 07:07:14.758627] I [changelogagent(agent /rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...
[2020-06-25 07:07:14.764234] I [gsyncd(worker /rhgs/brick20/brick):318:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
[2020-06-25 07:07:14.779409] I [resource(worker /rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH connection between master and slave...
[2020-06-25 07:07:14.841793] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__() ...
[2020-06-25 07:07:16.148725] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__ -> 1.0
[2020-06-25 07:07:16.148911] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068836.15 version() ...
[2020-06-25 07:07:16.149574] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068836.15 version -> 1.0
[2020-06-25 07:07:16.149735] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068836.15 pid() ...
[2020-06-25 07:07:16.150588] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068836.15 pid -> 30703
[2020-06-25 07:07:16.150747] I [resource(worker /rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection between master and slave established.     duration=1.3712
[2020-06-25 07:07:16.150819] I [resource(worker /rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster volume locally...
[2020-06-25 07:07:16.265860] D [resource(worker /rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary glusterfs mount in place
[2020-06-25 07:07:17.272511] D [resource(worker /rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary glusterfs mount prepared
[2020-06-25 07:07:17.272708] I [resource(worker /rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster volume      duration=1.1218
[2020-06-25 07:07:17.272794] I [subcmds(worker /rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2020-06-25 07:07:17.272973] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=xsync
[2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor] Monitor: worker(/rhgs/brick20/brick) connected
[2020-06-25 07:07:17.273678] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=changelog
[2020-06-25 07:07:17.274224] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=changeloghistory
[2020-06-25 07:07:17.276484] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 version() ...
[2020-06-25 07:07:17.276916] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 version -> 1.0
[2020-06-25 07:07:17.277009] D [master(worker /rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog working dir /var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
[2020-06-25 07:07:17.277098] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 init() ...
[2020-06-25 07:07:17.292944] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 init -> None
[2020-06-25 07:07:17.293097] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.29 register('/rhgs/brick20/brick', '/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick', '/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log', 8, 5) ...
[2020-06-25 07:07:19.296294] E [repce(agent /rhgs/brick20/brick):121:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, in register
    return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, in cl_register
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2020-06-25 07:07:19.297161] E [repce(worker /rhgs/brick20/brick):213:__call__] RepceClient: call failed        call=6799:140380783982400:1593068837.29 method=register error=ChangelogException
[2020-06-25 07:07:19.297338] E [resource(worker /rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog register failed      error=[Errno 2] No such file or directory
[2020-06-25 07:07:19.315074] I [repce(agent /rhgs/brick20/brick):96:service_loop] RepceServer: terminating on reaching EOF.
[2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase     brick=/rhgs/brick20/brick
[2020-06-25 07:07:20.277383] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty

 

We've done everything we can think of, including running "strace -f" on the pid, and we can't really find anything. I'm about to lose the last of my hair over this, so does anyone have any ideas at all? We've even removed the entire slave volume and rebuilt it.
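
For completeness, the from-scratch setup I ran was essentially the standard sequence, roughly as below (sketch only; MASTERVOL, SLAVEHOST and SLAVEVOL stand in for our real names):

# delete the old session and its sync metadata
gluster volume geo-replication MASTERVOL SLAVEHOST::SLAVEVOL delete reset-sync-time

# regenerate the pem keys on the master and recreate the session
gluster system:: execute gsec_create
gluster volume geo-replication MASTERVOL SLAVEHOST::SLAVEVOL create push-pem force

# start it and watch it loop Initializing -> Faulty
gluster volume geo-replication MASTERVOL SLAVEHOST::SLAVEVOL start
gluster volume geo-replication MASTERVOL SLAVEHOST::SLAVEVOL status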

 

Thanks

Rob

 

Rob Quagliozzi

Specialised Application Support



 



 


________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
