Hi Felix,

It seems I missed your reply with the changelogs that Shwetha requested.

Best Regards,
Strahil Nikolov

On 3 July 2020 at 11:16:30 GMT+03:00, "Felix Kölzow" <felix.koelzow@xxxxxx> wrote:
>Dear Users,
>
>the geo-replication is still broken. This is not really a comfortable
>situation.
>
>Has any other user had the same experience and been able to share a
>possible workaround?
>
>We are currently running Gluster v6.0.
>
>Regards,
>
>Felix
>
>
>On 25/06/2020 10:04, Shwetha Acharya wrote:
>> Hi Rob and Felix,
>>
>> Please share the *-changes.log files and brick logs, which will help
>> in the analysis of the issue.
>>
>> Regards,
>> Shwetha
>>
>> On Thu, Jun 25, 2020 at 1:26 PM Felix Kölzow <felix.koelzow@xxxxxx> wrote:
>>
>> Hey Rob,
>>
>> same issue for our third volume. Have a look at the logs from just
>> now (below).
>>
>> Question: You removed the htime files and the old changelogs. Did you
>> just rm the files, or is there something to pay more attention to
>> before removing the changelog files and the htime file?
>>
>> Regards,
>>
>> Felix
>>
>> [2020-06-25 07:51:53.795430] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH: SSH connection between master and slave established. duration=1.2341
>> [2020-06-25 07:51:53.795639] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER: Mounting gluster volume locally...
>> [2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase brick=/gluster/vg01/dispersed_fuse1024/brick
>> [2020-06-25 07:51:54.535809] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
>> [2020-06-25 07:51:54.882143] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER: Mounted gluster volume duration=1.0864
>> [2020-06-25 07:51:54.882388] I [subcmds(worker /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
>> [2020-06-25 07:51:56.911412] E [repce(agent /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call failed:
>> Traceback (most recent call last):
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
>>     res = getattr(self.obj, rmeth)(*in_data[2:])
>>   File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, in register
>>     return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, in cl_register
>>     cls.raise_changelog_err()
>>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, in raise_changelog_err
>>     raise ChangelogException(errn, os.strerror(errn))
>> ChangelogException: [Errno 2] No such file or directory
>> [2020-06-25 07:51:56.912056] E [repce(worker /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient: call failed call=75086:140098349655872:1593071514.91 method=register error=ChangelogException
>> [2020-06-25 07:51:56.912396] E [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop] GLUSTER: Changelog register failed error=[Errno 2] No such file or directory
>> [2020-06-25 07:51:56.928031] I [repce(agent /gluster/vg00/dispersed_fuse1024/brick):96:service_loop] RepceServer: terminating on reaching EOF.
>> [2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase brick=/gluster/vg00/dispersed_fuse1024/brick
>> [2020-06-25 07:51:57.895920] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
>> [2020-06-25 07:51:58.607405] I [gsyncdstatus(worker /gluster/vg00/dispersed_fuse1024/brick):287:set_passive] GeorepStatus: Worker Status Change status=Passive
>> [2020-06-25 07:51:58.607768] I [gsyncdstatus(worker /gluster/vg01/dispersed_fuse1024/brick):287:set_passive] GeorepStatus: Worker Status Change status=Passive
>> [2020-06-25 07:51:58.608004] I [gsyncdstatus(worker /gluster/vg00/dispersed_fuse1024/brick):281:set_active] GeorepStatus: Worker Status Change status=Active
>>
>>
>> On 25/06/2020 09:15, Rob.Quagliozzi@xxxxxxxxxxxx wrote:
>>>
>>> Hi All,
>>>
>>> We've got two six-node RHEL 7.8 clusters, and geo-replication would
>>> appear to be completely broken between them. I've deleted the session,
>>> removed & recreated the pem files and the old changelogs/htime (after
>>> removing the relevant options from the volume) and completely set up
>>> geo-rep from scratch, but the new session comes up as Initializing,
>>> then goes Faulty, and starts looping. The volume (on both sides) is a
>>> 4 x 2 disperse, running Gluster v6 (RH latest). Gsyncd reports:
>>>
>>> [2020-06-25 07:07:14.701423] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...
>>> [2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor] Monitor: starting gsyncd worker brick=/rhgs/brick20/brick slave_node=bxts470194.eu.rabonet.com
>>> [2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor] Monitor: Worker would mount volume privately
>>> [2020-06-25 07:07:14.757181] I [gsyncd(agent /rhgs/brick20/brick):318:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
>>> [2020-06-25 07:07:14.758126] D [subcmds(agent /rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD rpc_fd='5,12,11,10'
>>> [2020-06-25 07:07:14.758627] I [changelogagent(agent /rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...
>>> [2020-06-25 07:07:14.764234] I [gsyncd(worker /rhgs/brick20/brick):318:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
>>> [2020-06-25 07:07:14.779409] I [resource(worker /rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH connection between master and slave...
>>> [2020-06-25 07:07:14.841793] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__() ...
>>> [2020-06-25 07:07:16.148725] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__ -> 1.0
>>> [2020-06-25 07:07:16.148911] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068836.15 version() ...
>>> [2020-06-25 07:07:16.149574] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068836.15 version -> 1.0
>>> [2020-06-25 07:07:16.149735] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068836.15 pid() ...
>>> [2020-06-25 07:07:16.150588] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068836.15 pid -> 30703
>>> [2020-06-25 07:07:16.150747] I [resource(worker /rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection between master and slave established. duration=1.3712
>>> [2020-06-25 07:07:16.150819] I [resource(worker /rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster volume locally...
>>> [2020-06-25 07:07:16.265860] D [resource(worker /rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary glusterfs mount in place
>>> [2020-06-25 07:07:17.272511] D [resource(worker /rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary glusterfs mount prepared
>>> [2020-06-25 07:07:17.272708] I [resource(worker /rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster volume duration=1.1218
>>> [2020-06-25 07:07:17.272794] I [subcmds(worker /rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
>>> [2020-06-25 07:07:17.272973] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=xsync
>>> [2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor] Monitor: worker(/rhgs/brick20/brick) connected
>>> [2020-06-25 07:07:17.273678] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=changelog
>>> [2020-06-25 07:07:17.274224] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=changeloghistory
>>> [2020-06-25 07:07:17.276484] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 version() ...
>>> [2020-06-25 07:07:17.276916] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 version -> 1.0
>>> [2020-06-25 07:07:17.277009] D [master(worker /rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog working dir /var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
>>> [2020-06-25 07:07:17.277098] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 init() ...
>>> [2020-06-25 07:07:17.292944] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 init -> None
>>> [2020-06-25 07:07:17.293097] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.29 register('/rhgs/brick20/brick', '/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick', '/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log', 8, 5) ...
>>> [2020-06-25 07:07:19.296294] E [repce(agent /rhgs/brick20/brick):121:worker] <top>: call failed:
>>> Traceback (most recent call last):
>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
>>>     res = getattr(self.obj, rmeth)(*in_data[2:])
>>>   File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, in register
>>>     return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
>>>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, in cl_register
>>>     cls.raise_changelog_err()
>>>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, in raise_changelog_err
>>>     raise ChangelogException(errn, os.strerror(errn))
>>> ChangelogException: [Errno 2] No such file or directory
>>> [2020-06-25 07:07:19.297161] E [repce(worker /rhgs/brick20/brick):213:__call__] RepceClient: call failed call=6799:140380783982400:1593068837.29 method=register error=ChangelogException
>>> [2020-06-25 07:07:19.297338] E [resource(worker /rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog register failed error=[Errno 2] No such file or directory
>>> [2020-06-25 07:07:19.315074] I [repce(agent /rhgs/brick20/brick):96:service_loop] RepceServer: terminating on reaching EOF.
>>> [2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase brick=/rhgs/brick20/brick
>>> [2020-06-25 07:07:20.277383] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
>>>
>>> We've done everything we can think of, including an "strace -f" on the
>>> pid, and we can't really find anything. I'm about to lose the last of
>>> my hair over this, so does anyone have any ideas at all? We've even
>>> removed the entire slave volume and rebuilt it.
>>>
>>> Thanks
>>>
>>> Rob
>>>
>>> *Rob Quagliozzi*
>>>
>>> *Specialised Application Support*
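For anyone hitting the same ChangelogException: the register() call is failing with ENOENT, which points at the brick-side changelog machinery rather than at the SSH or mount steps (both complete successfully in the logs above). Two things are worth checking on each master node. This is a rough diagnostic sketch only, not taken from the thread; the volume name and brick path are placeholders:

    # Changelogs are only produced while this option is "on"; geo-replication
    # cannot register against the brick's changelog if it is off.
    gluster volume get <mastervol> changelog.changelog

    # The changelog files and the HTIME index live inside the brick itself.
    # If they were deleted while the option stayed "on", the htime file the
    # changelog agent expects may be missing, which can surface as ENOENT.
    ls -l <brick-path>/.glusterfs/changelogs/
    ls -l <brick-path>/.glusterfs/changelogs/htime/

The teardown and re-create sequence Rob describes roughly corresponds to the sketch below. It is an assumption-laden outline: the session names are placeholders, and whether a plain rm of the old changelog/htime files is enough (which is exactly Felix's question) is not confirmed anywhere in this thread, so move the files aside rather than deleting them:

    # Stop and delete the existing session.
    gluster volume geo-replication <mastervol> <slavehost>::<slavevol> stop force
    gluster volume geo-replication <mastervol> <slavehost>::<slavevol> delete

    # Turn the changelog option off before touching the old changelogs,
    # so no new changelog/htime entries are written in the meantime.
    gluster volume set <mastervol> changelog.changelog off

    # Old changelogs and the htime index sit under each brick; back them up
    # instead of removing them outright.
    mv <brick-path>/.glusterfs/changelogs <brick-path>/.glusterfs/changelogs.bak

    # Recreate the pem files and the session, then re-enable the changelog
    # and start geo-replication again.
    gluster system:: execute gsec_create
    gluster volume geo-replication <mastervol> <slavehost>::<slavevol> create push-pem force
    gluster volume set <mastervol> changelog.changelog on
    gluster volume geo-replication <mastervol> <slavehost>::<slavevol> start
    gluster volume geo-replication <mastervol> <slavehost>::<slavevol> status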
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users