the geo-replication is still broken, which is not a comfortable situation.
Has any other user had the same experience, and can anyone share a possible workaround?
We are currently running Gluster v6.0.
Regards,
Felix
Hi Rob and Felix,
Please share the *-changes.log files and the brick logs, which will help us analyze the issue.
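On the master side these are the changes-<brick>.log files visible in the register() call in your paste (under /var/log/glusterfs/geo-replication/<session>/), and the brick logs are normally under /var/log/glusterfs/bricks/. Something like the small sketch below should be enough to collect them; the glob patterns are only inferred from your log and the default log locations, so adjust them if your layout differs:

#!/usr/bin/env python3
# Collect the requested logs into one tarball. The patterns are based on the
# changes-<brick>.log path shown in the pasted register() call and the default
# glusterfs log directory; adjust if your layout differs.
import glob
import tarfile

patterns = [
    "/var/log/glusterfs/geo-replication/*/changes-*.log",  # geo-rep changes logs
    "/var/log/glusterfs/bricks/*.log",                      # brick logs
]

with tarfile.open("georep-logs.tar.gz", "w:gz") as tar:
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            tar.add(path)
            print("added", path)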
Regards,
Shwetha
On Thu, Jun 25, 2020 at 1:26 PM Felix Kölzow <felix.koelzow@xxxxxx> wrote:
Hey Rob,
We see the same issue for our third volume. Have a look at the logs from just now (below).
Question: You removed the htime files and the old changelogs. Did you just rm the files, or is there anything to pay special attention to before removing the changelog files and the htime file?
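For what it's worth, this is roughly what I would have done on our side, written out as a small Python sketch rather than a bare rm, so please tell me if it misses a step. The volume name and slave URL are placeholders, and the assumption that the changelogs and htime files both live under <brick>/.glusterfs/changelogs is just my understanding, not something I have verified against the code:

#!/usr/bin/env python3
# Sketch of archiving (instead of rm'ing) the changelog data for one brick.
# Volume name, slave URL and brick path are placeholders for our setup; the
# .glusterfs/changelogs location (which should also hold the htime files) is
# an assumption on my part.
import shutil
import subprocess
import time

MASTER_VOL = "mastervol"                  # placeholder
SLAVE = "slavehost::slavevol"             # placeholder
BRICK = "/gluster/vg00/dispersed_fuse1024/brick"

# 1. Stop the geo-rep session and switch changelog generation off first, so
#    nothing is writing to these files while they are moved away.
subprocess.run(["gluster", "volume", "geo-replication", MASTER_VOL, SLAVE, "stop"], check=True)
subprocess.run(["gluster", "volume", "set", MASTER_VOL, "changelog.changelog", "off"], check=True)

# 2. Move the changelog directory aside instead of deleting it, so it can be
#    put back if something goes wrong.
changelog_dir = f"{BRICK}/.glusterfs/changelogs"
backup_dir = f"{BRICK}/.glusterfs/changelogs.bak.{int(time.time())}"
shutil.move(changelog_dir, backup_dir)
print(f"archived {changelog_dir} -> {backup_dir}")

# 3. Re-enable changelog and restart geo-rep afterwards; I assume the brick
#    recreates the directory when the option is switched back on.
subprocess.run(["gluster", "volume", "set", MASTER_VOL, "changelog.changelog", "on"], check=True)
subprocess.run(["gluster", "volume", "geo-replication", MASTER_VOL, SLAVE, "start"], check=True)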
Regards,
Felix
[2020-06-25 07:51:53.795430] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH: SSH connection between master and slave established. duration=1.2341
[2020-06-25 07:51:53.795639] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER: Mounting gluster volume locally...
[2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase brick=/gluster/vg01/dispersed_fuse1024/brick
[2020-06-25 07:51:54.535809] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2020-06-25 07:51:54.882143] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER: Mounted gluster volume duration=1.0864
[2020-06-25 07:51:54.882388] I [subcmds(worker /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2020-06-25 07:51:56.911412] E [repce(agent /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call failed:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
res = getattr(self.obj, rmeth)(*in_data[2:])
File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, in register
return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, in cl_register
cls.raise_changelog_err()
File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, in raise_changelog_err
raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2020-06-25 07:51:56.912056] E [repce(worker /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient: call failed call=75086:140098349655872:1593071514.91 method=register error=ChangelogException
[2020-06-25 07:51:56.912396] E [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop] GLUSTER: Changelog register failed error=[Errno 2] No such file or directory
[2020-06-25 07:51:56.928031] I [repce(agent /gluster/vg00/dispersed_fuse1024/brick):96:service_loop] RepceServer: terminating on reaching EOF.
[2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase brick=/gluster/vg00/dispersed_fuse1024/brick
[2020-06-25 07:51:57.895920] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2020-06-25 07:51:58.607405] I [gsyncdstatus(worker /gluster/vg00/dispersed_fuse1024/brick):287:set_passive] GeorepStatus: Worker Status Change status=Passive
[2020-06-25 07:51:58.607768] I [gsyncdstatus(worker /gluster/vg01/dispersed_fuse1024/brick):287:set_passive] GeorepStatus: Worker Status Change status=Passive
[2020-06-25 07:51:58.608004] I [gsyncdstatus(worker /gluster/vg00/dispersed_fuse1024/brick):281:set_active] GeorepStatus: Worker Status Change status=Active
On 25/06/2020 09:15, Rob.Quagliozzi@xxxxxxxxxxxx wrote:
Hi All,
We've got two six-node RHEL 7.8 clusters, and geo-replication appears to be completely broken between them. I've deleted the session, removed and recreated the pem files and the old changelogs/htime (after removing the relevant options from the volume), and set up geo-rep again completely from scratch, but the new session comes up as Initializing, then goes Faulty and starts looping. The volume (on both sides) is a 4 x 2 disperse, running Gluster v6 (RH latest). Gsyncd reports:
[2020-06-25 07:07:14.701423] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...
[2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor] Monitor: starting gsyncd worker brick=/rhgs/brick20/brick slave_node=bxts470194.eu.rabonet.com
[2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor] Monitor: Worker would mount volume privately
[2020-06-25 07:07:14.757181] I [gsyncd(agent /rhgs/brick20/brick):318:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
[2020-06-25 07:07:14.758126] D [subcmds(agent /rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD rpc_fd='5,12,11,10'
[2020-06-25 07:07:14.758627] I [changelogagent(agent /rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...
[2020-06-25 07:07:14.764234] I [gsyncd(worker /rhgs/brick20/brick):318:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
[2020-06-25 07:07:14.779409] I [resource(worker /rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH connection between master and slave...
[2020-06-25 07:07:14.841793] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__() ...
[2020-06-25 07:07:16.148725] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__ -> 1.0
[2020-06-25 07:07:16.148911] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068836.15 version() ...
[2020-06-25 07:07:16.149574] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068836.15 version -> 1.0
[2020-06-25 07:07:16.149735] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068836.15 pid() ...
[2020-06-25 07:07:16.150588] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068836.15 pid -> 30703
[2020-06-25 07:07:16.150747] I [resource(worker /rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection between master and slave established. duration=1.3712
[2020-06-25 07:07:16.150819] I [resource(worker /rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster volume locally...
[2020-06-25 07:07:16.265860] D [resource(worker /rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary glusterfs mount in place
[2020-06-25 07:07:17.272511] D [resource(worker /rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary glusterfs mount prepared
[2020-06-25 07:07:17.272708] I [resource(worker /rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster volume duration=1.1218
[2020-06-25 07:07:17.272794] I [subcmds(worker /rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2020-06-25 07:07:17.272973] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=xsync
[2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor] Monitor: worker(/rhgs/brick20/brick) connected
[2020-06-25 07:07:17.273678] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=changelog
[2020-06-25 07:07:17.274224] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=changeloghistory
[2020-06-25 07:07:17.276484] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 version() ...
[2020-06-25 07:07:17.276916] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 version -> 1.0
[2020-06-25 07:07:17.277009] D [master(worker /rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog working dir /var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
[2020-06-25 07:07:17.277098] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 init() ...
[2020-06-25 07:07:17.292944] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 init -> None
[2020-06-25 07:07:17.293097] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.29 register('/rhgs/brick20/brick', '/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick', '/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log', 8, 5) ...
[2020-06-25 07:07:19.296294] E [repce(agent /rhgs/brick20/brick):121:worker] <top>: call failed:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
res = getattr(self.obj, rmeth)(*in_data[2:])
File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, in register
return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, in cl_register
cls.raise_changelog_err()
File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, in raise_changelog_err
raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2020-06-25 07:07:19.297161] E [repce(worker /rhgs/brick20/brick):213:__call__] RepceClient: call failed call=6799:140380783982400:1593068837.29 method=register error=ChangelogException
[2020-06-25 07:07:19.297338] E [resource(worker /rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog register failed error=[Errno 2] No such file or directory
[2020-06-25 07:07:19.315074] I [repce(agent /rhgs/brick20/brick):96:service_loop] RepceServer: terminating on reaching EOF.
[2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase brick=/rhgs/brick20/brick
[2020-06-25 07:07:20.277383] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
We've done everything we can think of, including an "strace -f" on the pid, and we can't really find anything. I'm about to lose the last of my hair over this, so does anyone have any ideas at all? We've even removed the entire slave volume and rebuilt it.
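In case it helps, the one quick check we put together while digging was the sketch below: it just verifies that each brick still has the on-disk changelog directory and htime files that the failing register() call appears to expect. The brick paths and the .glusterfs/changelogs layout are our assumptions for illustration, not anything confirmed from the gsyncd code:

#!/usr/bin/env python3
# Sanity check: does every brick still have the changelog data that the
# "Changelog register failed ... No such file or directory" error seems to
# point at? Brick paths and the .glusterfs/changelogs layout are assumptions.
import os

BRICKS = [
    "/rhgs/brick20/brick",   # placeholder: list all brick paths here
]

for brick in BRICKS:
    changelog_dir = os.path.join(brick, ".glusterfs", "changelogs")
    htime_dir = os.path.join(changelog_dir, "htime")
    if not os.path.isdir(changelog_dir):
        print(f"{brick}: MISSING changelog directory {changelog_dir}")
        continue
    htime_files = sorted(os.listdir(htime_dir)) if os.path.isdir(htime_dir) else []
    print(f"{brick}: {len(os.listdir(changelog_dir))} entries under changelogs, "
          f"htime files: {htime_files if htime_files else 'NONE'}")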
Thanks
Rob
Rob Quagliozzi
Specialised Application Support
Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users