Re: Geo-replication completely broken

Shwetha Acharya <sacharya@xxxxxxxxxx> · Thu, 25 Jun 2020 13:34:23 +0530

Hi Rob and Felix,

Please share the *-changes.log files and brick logs, which will help in analysis of the issue.
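
A minimal sketch for gathering those files into one archive, in case it saves some typing. It assumes the default log root /var/log/glusterfs, with the changes logs under geo-replication/<session>/ (as in the register() call quoted below) and the brick logs under bricks/. The function name collect_logs and the glob patterns are illustrative only, so adjust them to your layout.

```python
import glob
import os
import tarfile


def collect_logs(log_root, out_path,
                 patterns=("geo-replication/*/*changes*.log", "bricks/*.log")):
    """Bundle files under log_root that match the glob patterns into a .tar.gz.

    The default patterns are assumptions about a stock Gluster log layout.
    Returns the archived paths, relative to log_root.
    """
    matched = []
    for pattern in patterns:
        matched.extend(glob.glob(os.path.join(log_root, pattern)))
    matched = sorted(set(matched))

    with tarfile.open(out_path, "w:gz") as tar:
        for path in matched:
            # Store paths relative to log_root so the archive unpacks cleanly.
            tar.add(path, arcname=os.path.relpath(path, log_root))

    return [os.path.relpath(path, log_root) for path in matched]
```

For example, collect_logs("/var/log/glusterfs", "/tmp/georep-logs.tar.gz") run on each master node would produce one archive per node to attach to a reply.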

Regards,
Shwetha

On Thu, Jun 25, 2020 at 1:26 PM Felix Kölzow <felix.koelzow@xxxxxx> wrote:

    Hey Rob,

    Same issue for our third volume. Have a look at the logs from
    just now (below).
    Question: you removed the htime files and the old changelogs.
    Did you simply rm the files, or is there anything that needs
    attention before removing the changelog files and the htime file?
    Regards,
    Felix

    [2020-06-25 07:51:53.795430] I [resource(worker
      /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH:
      SSH connection between master and slave established.   
      duration=1.2341

      [2020-06-25 07:51:53.795639] I [resource(worker
      /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER:
      Mounting gluster volume locally...

      [2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor]
      Monitor: worker died in startup phase   
      brick=/gluster/vg01/dispersed_fuse1024/brick

      [2020-06-25 07:51:54.535809] I
      [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
      Status Change    status=Faulty

      [2020-06-25 07:51:54.882143] I [resource(worker
      /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER:
      Mounted gluster volume    duration=1.0864

      [2020-06-25 07:51:54.882388] I [subcmds(worker
      /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker]
      <top>: Worker spawn successful. Acknowledging back to
      monitor

      [2020-06-25 07:51:56.911412] E [repce(agent
      /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>:
      call failed: 

      Traceback (most recent call last):

        File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line
      117, in worker

          res = getattr(self.obj, rmeth)(*in_data[2:])

        File
      "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line
      40, in register

          return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level,
      retries)

        File
      "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
      46, in cl_register

          cls.raise_changelog_err()

        File
      "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
      30, in raise_changelog_err

          raise ChangelogException(errn, os.strerror(errn))

      ChangelogException: [Errno 2] No such file or directory

      [2020-06-25 07:51:56.912056] E [repce(worker
      /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient:
      call failed    call=75086:140098349655872:1593071514.91   
      method=register    error=ChangelogException

      [2020-06-25 07:51:56.912396] E [resource(worker
      /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop]
      GLUSTER: Changelog register failed    error=[Errno 2] No such file
      or directory

      [2020-06-25 07:51:56.928031] I [repce(agent
      /gluster/vg00/dispersed_fuse1024/brick):96:service_loop]
      RepceServer: terminating on reaching EOF.

      [2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor]
      Monitor: worker died in startup phase   
      brick=/gluster/vg00/dispersed_fuse1024/brick

      [2020-06-25 07:51:57.895920] I
      [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
      Status Change    status=Faulty

      [2020-06-25 07:51:58.607405] I [gsyncdstatus(worker
      /gluster/vg00/dispersed_fuse1024/brick):287:set_passive]
      GeorepStatus: Worker Status Change    status=Passive

      [2020-06-25 07:51:58.607768] I [gsyncdstatus(worker
      /gluster/vg01/dispersed_fuse1024/brick):287:set_passive]
      GeorepStatus: Worker Status Change    status=Passive

      [2020-06-25 07:51:58.608004] I [gsyncdstatus(worker
      /gluster/vg00/dispersed_fuse1024/brick):281:set_active]
      GeorepStatus: Worker Status Change    status=Active

    On 25/06/2020 09:15,
      Rob.Quagliozzi@xxxxxxxxxxxx wrote:

        Hi All,

        We’ve got two six-node RHEL 7.8 clusters, and geo-replication
          appears to be completely broken between them. I’ve deleted
          the session, removed & recreated the pem files, removed the
          old changelogs/htime (after removing the relevant options from
          the volume) and set up geo-rep completely from scratch, but
          the new session comes up as Initializing, then goes Faulty,
          and starts looping. The volume (on both sides) is a 4 x 2
          disperse, running Gluster v6 (RH latest). Gsyncd reports:

        [2020-06-25 07:07:14.701423] I
          [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus:
          Worker Status Change status=Initializing...
        [2020-06-25 07:07:14.701744] I
          [monitor(monitor):159:monitor] Monitor: starting gsyncd
          worker   brick=/rhgs/brick20/brick      
          slave_node=bxts470194.eu.rabonet.com
        [2020-06-25 07:07:14.707997] D
          [monitor(monitor):230:monitor] Monitor: Worker would mount
          volume privately
        [2020-06-25 07:07:14.757181] I
          [gsyncd(agent /rhgs/brick20/brick):318:main] <top>:
          Using session config file   
path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
        [2020-06-25 07:07:14.758126] D
          [subcmds(agent /rhgs/brick20/brick):107:subcmd_agent]
          <top>: RPC FD      rpc_fd='5,12,11,10'
        [2020-06-25 07:07:14.758627] I
          [changelogagent(agent /rhgs/brick20/brick):72:__init__]
          ChangelogAgent: Agent listining...
        [2020-06-25 07:07:14.764234] I
          [gsyncd(worker /rhgs/brick20/brick):318:main] <top>:
          Using session config file  
path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
        [2020-06-25 07:07:14.779409] I
          [resource(worker /rhgs/brick20/brick):1386:connect_remote]
          SSH: Initializing SSH connection between master and slave...
        [2020-06-25 07:07:14.841793] D
          [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call
          6799:140380783982400:1593068834.84 __repce_version__() ...
        [2020-06-25 07:07:16.148725] D
          [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient:
          call 6799:140380783982400:1593068834.84 __repce_version__
          -> 1.0
        [2020-06-25 07:07:16.148911] D
          [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call
          6799:140380783982400:1593068836.15 version() ...
        [2020-06-25 07:07:16.149574] D
          [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient:
          call 6799:140380783982400:1593068836.15 version -> 1.0
        [2020-06-25 07:07:16.149735] D
          [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call
          6799:140380783982400:1593068836.15 pid() ...
        [2020-06-25 07:07:16.150588] D
          [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient:
          call 6799:140380783982400:1593068836.15 pid -> 30703
        [2020-06-25 07:07:16.150747] I
          [resource(worker /rhgs/brick20/brick):1435:connect_remote]
          SSH: SSH connection between master and slave established.    
          duration=1.3712
        [2020-06-25 07:07:16.150819] I
          [resource(worker /rhgs/brick20/brick):1105:connect] GLUSTER:
          Mounting gluster volume locally...
        [2020-06-25 07:07:16.265860] D
          [resource(worker /rhgs/brick20/brick):879:inhibit]
          DirectMounter: auxiliary glusterfs mount in place
        [2020-06-25 07:07:17.272511] D
          [resource(worker /rhgs/brick20/brick):953:inhibit]
          DirectMounter: auxiliary glusterfs mount prepared
        [2020-06-25 07:07:17.272708] I
          [resource(worker /rhgs/brick20/brick):1128:connect] GLUSTER:
          Mounted gluster volume      duration=1.1218
        [2020-06-25 07:07:17.272794] I
          [subcmds(worker /rhgs/brick20/brick):84:subcmd_worker]
          <top>: Worker spawn successful. Acknowledging back to
          monitor
        [2020-06-25 07:07:17.272973] D
          [master(worker /rhgs/brick20/brick):104:gmaster_builder]
          <top>: setting up change detection mode mode=xsync
        [2020-06-25 07:07:17.273063] D
          [monitor(monitor):273:monitor] Monitor:
          worker(/rhgs/brick20/brick) connected
        [2020-06-25 07:07:17.273678] D
          [master(worker /rhgs/brick20/brick):104:gmaster_builder]
          <top>: setting up change detection mode mode=changelog
        [2020-06-25 07:07:17.274224] D
          [master(worker /rhgs/brick20/brick):104:gmaster_builder]
          <top>: setting up change detection mode
          mode=changeloghistory
        [2020-06-25 07:07:17.276484] D
          [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call
          6799:140380783982400:1593068837.28 version() ...
        [2020-06-25 07:07:17.276916] D
          [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient:
          call 6799:140380783982400:1593068837.28 version -> 1.0
        [2020-06-25 07:07:17.277009] D
          [master(worker /rhgs/brick20/brick):777:setup_working_dir]
          _GMaster: changelog working dir
/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
        [2020-06-25 07:07:17.277098] D
          [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call
          6799:140380783982400:1593068837.28 init() ...
        [2020-06-25 07:07:17.292944] D
          [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient:
          call 6799:140380783982400:1593068837.28 init -> None
        [2020-06-25 07:07:17.293097] D
          [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call
          6799:140380783982400:1593068837.29
          register('/rhgs/brick20/brick',
'/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick',
'/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log',
          8, 5) ...
        [2020-06-25 07:07:19.296294] E [repce(agent
          /rhgs/brick20/brick):121:worker] <top>: call failed:
        Traceback (most recent call last):
          File
          "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117,
          in worker
            res = getattr(self.obj,
          rmeth)(*in_data[2:])
          File
          "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py",
          line 40, in register
            return Changes.cl_register(cl_brick,
          cl_dir, cl_log, cl_level, retries)
          File
          "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
          line 46, in cl_register
            cls.raise_changelog_err()
          File
          "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
          line 30, in raise_changelog_err
            raise ChangelogException(errn,
          os.strerror(errn))
        ChangelogException: [Errno 2] No such file
          or directory
        [2020-06-25 07:07:19.297161] E
          [repce(worker /rhgs/brick20/brick):213:__call__] RepceClient:
          call failed        call=6799:140380783982400:1593068837.29
          method=register error=ChangelogException
        [2020-06-25 07:07:19.297338] E
          [resource(worker /rhgs/brick20/brick):1286:service_loop]
          GLUSTER: Changelog register failed      error=[Errno 2] No
          such file or directory
        [2020-06-25 07:07:19.315074] I [repce(agent
          /rhgs/brick20/brick):96:service_loop] RepceServer: terminating
          on reaching EOF.
        [2020-06-25 07:07:20.275701] I
          [monitor(monitor):280:monitor] Monitor: worker died in startup
          phase     brick=/rhgs/brick20/brick
        [2020-06-25 07:07:20.277383] I
          [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus:
          Worker Status Change status=Faulty
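
The traceback reports only [Errno 2] and does not say which path was missing. As one possible first check (a sketch, not a diagnosis), the snippet below tests the three paths visible in the register(...) call above, plus the changelog and htime directories on the brick. The location <brick>/.glusterfs/changelogs is an assumption about where the changelog translator stores its data, and missing_register_paths is a hypothetical helper name.

```python
import os


def missing_register_paths(brick, working_dir, log_file):
    """Return {label: path} for every checked path that does not exist."""
    candidates = {
        "brick": brick,
        "working_dir": working_dir,
        "log_dir": os.path.dirname(log_file),
        # Assumption: changelogs live under <brick>/.glusterfs/changelogs,
        # with the HTIME index files in its htime/ subdirectory.
        "changelog_dir": os.path.join(brick, ".glusterfs", "changelogs"),
        "htime_dir": os.path.join(brick, ".glusterfs", "changelogs", "htime"),
    }
    return {label: path for label, path in candidates.items()
            if not os.path.exists(path)}
```

An empty result only means these particular paths exist; the ENOENT could still come from something else that the changelog library tries to open.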

        We’ve done everything we can think of,
          including an “strace -f” on the pid, and we can’t really find
          anything. I’m about to lose the last of my hair over this, so
          does anyone have any ideas at all? We’ve even removed the
          entire slave vol and rebuilt it.

        Thanks
        Rob

        Rob Quagliozzi
        Specialised Application Support


      ________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
