Hi Shwetha,
thank you for your reply...
I ran a few tests in debug mode but found no real indication of the problem. After each start of the geo-replication some files are transferred at the beginning, and then no further transfers happen.
A few minutes after the start, the number of changelog files in <brick> looks like this (a sketch of such a counting script follows the listing):
[ 06:42:26 ] - root@gl-master-02 ~/tmp
$./var_gluster.sh
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processing
: 27 ### growing the whole time
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history
: 324861
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processed
: 1
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing
: 324859 ### finished building changelog files
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/xsync
: 0
---
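For reference, var_gluster.sh itself is not shown above; a minimal sketch of what such a counting script might look like (the working-directory path is the one from the listing, everything else is an assumption):

#!/bin/bash
# count the changelog files in each gsyncd working sub-directory
# (path taken from the listing above; adjust per brick)
BASE=/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1
for d in .processed .processing .history .history/.processed .history/.processing .history/.current .current xsync; do
    printf '%s\n: %s\n' "$BASE/$d" "$(find "$BASE/$d" -type f 2>/dev/null | wc -l)"
done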
As far as I remember, at the very beginning I saw a few changelog files in <brick>/.processed for a short moment, but always with size 0. Even after some hours there are no files in <brick>/.processed.
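For reference, something like this can be used to keep an eye on that directory (path is one of the worker directories from the listing above):

# re-check the number of archived changelogs every minute
watch -n 60 'ls /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed | wc -l'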
In strace there are a lot of messages like 'failed: No data available' and 'rsync error: some files/attrs were not transferred ... (code 23)' for roughly the first 5-10 minutes after the geo-rep start.
For example, gfid 8d601e5b-180c.... :
19361 1615530800.812727 select(7, [6], [],
[], NULL <unfinished ...>
19357 1615530800.812779 select(0, NULL, NULL, NULL, {tv_sec=0,
tv_usec=235797} <unfinished ...>
19352 1615530800.816522
lstat(".gfid/f0ed7d0e-83be-4c3f-b2c8-f763e9aada12",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.817723
lstat(".gfid/c5b44852-9cf9-441b-8766-d87bfac774c8",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.819507
lstat(".gfid/b0b71bcc-7653-4ab8-b863-a83d395f5e91",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.821106
lstat(".gfid/c4e80ff5-2e08-4e68-9a4d-ea7f45ed290d",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.822874
lstat(".gfid/aaa468d8-9d6a-4aaf-8344-c57440286f5c",
<unfinished ...>
19418 1615530800.823466 <... restart_syscall resumed> ) =
1
19418 1615530800.823519 read(14, "rsync: get_xattr_data:
lgetxattr(\"\"/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234\"\",\"trusted.glusterfs.mdata\",0)
failed: No data available (61)\n", 32768) = 171
19418 1615530800.823587 poll([{fd=14, events=POLLIN}], 1, -1
<unfinished ...>
19352 1615530800.823830 <... lstat resumed>
{st_mode=S_IFREG|0644, st_size=4226767, ...}) = 0
19352 1615530800.823882
lstat(".gfid/8164ea3b-44f6-4ea2-a75f-501cea0024cc",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.897938
lstat(".gfid/01da73ae-1f88-498d-8fe5-84ea76db12f3",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.934281
lstat(".gfid/be48f891-cdc1-4e4c-a141-7001ae3f592e",
{st_mode=S_IFDIR|0777, st_size=4096, ...}) = 0
19352 1615530800.935938
lstat(".gfid/501fab77-5e83-42cb-9edf-ce30bc3a86a9",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.937481
lstat(".gfid/668f6bd1-bdb0-46e0-9cd4-c7ebc38fbaf9",
<unfinished ...>
19417 1615530800.961937 <... restart_syscall resumed> ) =
1
19417 1615530800.962042 read(13, "rsync error: some files/attrs
were not transferred (see previous errors) (code 23) at
main.c(1196) [sender=3.1.2]\n", 32768) = 114
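Side note on what that message means: errno 61 on Linux is ENODATA ("No data available"), i.e. lgetxattr() simply found no such extended attribute on the file. A quick way to confirm the mapping:

# errno 61 on Linux is ENODATA; for lgetxattr() it just means the xattr is not present
python3 -c 'import errno, os; print(errno.errorcode[61], "->", os.strerror(61))'
# ENODATA -> No data available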
According to gsyncd.log this gfid is a candidate for syncing, like many others; only very few others are actually 'synced':
[2021-03-12 06:33:20.651147] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for
syncing file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:17.419920] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for
syncing file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:03.382977] D
[master(worker /brick1/mvol1):324:regjob] _GMaster: synced
file=.gfid/a3656075-784c-4377-a482-4aad8378acf0
When I try to fetch some attributes I get 'No such attribute' for the mentioned gfid, while for a synced gfid the attributes are available:
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234: trusted.glusterfs.mdata: No such attribute
[ 09:59:38 ] - root@gl-master-02 ~/tmp $getfattr -m . -d -e hex /tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
getfattr: Removing leading '/' from absolute path names
# file: tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
trusted.glusterfs.mdata=0x010000000000000000000000005d1f73ff000000000da35668000000005d1f73fd0000000015811b46000000005f29050b000000001d5363a6
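To get an idea of how widespread this is, a rough sketch along these lines could check every 'candidate for syncing' gfid from gsyncd.log for the mdata xattr (the aux-mount path is the one from above; the gsyncd.log location is an assumption and may differ per setup):

#!/bin/bash
# check each "candidate for syncing" gfid from gsyncd.log for trusted.glusterfs.mdata
MNT=/tmp/gsyncd-aux-mount-46pc26b9                      # aux mount from above
LOG=/var/log/glusterfs/geo-replication/*/gsyncd.log     # assumed default log location
grep -ho 'file=\.gfid/[0-9a-f-]*' $LOG | cut -d/ -f2 | sort -u | while read -r gfid; do
    if getfattr -n trusted.glusterfs.mdata "$MNT/.gfid/$gfid" >/dev/null 2>&1; then
        echo "mdata present: $gfid"
    else
        echo "mdata MISSING: $gfid"
    fi
done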
I can do a stat on that directory, for example:
[ 10:07:19 ] - root@gl-master-02 ~/tmp
$stat
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
  File: /tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
  Size: 4096          Blocks: 8          IO Block: 131072   directory
Device: 37h/55d       Inode: 9394601311212820456   Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-17 07:26:09.596743288 +0000
Modify: 2021-03-12 07:34:39.382122663 +0000
Change: 2021-03-12 07:34:39.383446790 +0000
Birth: -
Currently I have no clue how to get this fixed. Obviously the extended attributes are missing for most entries in /tmp/gsyncd-aux-mount.../ while some others exist.
I believe rsync is not the root cause; rather it fails because of the missing attributes?!
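That reading fits rsync's exit code: 23 is documented in the rsync man page (EXIT VALUES) as "partial transfer due to error", so rsync is most likely just reporting the failed lgetxattr() calls. If needed, the geo-rep session config (including any rsync-related options gsyncd passes on) can be dumped; the volume and secondary names below are derived from the working-directory path earlier in this mail:

# show the full geo-rep session configuration
gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config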
Lastly, the gfid points to a directory; when I try to get the attributes for this directory via the brick path, it succeeds:
[ 10:15:25 ] - root@gl-master-02 ~/tmp $ls -l /brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234
lrwxrwxrwx 1 root root 56 Nov 5 17:54 /brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234 -> ../../f0/94/f094bf06-2806-4f90-9a79-489827c6cdf9/2217547
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11/2217547
trusted.gfid=0x8d601e5b180c46c8b64fae6224542234
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed359000000007a2d37c1a8b9af89
trusted.glusterfs.dht.mds=0x00000000
[ 10:38:42 ] - root@gl-master-02 ~ $getfattr -m . -d -e hex /brick1/mvol1/2137/files/20/11
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11
trusted.gfid=0xf094bf0628064f909a79489827c6cdf9
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed35900000000d1738834ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005fc5378000000000077ba08a000000005fc535b60000000038d942cc000000005f9ffc610000000007b08744
[ 10:39:54 ] - root@gl-master-02 ~ $
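For completeness, the brick-side path for a directory gfid can be resolved via its .glusterfs symlink and checked like this (a sketch, using the brick and gfid from above):

# resolve the directory gfid to its real path on the brick and dump its xattrs
BRICK=/brick1/mvol1
GFID=8d601e5b-180c-46c8-b64f-ae6224542234
REAL=$(readlink -f "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID")
echo "$REAL"
getfattr -m . -d -e hex "$REAL"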
But the geo-rep ends up in a loop, albeit without any 'E' (error) messages:
[2021-03-12 10:46:40.572500] D
[repce(worker /brick1/mvol1):215:__call__] RepceClient: call
19352:140387951818496:1615546000.5609794 keep_alive -> 256
[2021-03-12 10:46:41.23154] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took
5.017846 seconds
[2021-03-12 10:46:41.35729] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:46.41012] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took
5.017512 seconds
[2021-03-12 10:46:46.53818] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:48.269174] D [repce(worker
/brick2/mvol1):195:push] RepceClient: call
19354:140476158043904:1615546008.2690222 keep_alive({'version':
(1, 0), 'uuid': '2f5de6e4-66de-40a7-9f24-4762aad3ca96',
'retval': 0, 'volume_mark': (1609275788, 819193), 'timeout':
1615546128},) ...
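The cluster_stime/brick_stime values in that loop are plain epoch seconds, so they can be decoded to see where the session is stuck (values taken from the log lines above):

# decode the stimes from the crawl loop
date -u -d @1609281098   # cluster_stime -> Tue Dec 29 22:31:38 UTC 2020
date -u -d @1609281900   # brick_stime   -> Tue Dec 29 22:45:00 UTC 2020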
Does anyone have an idea how to solve this problem?
best regards,
Dietmar
Hi Dietmar,

> batch-fsync-delay-usec was already set to 0 and I increased the sync_jobs from 3 to 6. The moment I increased the sync_jobs, the following error appeared in gsyncd.log:
>
> [2021-03-03 23:17:46.59727] E [syncdutils(worker /brick1/mvol1):312:log_raise_exception] <top>: connection to peer is broken
> [2021-03-03 23:17:46.59912] E [syncdutils(worker /brick2/mvol1):312:log_raise_exception] <top>: connection to peer is broken

If the geo-rep session is currently not in a Faulty state, we should not be bothered about this log message. It is normal: when the config is updated, a geo-rep restart occurs and the above message pops up.
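A quick way to confirm the session came back as Active/Passive rather than Faulty after the config change (volume and secondary names taken from the working-directory path earlier in the thread):

# show the state of each worker of the geo-rep session
gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status detail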
> Passive nodes became active and the content in <brick>/.processing was removed; currently new changelog files are created in this directory.
>
> Shortly before I changed the sync_jobs I had checked the <brick>/.processing directory on the master nodes; the result was the same on every master node.
>
> Since the last error about 12 hours ago nearly 2400 changelog files were created on each node, but it looks like none of them were consumed.
Processed changelogs that have been synced are archived under the <brick>/.processed directory. Verify whether the latest file is created there.

> At the moment I'm not sure what is right and what is wrong... should at least the oldest changelog files in this directory have been processed gradually?
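Concretely, that check could look something like this (working-directory path taken from earlier in the thread):

# list the most recently archived changelogs, newest first
ls -lt /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed | head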
Also, you can try to set the log-level to debug for a while and set it back to info (to avoid flooding the logs), then check the logs to get a better picture of the scenario:

# gluster volume geo-replication <primary> <ip>::<secondary> config log-level DEBUG
# gluster volume geo-replication <primary> <ip>::<secondary> config log-level INFO
Regards,
Shwetha