Hello Felix,
Thank you for your reply...
batch-fsync-delay-usec was already set to 0, and I increased sync_jobs from 3 to 6. The moment I increased sync_jobs, the following error appeared in gsyncd.log:
[2021-03-03 23:17:46.59727] E [syncdutils(worker
/brick1/mvol1):312:log_raise_exception] <top>: connection to
peer is broken
[2021-03-03 23:17:46.59912] E [syncdutils(worker
/brick2/mvol1):312:log_raise_exception] <top>: connection to
peer is broken
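For reference, the change was made with the usual geo-replication config command; a sketch with our session name filled in:

# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config sync_jobs 6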
The passive nodes became active, and the content of <brick>/.processing was removed. Currently, new changelog files are being created in this directory.
Shortly before I changed sync_jobs, I checked the <brick>/.processing directory on the master nodes; the result was the same for every master node (listing below).
Since the last error about 12 hours ago, nearly 2400 changelog files have been created on each node, but it looks like none of them have been consumed.
At the moment I'm not sure what is right and what is wrong... should at least the oldest changelog files in this directory have been processed gradually?
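One way to check whether the oldest file is ever consumed (paths as on our masters; the CHANGELOG.<epoch> names sort chronologically, so the output can be compared over time):

# cd /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1
# ls .processing/ | sort | head -1    # oldest changelog still present
# ls .processing/ | wc -l             # total number of pending changelogs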
Best regards,
Dietmar
Host : gl-master-01
total 9824
-rw-r--r-- 1 root root 268 Mar 3 12:02 CHANGELOG.1614772950
-rw-r--r-- 1 root root 1356 Mar 3 12:02 CHANGELOG.1614772965
-rw-r--r-- 1 root root 1319 Mar 3 12:03 CHANGELOG.1614772980
...
-rw-r--r-- 1 root root 693 Mar 3 23:10 CHANGELOG.1614813002
-rw-r--r-- 1 root root 48 Mar 3 23:12 CHANGELOG.1614813170
-rw-r--r-- 1 root root 1222 Mar 3 23:13 CHANGELOG.1614813226
Dear Dietmar,
I am very interested in helping you with this geo-replication issue, since we also have a geo-replication setup that is crucial for our backup procedure. I have only had a quick look at this so far, so for the moment I can just suggest the following.
Regarding your question whether there is any suitable setting in the gluster environment that would influence the speed of the processing: you could try

gluster volume geo-replication mvol1 gl-slave-05-int::svol config sync_jobs 9

in order to increase the number of rsync processes.
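To double-check what is currently configured for the session, listing the config without a key should work (a sketch, using your session name from the status output):

# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config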
Furthermore, taken from https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/recommended_practices3

Performance Tuning
When the following option is set, it has been observed that there is an increase in geo-replication performance. On the slave volume, run the following command:

# gluster volume set SLAVE_VOL batch-fsync-delay-usec 0
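Whether the option is already in effect can be checked on the slave volume, e.g.:

# gluster volume get SLAVE_VOL all | grep batch-fsync-delay-usec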
Can you verify that the changelog files are consumed?
Regards,
Felix
On 03/03/2021 17:28, Dietmar Putz wrote:
Hi,
I'm having a problem with geo-replication. A short summary: about two months ago I added two further nodes to a distributed replicated volume. For that purpose I stopped the geo-replication, added two nodes on mvol and svol, and started a rebalance process on both sides. Once the rebalance was finished, I started the geo-replication again.
After a few days, and besides some Unicode errors, the status of the newly added brick changed from hybrid crawl to history crawl. Since then there has been no progress; no files or directories have been created on svol for a couple of days.
Looking for a possible reason, I noticed that there was no /var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1 directory on the newly added slave nodes. Obviously I had forgotten to add the new svol node IP addresses to /etc/hosts on all masters. After fixing that, I ran the '... execute gsec_create' and '... create push-pem force' commands again, and the corresponding directories were created. Geo-replication started normally, all active sessions were in history crawl (as shown below), and for a short while some data was transferred to svol. But for about a week nothing has changed on svol: 0 bytes transferred.
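For reference, the full commands were of this form (run from a master node, with our session name):

# gluster system:: execute gsec_create
# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 create push-pem force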
Meanwhile I have deleted (without reset-sync-time) and recreated the geo-rep session. The current status is as shown below, but without any last_synced date. An entry like "last_synced_entry": 1609283145 is still visible in /var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/*status, and changelog files are continuously created in /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/<brick>/.processing.
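The delete and recreate were done with the standard session commands; a sketch (note the delete deliberately without the reset-sync-time option):

# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 stop
# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 delete
# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 create push-pem force
# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 start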
A short time ago I changed log_level to DEBUG for a moment. Unfortunately I got an 'EOFError: Ran out of input' in gsyncd.log, and the rebuild of .processing started from the beginning.
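The log level was changed via the config interface; a sketch (key spelling as used in our config):

# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config log_level DEBUG
# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config log_level INFO    # back to normal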
But one of the first very long lines in gsyncd.log looks like this:
[2021-03-03 11:59:39.503881] D [repce(worker /brick1/mvol1):215:__call__] RepceClient: call 9163:139944064358208:1614772779.4982471 history_getchanges -> ['/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing/CHANGELOG.1609280278',...
1609280278 means Tuesday, December 29, 2020, 10:17:58 PM (UTC) and would roughly fit the last_synced date.
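The conversion can be reproduced on the shell (GNU date):

# date -u -d @1609280278
Tue Dec 29 22:17:58 UTC 2020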
However, I have nearly 300k files in <brick>/.history/.processing, and in the log/trace it seems that every file in <brick>/.history/.processing will be processed and transferred to <brick>/.processing.
My questions so far:
- First of all, is everything still OK with this geo-replication?
- Do I have to wait until all changelog files in <brick>/.history/.processing are processed before transfers to svol start?
- What happens if any other error appears in geo-replication while these changelog files are being processed, i.e. while the crawl status is history crawl? Does the entire process start from the beginning? Would a checkpoint be helpful for future decisions (see the sketch below)?
- Is there any suitable setting in the gluster environment which would influence the speed of the processing (current settings attached)?
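If a checkpoint turns out to be helpful, my understanding is that it would be set and monitored roughly like this (a sketch for our session name):

# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config checkpoint now
# gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status detail    # includes checkpoint completion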
I hope someone can help...
Best regards
Dietmar
[ 15:17:47 ] - root@gl-master-01 /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history $ls .processing/ | wc -l
294669
[ 12:56:31 ] - root@gl-master-01 ~ $gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status
MASTER NODE MASTER VOL MASTER BRICK SLAVE USER SLAVE SLAVE NODE STATUS CRAWL STATUS LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
gl-master-01-int mvol1 /brick1/mvol1 root gl-slave-01-int::svol1 gl-slave-05-int Active History Crawl 2020-12-29 23:00:48
gl-master-01-int mvol1 /brick2/mvol1 root gl-slave-01-int::svol1 gl-slave-03-int Active History Crawl 2020-12-29 23:05:45
gl-master-05-int mvol1 /brick1/mvol1 root gl-slave-01-int::svol1 gl-slave-03-int Active History Crawl 2021-02-20 17:38:38
gl-master-06-int mvol1 /brick1/mvol1 root gl-slave-01-int::svol1 gl-slave-06-int Passive N/A N/A
gl-master-03-int mvol1 /brick1/mvol1 root gl-slave-01-int::svol1 gl-slave-05-int Passive N/A N/A
gl-master-03-int mvol1 /brick2/mvol1 root gl-slave-01-int::svol1 gl-slave-04-int Active History Crawl 2020-12-29 23:07:34
gl-master-04-int mvol1 /brick1/mvol1 root gl-slave-01-int::svol1 gl-slave-06-int Active History Crawl 2020-12-29 23:07:22
gl-master-04-int mvol1 /brick2/mvol1 root gl-slave-01-int::svol1 gl-slave-01-int Passive N/A N/A
gl-master-02-int mvol1 /brick1/mvol1 root gl-slave-01-int::svol1 gl-slave-01-int Passive N/A N/A
gl-master-02-int mvol1 /brick2/mvol1 root gl-slave-01-int::svol1 gl-slave-06-int Passive N/A N/A
[ 13:14:47 ] - root@gl-master-01 ~ $
--
Mit freundlichen Grüßen / Kind Regards

Dietmar Putz
Head of Infrastructure
dietmar.putz@3q.video
www.3q.video

3Q GmbH
Kurfürstendamm 102 | 10711 Berlin
CEO Julius Thomas
Amtsgericht Charlottenburg
Registernummer HRB 217706 B
________

Community Meeting Calendar:
Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users