Re: Issues in AFR and self healing

I haven't found any disconnections yet. We analyzed the port's traffic to see whether too much data was going through, but that was OK. I also can't see any other disconnections, so for now we will keep checking the network in case I missed something.

Thanks for all the help! If I have any other news I will let you know.

Pablo.

On 08/16/2018 01:06 AM, Ravishankar N wrote:



On 08/15/2018 11:07 PM, Pablo Schandin wrote:

I found another log that I wasn't aware of in /var/log/glusterfs/brick; I had confused it with the mount log. In this file I see a lot of entries like this one:

[2018-08-15 16:41:19.568477] I [addr.c:55:compare_addr_and_update] 0-/mnt/brick1/gv1: allowed = "172.20.36.10", received addr = "172.20.36.11"
[2018-08-15 16:41:19.568527] I [addr.c:55:compare_addr_and_update] 0-/mnt/brick1/gv1: allowed = "172.20.36.11", received addr = "172.20.36.11"
[2018-08-15 16:41:19.568547] I [login.c:76:gf_auth] 0-auth/login: allowed user names: 7107ccfa-0ba1-4172-aa5a-031568927bf1
[2018-08-15 16:41:19.568564] I [MSGID: 115029] [server-handshake.c:793:server_setvolume] 0-gv1-server: accepted client from physinfra-hb2.xcade.net-21091-2018/08/15-16:41:03:103872-gv1-client-0-0-0 (version: 3.12.6)
[2018-08-15 16:41:19.582710] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-gv1-server: disconnecting connection from physinfra-hb2.xcade.net-21091-2018/08/15-16:41:03:103872-gv1-client-0-0-0
[2018-08-15 16:41:19.582830] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-gv1-server: Shutting down connection physinfra-hb2.xcade.net-21091-2018/08/15-16:41:03:103872-gv1-client-0-0-0


So I am seeing a lot of disconnections, right? Could this be why the self-healing is triggered all the time?

Not necessarily. These disconnects could also be due to the glfsheal binary, which is invoked when you run `gluster vol heal volname info` and the like; those disconnects do not cause heals. It would be better to check your client mount logs for disconnect messages like this one:

[2018-08-16 03:59:32.289763] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-testvol-client-4: disconnected from testvol-client-0. Client process will keep trying to connect to glusterd until brick's port is available

If there are no disconnects and you are still seeing files undergoing heal, then you might want to check the brick logs to see if there are any write failures.
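To scan the brick logs for write failures, something like this could work (a rough sketch; brick log file names are derived from the brick path, so the exact name here is an assumption):

# E-severity entries in a brick log usually indicate failed operations
grep ' E ' /var/log/glusterfs/bricks/mnt-brick1-gv1.log | tail -n 20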
Thanks,
Ravi

Thanks!

Pablo.




On 08/14/2018 09:15 AM, Pablo Schandin wrote:

Thanks for the info!

I cannot see anything in the mount log besides one line every time it rotates:

[2018-08-13 06:25:02.246187] I [glusterfsd-mgmt.c:1821:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing

But I did find, in the glfsheal-gv1.log of the volumes, some kind of server-client connection that was disconnected and now connects using a different port. The block of log for each run is rather long, so I'm copying it into a pastebin:

https://pastebin.com/bp06rrsT

Maybe this has something to do with it?

Thanks!

Pablo.

On 08/11/2018 12:19 AM, Ravishankar N wrote:



On 08/10/2018 11:25 PM, Pablo Schandin wrote:

Hello everyone!

I'm having some trouble, but I'm not quite sure with what yet. I'm running GlusterFS 3.12.6 on Ubuntu 16.04. I have two servers (nodes) in the cluster in replica mode, and each server has 2 bricks. As the servers are KVM hosts running several VMs, on each server one brick holds the locally defined VMs while the other brick is the replica from the other server; it holds data, but no actual writing is done on it except for the replication.

                 Server 1                                    Server 2
Volume 1 (gv1):  Brick 1: defined VMs (read/write)  ---->    Brick 1: replicated qcow2 files
Volume 2 (gv2):  Brick 2: replicated qcow2 files    <----    Brick 2: defined VMs (read/write)

So, the main issue arose when I got a Nagios alarm warning about a file listed to be healed, and then the alarm disappeared. I came to find out that every 5 minutes the self-heal daemon triggers the healing, which fixes it. But looking at the logs, I see a lot of entries in the glustershd.log file like this:

[2018-08-09 14:23:37.689403] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv1-replicate-0: Completed data selfheal on 407bd97b-e76c-4f81-8f59-7dae11507b0c. sources=[0]  sinks=1
[2018-08-09 14:44:37.933143] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv2-replicate-0: Completed data selfheal on 73713556-5b63-4f91-b83d-d7d82fee111f. sources=[0]  sinks=1
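A quick way to see how often each file gets healed (a sketch, assuming the default glustershd.log location):

# count completed selfheals per gfid, most frequently healed first
grep 'Completed data selfheal' /var/log/glusterfs/glustershd.log \
  | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
  | sort | uniq -c | sort -rn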

The qcow2 files are being healed several times a day (up to 30 times on occasion). As I understand it, this means that a data heal occurred on the files with gfids 407b... and 7371..., from source to sink. Local server to replica server? Is it OK for the shd to heal files in the replicated brick that supposedly sees no writes besides the mirroring? How does that work?

In AFR there is no notion of a local or remote brick for writes. No matter which client you write to the volume from, the write is sent to both bricks, i.e. the replication is synchronous and real-time.
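You can see that bookkeeping directly: AFR keeps its pending changelog as extended attributes on each brick's copy of the file. A minimal sketch, where the file path is a placeholder and the command must be run against the brick, not the mount:

# run as root on each server, against the brick path
getfattr -d -m . -e hex /mnt/brick1/gv1/some-vm.qcow2
# non-zero trusted.afr.gv1-client-* values mean writes are still
# pending on the other brick, so the file will be listed for heal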
 

How does AFR replication work? The file with gfid 7371... is the qcow2 root disk of an ownCloud server with 17 GB of data. It does not seem big enough to be a bottleneck of some sort, I think.

Also, I was investigating the directory tree in brick/.glusterfs/indices and I noticed that in both xattrop and dirty I always have a file named xattrop-xxxxxx or dirty-xxxxxx. I read that the xattrop file acts like a parent file or handle referenced by the other files created there as gfid-named hard links, which the shd uses to find what to heal. Is it the same for the ones in the dirty dir?

Yes, before the write, the gfid gets captured inside dirty on all bricks. If the write is successful, it gets removed. In addition, if the write fails on one brick, the other brick will capture the gfid inside xattrop.
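You can watch this on the bricks themselves (a sketch, using the brick path from your layout above):

# gfid-named hard links appear here while a write is in flight...
ls /mnt/brick1/gv1/.glusterfs/indices/dirty
# ...and show up under xattrop on the healthy brick when a write
# failed on the other one
ls /mnt/brick1/gv1/.glusterfs/indices/xattrop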

Any help will be greatly appreciated. Thanks!

If frequent heals are triggered, it could mean there are frequent network disconnects from the clients to the bricks while writes happen. You can check the mount logs to see if that is the case, and investigate possible network issues.
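For example, something like this (a sketch; fuse mount logs are named after the mount point, so the exact file name here is an assumption):

# recent client-side disconnects in the fuse mount log
grep 'disconnected from' /var/log/glusterfs/mnt-vm-images.log | tail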

HTH,
Ravi

Pablo.











_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
