Re: Issues in AFR and self healing

Pablo Schandin <pablo.schandin@xxxxxxxxxxx> · Tue, 14 Aug 2018 09:15:53 -0300



    Thanks for the info!
    I cannot see anything in the mount log besides one line every
      time it rotates:

    
    [2018-08-13 06:25:02.246187] I [glusterfsd-mgmt.c:1821:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing

      
    But in the glfsheal-gv1.log of the volume I did find what looks
      like a server-client connection that was disconnected and now
      reconnects using a different port. The log block for each run is
      rather long, so I'm copying it into a pastebin.

    
    https://pastebin.com/bp06rrsT
    Maybe this has something to do with it?
    Thanks!

    
    Pablo.

    
    On 08/11/2018 12:19 AM, Ravishankar N
      wrote:

    
      On 08/10/2018 11:25 PM, Pablo
        Schandin wrote:

      
        Hello everyone!
        I'm having some trouble with something, but I'm not quite sure
          with what yet. I'm running GlusterFS 3.12.6 on Ubuntu
          16.04. I have two servers (nodes) in the cluster in replica
          mode. Each server has 2 bricks. As the servers are KVM hosts
          running several VMs, one brick holds the VMs defined locally
          on that server and the second brick is the replica of the
          other server's brick. It has data, but no actual writing is
          done on it except for the replication.

        
                          Server 1                                   Server 2
          Volume 1 (gv1): Brick 1 defined VMs (read/write)    ---->  Brick 1 replicated qcow2 files
          Volume 2 (gv2): Brick 2 replicated qcow2 files      <----  Brick 2 defined VMs (read/write)

        So, the main issue arose when I got a nagios alarm that
          warned about a file listed to be healed. And then it
          disappeared. I came to find out that every 5 minutes, the self
          heal daemon triggers the healing and this fixes it. But
          looking at the logs I have a lot of entries in the
          glustershd.log file like this:
        [2018-08-09 14:23:37.689403] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv1-replicate-0: Completed data selfheal on 407bd97b-e76c-4f81-8f59-7dae11507b0c. sources=[0]  sinks=1
        [2018-08-09 14:44:37.933143] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv2-replicate-0: Completed data selfheal on 73713556-5b63-4f91-b83d-d7d82fee111f. sources=[0]  sinks=1

        
        The qcow2 files are being healed several times a day (up to
          30 on occasion). As I understand it, this means that a data
          heal occurred on the files with gfids 407b... and 7371...,
          from source to sink. Is that from the local server to the
          replica server? Is it OK for the shd to heal files in the
          replicated brick that supposedly has no writes on it besides
          the mirroring? How does that work?
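
To put a number on "several times a day", the glustershd.log entries quoted above can be tallied per day and per gfid. This is only a rough sketch: the regular expression is written against the two log lines shown in this thread, and the helper name count_heals is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the "Completed data selfheal" entries quoted in this
# thread. \s+ and \s* let it tolerate the line wrapping added by mail clients.
HEAL_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2}) [\d:.]+\]\s+I\s+\[MSGID:\s*108026\]\s+"
    r"\[afr-self-heal-common\.c:\d+:afr_log_selfheal\]\s+"
    r"(\S+):\s+Completed\s+(\w+)\s+selfheal\s+on\s+([0-9a-f-]{36})"
)


def count_heals(log_text):
    """Return a Counter keyed by (date, subvolume, heal type, gfid)."""
    return Counter(m.groups() for m in HEAL_RE.finditer(log_text))
```

Reading the whole log into a string and passing it to count_heals gives one count per file per day, which makes it easy to see whether the heals cluster around particular days or particular disks.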
      
      In AFR, for writes, there is no notion of local/remote brick. No
      matter from which client you write to the volume, it gets sent to
      both bricks. i.e. the replication is synchronous and real time. 

       
        How does AFR replication work? The file with gfid 7371... is
          the qcow2 root disk of an owncloud server with 17GB of data.
          That does not seem big enough to be a bottleneck of some
          sort, I think.
        Also, I was investigating the directory tree in
          brick/.glusterfs/indices and I noticed that in both xattrop
          and dirty I always have a file named xattrop-xxxxxx
          and dirty-xxxxxx, respectively. I read that the xattrop file
          is like a parent file or handle, and that the other files
          created there are hardlinks to it, named by gfid, for the shd
          to heal. Is it the same for the ones in the dirty dir?
      
      Yes, before the write, the gfid gets captured inside dirty on all
      bricks. If the write is successful, it gets removed. In addition,
      if the write fails on one brick, the other brick will capture the
      gfid inside xattrop.
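
The bookkeeping described above can be pictured with a small toy model. This is not GlusterFS code: the Brick class and replicated_write function are invented for illustration, Python sets stand in for the dirty and xattrop index directories, and what happens to the dirty entry on the brick where the write failed is an assumption, since it is not stated in this thread.

```python
class Brick:
    """Toy stand-in for one brick's index directories (not GlusterFS code)."""

    def __init__(self, name):
        self.name = name
        self.dirty = set()    # gfids with a write in flight
        self.xattrop = set()  # gfids this brick knows need healing elsewhere


def replicated_write(bricks, gfid, fail_on=()):
    """Model one write; `fail_on` names the bricks where the write fails."""
    # Before the write: the gfid is recorded in dirty on all bricks.
    for brick in bricks:
        brick.dirty.add(gfid)

    good = [b for b in bricks if b.name not in fail_on]
    any_failed = len(good) < len(bricks)

    for brick in good:
        # Successful write: the dirty entry is removed again.
        brick.dirty.discard(gfid)
        # If another brick missed the write, remember the gfid in xattrop.
        if any_failed:
            brick.xattrop.add(gfid)
    # Assumption: on a brick where the write failed, dirty is left in place.
    return [b.name for b in good]
```

With two bricks, a write that reaches both leaves every set empty, while a write with fail_on={"server2"} leaves the gfid in the xattrop set of the first brick, which is the entry the self-heal daemon would later pick up.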

      
        Any help will be greatly appreciated. Thanks!

        
      If frequent heals are triggered, it could mean there are frequent
      network disconnects from the clients to the bricks as writes
      happen. You can check the mount logs and see if that is the case
      and investigate possible network issues.
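
One way to make that check less tedious is to count disconnect messages per hour and compare the result with the times of the heals. The sketch below assumes that disconnects show up as log lines containing the text "disconnected from"; that search string is an assumption and is exposed as a parameter so it can be adjusted to whatever the mount log actually prints. The function name is made up for illustration.

```python
import re
from collections import Counter

# Timestamp format as seen in the log lines quoted in this thread.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}\.\d+\]")


def disconnects_per_hour(lines, needle="disconnected from"):
    """Count log lines containing `needle`, grouped by 'YYYY-MM-DD HH'."""
    counts = Counter()
    for line in lines:
        m = TS_RE.match(line)
        if m and needle in line:
            counts[f"{m.group(1)} {m.group(2)}"] += 1
    return counts
```

The function accepts any iterable of lines, so an open log file can be passed directly; the location of the mount log depends on the installation.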

      
      HTH,

      Ravi 

      
        Pablo.

        
        _______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
      
      