Re: GlusterFS cluster stalls if one server from the cluster goes down and then comes back up

Ravishankar N <ravishankar@xxxxxxxxxx> · Wed, 23 Mar 2016 21:22:59 +0530



    On 03/23/2016 02:01 PM, Daniel Kanchev wrote:

    
          Hi, everyone.

            
          We are using GlusterFS configured in the following way:

            
            [root@web1 ~]# gluster volume info

            Volume Name: share
            Type: Replicate
            Volume ID: hidden data on purpose
            Status: Started
            Number of Bricks: 1 x 3 = 3
            Transport-type: tcp
            Bricks:
            Brick1: c10839:/gluster
            Brick2: c10840:/gluster
            Brick3: web3:/gluster
            Options Reconfigured:
            cluster.consistent-metadata: on
            performance.readdir-ahead: on
            nfs.disable: true
            cluster.self-heal-daemon: on
            cluster.metadata-self-heal: on
            auth.allow: hidden data on purpose
            performance.cache-size: 256MB
            performance.io-thread-count: 8
            performance.cache-refresh-timeout: 3

          
          Here is the output of the status command for the volume
            and the peers:

            
            [root@web1 ~]# gluster volume status
            Status of volume: share
            Gluster process                             TCP Port  RDMA Port  Online  Pid
            ------------------------------------------------------------------------------
            Brick c10839:/gluster                       49152     0          Y       540
            Brick c10840:/gluster                       49152     0          Y       533
            Brick web3:/gluster                         49152     0          Y       782
            Self-heal Daemon on localhost               N/A       N/A        Y       602
            Self-heal Daemon on web3                    N/A       N/A        Y       790
            Self-heal Daemon on web4                    N/A       N/A        Y       636
            Self-heal Daemon on web2                    N/A       N/A        Y       523

            Task Status of Volume share
            ------------------------------------------------------------------------------
            There are no active volume tasks

            
            [root@web1 ~]# gluster peer status
            Number of Peers: 3

            Hostname: web3
            Uuid: b138b4d5-8623-4224-825e-1dfdc3770743
            State: Peer in Cluster (Connected)

            Hostname: web2
            Uuid: b3926959-3ae8-4826-933a-4bf3b3bd55aa
            State: Peer in Cluster (Connected)
            Other names:
            c10840.sgvps.net

            Hostname: web4
            Uuid: f7553cba-c105-4d2c-8b89-e5e78a269847
            State: Peer in Cluster (Connected)

            
          All in all, we have three servers that host bricks and
            actually store the data, and one server that is just a peer
            and is connected to one of the other servers.

            
          The Problem: If any of the 4 servers goes down,
            the cluster continues to work as expected. However,
            once that server comes back up, the whole cluster stalls
            for a certain period of time (30-120 seconds). During this
            period no I/O operations can be executed, and the apps that
            use the data on GlusterFS simply go down because they
            cannot read or write any data.

            
          We suspect that the issue is related to the self-heal
            daemons, but we are not sure. Could you please advise how to
            debug this issue and what could be causing the whole cluster
            to go down? If it is the self-heal, as we suspect, do you
            think it is OK to disable it? If some of the settings are
            causing this problem, could you please advise how to
            configure the cluster to avoid it?

            
    What version of gluster is this?

    Do you observe the problem even when only the 4th 'non data' server
    comes up? In that case it is unlikely that self-heal is the issue.
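
    One way to see whether heals are actually pending after a server
    returns is to look at the output of "gluster volume heal share info".
    The helper below is only a sketch: the name pending_heals is made up
    for illustration, and it assumes the output has "Brick <host>:<path>"
    lines followed by "Number of entries: <n>" lines, which may differ
    between releases.

```shell
#!/bin/sh
# Hypothetical helper: summarise "gluster volume heal <VOL> info" output
# read from stdin. Assumes each brick section starts with a "Brick ..."
# line and ends with a "Number of entries: N" line.
pending_heals() {
    awk '/^Brick /             { brick = $2 }
         /^Number of entries:/ { print brick, $NF; total += $NF }
         END                   { print "total", total + 0 }'
}

# Intended use on one of the servers (not executed here):
#   gluster volume heal share info | pending_heals
```

    If the total stays at 0 while the stall is happening, that would point
    away from self-heal.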

    Are the clients using FUSE or NFS mounts?
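
    Both answers can be collected on the machines themselves. The version
    is printed by "gluster --version". For the mount type, the sketch
    below reads a /proc/mounts-style table and reports FUSE or NFS per
    mount point. The name mount_kinds is hypothetical, and the sketch
    assumes FUSE clients show the filesystem type "fuse.glusterfs" and NFS
    clients show "nfs" or "nfs4". Note that it lists every NFS mount, not
    only those served by Gluster.

```shell
#!/bin/sh
# Hypothetical helper: classify mounts from a /proc/mounts-style table
# on stdin (fields: device, mount point, filesystem type, ...).
mount_kinds() {
    awk '$3 == "fuse.glusterfs"       { print $2, "FUSE" }
         $3 == "nfs" || $3 == "nfs4"  { print $2, "NFS" }'
}

# Intended use on a client (not executed here):
#   gluster --version
#   mount_kinds < /proc/mounts
```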

    -Ravi

    
          If any info from the logs is needed, please let us know
            what you need.

          
          Thanks in advance!

          
          Regards,

        
        Daniel

      
      _______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
    
    