Re: Initial mount problem - all subvolumes are down

On 03/31/2015 10:47 PM, Rumen Telbizov wrote:
Pranith and Atin,

Thank you for looking into this and confirming it's a bug. Please log the bug yourselves, since I am not familiar with the project's bug-tracking system.

Given its severity and the fact that this effectively stops the cluster from functioning properly after boot, what do you think the timeline for fixing this issue would be? In which version do you expect it to be fixed?

In the meantime, is there another workaround you might suggest, besides running a second mount attempt after boot has completed?
Adding glusterd maintainers to the thread: +kaushal, +krishnan
I will let them answer your questions.

Pranith

Thank you again for your help,
Rumen Telbizov



On Tue, Mar 31, 2015 at 2:53 AM, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:

On 03/31/2015 01:55 PM, Atin Mukherjee wrote:

On 03/31/2015 01:03 PM, Pranith Kumar Karampuri wrote:
On 03/31/2015 12:53 PM, Atin Mukherjee wrote:
On 03/31/2015 12:27 PM, Pranith Kumar Karampuri wrote:
Atin,
         Could it be because the bricks are started with PROC_START_NO_WAIT?
That's the correct analysis, Pranith. The mount was attempted before the
bricks were started. If there is a time lag of a few seconds between the
volume start and the mount, the problem goes away.
Atin,
        I think one way to solve this issue is to still start the bricks with
NO_WAIT so that we can handle the pmap-signin, but wait for the pmap-signins
to complete before responding to the CLI / completing 'init'?
Logically that should solve the problem. We need to think it through more
from the existing design perspective.
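
To illustrate the sequencing being discussed, here is a minimal, purely
conceptual sketch in Python; it is not GlusterFS code, and the class and
method names are made up. The idea is only: spawn the bricks without
blocking (the NO_WAIT path), but hold the volume-start/'init' reply until
every expected brick has signed in with the port mapper or a timeout
expires.

import threading

class PmapSigninTracker(object):
    """Toy stand-in for glusterd's pmap bookkeeping (hypothetical)."""
    def __init__(self, expected_bricks):
        self._remaining = set(expected_bricks)
        self._all_in = threading.Event()
        self._lock = threading.Lock()
        if not self._remaining:
            self._all_in.set()

    def on_signin(self, brick):
        # Called when a brick process signs in with its port; the bricks
        # were spawned without waiting, so these arrive asynchronously.
        with self._lock:
            self._remaining.discard(brick)
            if not self._remaining:
                self._all_in.set()

    def wait_for_all(self, timeout):
        # The volume-start / 'init' path would block here before replying
        # to the CLI, instead of replying as soon as the bricks are spawned.
        return self._all_in.wait(timeout)

With that ordering, the reply to the CLI (and anything sequenced after it,
such as a boot-time mount) can no longer race ahead of the brick sign-ins.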
Rumen,
     Feel free to log a bug. This should be fixed in a later release. We can raise the bug and work on it as well, if you prefer it that way.

Pranith


~Atin
Pranith

Pranith
On 03/31/2015 04:41 AM, Rumen Telbizov wrote:
Hello everyone,

I have a problem that I am trying to resolve, and I am not sure which way
to go, so I am asking for your advice.

What it comes down to is that upon initial boot of all my GlusterFS
machines, the shared volume doesn't get mounted. Nevertheless, the
volume is successfully created and started, and further attempts to
mount it manually succeed. I suspect what's happening is that the
gluster processes/bricks/etc. haven't fully started at the time the
/etc/fstab entry is read and the initial mount attempt is made. By the
time I log in and run mount -a, the volume mounts without any issues.

_Details from the logs:_

[2015-03-30 22:29:04.381918] I [MSGID: 100030]
[glusterfsd.c:2018:main] 0-/usr/sbin/glusterfs: Started running
/usr/sbin/glusterfs version 3.6.2 (args: /usr/sbin/glusterfs
--log-file=/var/log/glusterfs/glusterfs.log --attribute-timeout=0
--entry-timeout=0 --volfile-server=localhost
--volfile-server=10.12.130.21 --volfile-server=10.12.130.22
--volfile-server=10.12.130.23 --volfile-id=/myvolume /opt/shared)
[2015-03-30 22:29:04.394913] E [socket.c:2267:socket_connect_finish]
0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-03-30 22:29:04.394950] E
[glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to
connect with remote-host: localhost (Transport endpoint is not
connected)
[2015-03-30 22:29:04.394964] I
[glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting
to next volfile server 10.12.130.21
[2015-03-30 22:29:08.390687] E
[glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to
connect with remote-host: 10.12.130.21 (Transport endpoint is not
connected)
[2015-03-30 22:29:08.390720] I
[glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting
to next volfile server 10.12.130.22
[2015-03-30 22:29:11.392015] E
[glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to
connect with remote-host: 10.12.130.22 (Transport endpoint is not
connected)
[2015-03-30 22:29:11.392050] I
[glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting
to next volfile server 10.12.130.23
[2015-03-30 22:29:14.406429] I [dht-shared.c:337:dht_init_regex]
0-brain-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
[2015-03-30 22:29:14.408964] I
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-2: setting
frame-timeout to 60
[2015-03-30 22:29:14.409183] I
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-1: setting
frame-timeout to 60
[2015-03-30 22:29:14.409388] I
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-0: setting
frame-timeout to 60
[2015-03-30 22:29:14.409430] I [client.c:2280:notify] 0-host-client-0:
parent translators are ready, attempting connect on transport
[2015-03-30 22:29:14.409658] I [client.c:2280:notify] 0-host-client-1:
parent translators are ready, attempting connect on transport
[2015-03-30 22:29:14.409844] I [client.c:2280:notify] 0-host-client-2:
parent translators are ready, attempting connect on transport
Final graph:

....

[2015-03-30 22:29:14.411045] I [client.c:2215:client_rpc_notify]
0-host-client-2: disconnected from host-client-2. Client process will
keep trying to connect to glusterd until brick's port is available
[2015-03-30 22:29:14.411063] E [MSGID: 108006]
[afr-common.c:3591:afr_notify] 0-myvolume-replicate-0: All subvolumes
are down. Going offline until atleast one of them comes back up.
[2015-03-30 22:29:14.414871] I [fuse-bridge.c:5080:fuse_graph_setup]
0-fuse: switched to graph 0
[2015-03-30 22:29:14.415003] I [fuse-bridge.c:4009:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22
kernel 7.17
[2015-03-30 22:29:14.415101] I [afr-common.c:3722:afr_local_init]
0-myvolume-replicate-0: no subvolumes up
[2015-03-30 22:29:14.415215] I [afr-common.c:3722:afr_local_init]
0-myvolume-replicate-0: no subvolumes up
[2015-03-30 22:29:14.415236] W [fuse-bridge.c:779:fuse_attr_cbk]
0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not
connected)
[2015-03-30 22:29:14.419007] I [fuse-bridge.c:4921:fuse_thread_proc]
0-fuse: unmounting /opt/shared
[2015-03-30 22:29:14.420176] W [glusterfsd.c:1194:cleanup_and_exit]
(--> 0-: received signum (15), shutting down
[2015-03-30 22:29:14.420192] I [fuse-bridge.c:5599:fini] 0-fuse:
Unmounting '/opt/shared'.


_Relevant /etc/fstab entries are:_

/dev/xvdb /opt/local xfs defaults,noatime,nodiratime 0 0

localhost:/myvolume /opt/shared glusterfs
defaults,_netdev,attribute-timeout=0,entry-timeout=0,log-file=/var/log/glusterfs/glusterfs.log,backup-volfile-servers=10.12.130.21:10.12.130.22:10.12.130.23

0 0


_Volume configuration is:_

Volume Name: myvolume
Type: Replicate
Volume ID: xxxx
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: host1:/opt/local/brick
Brick2: host2:/opt/local/brick
Brick3: host3:/opt/local/brick
Options Reconfigured:
storage.health-check-interval: 5
network.ping-timeout: 5
nfs.disable: on
auth.allow: 10.12.130.21,10.12.130.22,10.12.130.23
cluster.quorum-type: auto
network.frame-timeout: 60


I run Debian 7 with GlusterFS version 3.6.2-2.

While I could put together some rc.local-type script that retries
mounting the volume until it succeeds or times out, I was wondering if
there's a better way to solve this problem.
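
For reference, a minimal sketch of that retry approach in Python, assuming
the volume stays listed in /etc/fstab so that a plain "mount /opt/shared"
picks up the options already defined there; the deadline and interval
values below are arbitrary:

#!/usr/bin/env python
# Hypothetical rc.local-style helper: keep retrying the fstab mount of
# the GlusterFS volume until it succeeds or a deadline passes.
import subprocess
import sys
import time

MOUNT_POINT = "/opt/shared"
DEADLINE = 120   # seconds before giving up
INTERVAL = 5     # seconds between attempts

def is_mounted(path):
    # "mountpoint -q" exits 0 when the path is an active mount point.
    return subprocess.call(["mountpoint", "-q", path]) == 0

deadline = time.time() + DEADLINE
while time.time() < deadline:
    if is_mounted(MOUNT_POINT):
        sys.exit(0)
    # Mount using the options already defined for this entry in /etc/fstab.
    subprocess.call(["mount", MOUNT_POINT])
    time.sleep(INTERVAL)

sys.exit(1)

Running something like this from rc.local would simply defer the mount
until glusterd and the bricks are reachable.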

Thank you for your help.

Regards,
--
Rumen Telbizov
Unix Systems Administrator <http://telbizov.com>


--

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
