Re: Initial mount problem - all subvolumes are down

On 03/31/2015 10:47 PM, Rumen Telbizov wrote:
Pranith and Atin,

Thank you for looking into this and confirming it's a bug. Please log the bug yourselves, since I am not familiar with the project's bug-tracking system.

Given its severity and the fact that this effectively stops the cluster from functioning properly after boot, what do you think the timeline for fixing this issue would be? In which version do you expect it to be fixed?

In the meantime, is there another workaround you might suggest, besides running a second mount attempt after boot has completed?
Adding glusterd maintainers to the thread: +kaushal, +krishnan
I will let them answer your questions.

Pranith

Thank you again for your help,
Rumen Telbizov



On Tue, Mar 31, 2015 at 2:53 AM, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:

On 03/31/2015 01:55 PM, Atin Mukherjee wrote:

On 03/31/2015 01:03 PM, Pranith Kumar Karampuri wrote:
On 03/31/2015 12:53 PM, Atin Mukherjee wrote:
On 03/31/2015 12:27 PM, Pranith Kumar Karampuri wrote:
Atin,
         Could it be because the bricks are started with PROC_START_NO_WAIT?
That's the correct analysis, Pranith. The mount was attempted before the
bricks were started. If there is a time lag of a few seconds between the
volume start and the mount, the problem goes away.
Atin,
        I think one way to solve this issue is to still start the bricks with
NO_WAIT so that we can handle the pmap-signin, but wait for the pmap-signins
to complete before responding to the CLI / completing 'init'?
Logically that should solve the problem. We need to think it through more
from the existing design perspective.
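
To illustrate the sequencing being discussed, here is a minimal, purely
conceptual sketch in Python; it is not GlusterFS code, and the class and
method names are made up. The idea is only: spawn the bricks without
blocking (the NO_WAIT path), but hold the volume-start/'init' reply until
every expected brick has signed in with the port mapper or a timeout
expires.

import threading

class PmapSigninTracker(object):
    """Toy stand-in for glusterd's pmap bookkeeping (hypothetical)."""
    def __init__(self, expected_bricks):
        self._remaining = set(expected_bricks)
        self._all_in = threading.Event()
        self._lock = threading.Lock()
        if not self._remaining:
            self._all_in.set()

    def on_signin(self, brick):
        # Called when a brick process signs in with its port; the bricks
        # were spawned without waiting, so these arrive asynchronously.
        with self._lock:
            self._remaining.discard(brick)
            if not self._remaining:
                self._all_in.set()

    def wait_for_all(self, timeout):
        # The volume-start / 'init' path would block here before replying
        # to the CLI, instead of replying as soon as the bricks are spawned.
        return self._all_in.wait(timeout)

With that ordering, the reply to the CLI (and anything sequenced after it,
such as a boot-time mount) can no longer race ahead of the brick sign-ins.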
Rumen,
     Feel free to log a bug. This should be fixed in a later release. We can raise the bug and work on it as well, if you prefer it that way.

Pranith


~Atin
Pranith

Pranith
On 03/31/2015 04:41 AM, Rumen Telbizov wrote:
Hello everyone,

I have a problem that I am trying to resolve, and I am not sure which way
to go, so I am asking for your advice.

What it comes down to is that upon initial boot of all my GlusterFS
machines, the shared volume doesn't get mounted. Nevertheless, the
volume is successfully created and started, and further attempts to
mount it manually succeed. I suspect what's happening is that the
gluster processes/bricks/etc. haven't fully started at the time the
/etc/fstab entry is read and the initial mount attempt is made. By the
time I log in and run mount -a, the volume mounts without any issues.

_Details from the logs:_

[2015-03-30 22:29:04.381918] I [MSGID: 100030]
[glusterfsd.c:2018:main] 0-/usr/sbin/glusterfs: Started running
/usr/sbin/glusterfs version 3.6.2 (args: /usr/sbin/glusterfs
--log-file=/var/log/glusterfs/glusterfs.log --attribute-timeout=0
--entry-timeout=0 --volfile-server=localhost
--volfile-server=10.12.130.21 --volfile-server=10.12.130.22
--volfile-server=10.12.130.23 --volfile-id=/myvolume /opt/shared)
[2015-03-30 22:29:04.394913] E [socket.c:2267:socket_connect_finish]
0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-03-30 22:29:04.394950] E
[glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to
connect with remote-host: localhost (Transport endpoint is not
connected)
[2015-03-30 22:29:04.394964] I
[glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting
to next volfile server 10.12.130.21
[2015-03-30 22:29:08.390687] E
[glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to
connect with remote-host: 10.12.130.21 (Transport endpoint is not
connected)
[2015-03-30 22:29:08.390720] I
[glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting
to next volfile server 10.12.130.22
[2015-03-30 22:29:11.392015] E
[glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to
connect with remote-host: 10.12.130.22 (Transport endpoint is not
connected)
[2015-03-30 22:29:11.392050] I
[glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting
to next volfile server 10.12.130.23
[2015-03-30 22:29:14.406429] I [dht-shared.c:337:dht_init_regex]
0-brain-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
[2015-03-30 22:29:14.408964] I
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-2: setting
frame-timeout to 60
[2015-03-30 22:29:14.409183] I
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-1: setting
frame-timeout to 60
[2015-03-30 22:29:14.409388] I
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-0: setting
frame-timeout to 60
[2015-03-30 22:29:14.409430] I [client.c:2280:notify] 0-host-client-0:
parent translators are ready, attempting connect on transport
[2015-03-30 22:29:14.409658] I [client.c:2280:notify] 0-host-client-1:
parent translators are ready, attempting connect on transport
[2015-03-30 22:29:14.409844] I [client.c:2280:notify] 0-host-client-2:
parent translators are ready, attempting connect on transport
Final graph:

....

[2015-03-30 22:29:14.411045] I [client.c:2215:client_rpc_notify]
0-host-client-2: disconnected from host-client-2. Client process will
keep trying to connect to glusterd until brick's port is available
[2015-03-30 22:29:14.411063] E [MSGID: 108006]
[afr-common.c:3591:afr_notify] 0-myvolume-replicate-0: All subvolumes
are down. Going offline until atleast one of them comes back up.
[2015-03-30 22:29:14.414871] I [fuse-bridge.c:5080:fuse_graph_setup]
0-fuse: switched to graph 0
[2015-03-30 22:29:14.415003] I [fuse-bridge.c:4009:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22
kernel 7.17
[2015-03-30 22:29:14.415101] I [afr-common.c:3722:afr_local_init]
0-myvolume-replicate-0: no subvolumes up
[2015-03-30 22:29:14.415215] I [afr-common.c:3722:afr_local_init]
0-myvolume-replicate-0: no subvolumes up
[2015-03-30 22:29:14.415236] W [fuse-bridge.c:779:fuse_attr_cbk]
0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not
connected)
[2015-03-30 22:29:14.419007] I [fuse-bridge.c:4921:fuse_thread_proc]
0-fuse: unmounting /opt/shared
[2015-03-30 22:29:14.420176] W [glusterfsd.c:1194:cleanup_and_exit]
(--> 0-: received signum (15), shutting down
[2015-03-30 22:29:14.420192] I [fuse-bridge.c:5599:fini] 0-fuse:
Unmounting '/opt/shared'.


_Relevant /etc/fstab entries are:_

/dev/xvdb /opt/local xfs defaults,noatime,nodiratime 0 0

localhost:/myvolume /opt/shared glusterfs
defaults,_netdev,attribute-timeout=0,entry-timeout=0,log-file=/var/log/glusterfs/glusterfs.log,backup-volfile-servers=10.12.130.21:10.12.130.22:10.12.130.23

0 0


_Volume configuration is:_

Volume Name: myvolume
Type: Replicate
Volume ID: xxxx
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: host1:/opt/local/brick
Brick2: host2:/opt/local/brick
Brick3: host3:/opt/local/brick
Options Reconfigured:
storage.health-check-interval: 5
network.ping-timeout: 5
nfs.disable: on
auth.allow: 10.12.130.21,10.12.130.22,10.12.130.23
cluster.quorum-type: auto
network.frame-timeout: 60


I run Debian 7 with GlusterFS version 3.6.2-2.

While I could put together some rc.local-type script that retries
mounting the volume until it succeeds or times out, I was wondering if
there's a better way to solve this problem.
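
For reference, a minimal sketch of that retry approach in Python, assuming
the volume stays listed in /etc/fstab so that a plain "mount /opt/shared"
picks up the options already defined there; the deadline and interval
values below are arbitrary:

#!/usr/bin/env python
# Hypothetical rc.local-style helper: keep retrying the fstab mount of
# the GlusterFS volume until it succeeds or a deadline passes.
import subprocess
import sys
import time

MOUNT_POINT = "/opt/shared"
DEADLINE = 120   # seconds before giving up
INTERVAL = 5     # seconds between attempts

def is_mounted(path):
    # "mountpoint -q" exits 0 when the path is an active mount point.
    return subprocess.call(["mountpoint", "-q", path]) == 0

deadline = time.time() + DEADLINE
while time.time() < deadline:
    if is_mounted(MOUNT_POINT):
        sys.exit(0)
    # Mount using the options already defined for this entry in /etc/fstab.
    subprocess.call(["mount", MOUNT_POINT])
    time.sleep(INTERVAL)

sys.exit(1)

Running something like this from rc.local would simply defer the mount
until glusterd and the bricks are reachable.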

Thank you for your help.

Regards,
--
Rumen Telbizov
Unix Systems Administrator <http://telbizov.com>


--

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
