Some bricks are offline after restart, how to bring them online gracefully?

Jan <jan.h.zak@xxxxxxxxx> · Thu, 29 Jun 2017 21:01:22 +0100

Hi all,

Gluster and Ganesha are amazing. Thank you for this great work!

I spent some time experimenting with Gluster and Ganesha, and after gaining some experience I decided to go into production. I’m still struggling with one issue, though, and I think you might be able to help me.

I have 3 nodes running CentOS 7.3 with the most current Gluster and Ganesha from the centos-gluster310 repository (3.10.2-1.el7), with replicated bricks.

The servers have plenty of resources and run in a subnet on a stable network.

I didn’t have any issues when I tested a single brick. But now I’d like to set up 17 replicated bricks, and I noticed that when I restart one of the nodes, the result looks like this:

sudo gluster volume status | grep ' N '

Brick glunode0:/st/brick3/dir          N/A       N/A        N       N/A  
Brick glunode1:/st/brick2/dir          N/A       N/A        N       N/A  
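
The grep above matches any line that contains ' N ', so a stricter filter may be useful when scripting this check. The following is only a minimal sketch: it assumes each brick stays on a single line with the columns shown here (Brick, host:path, TCP port, RDMA port, Online, Pid), and the function name offline_bricks is hypothetical.

```shell
# Print the host:path of every brick whose Online column is "N".
# Assumes one brick per line, laid out as in the status output above:
#   $1 = "Brick", $2 = host:path, $3 = TCP port, $4 = RDMA port,
#   $5 = Online (Y/N), $6 = Pid
offline_bricks() {
    awk '$1 == "Brick" && $5 == "N" { print $2 }'
}
```

It would be used as `sudo gluster volume status | offline_bricks`. If a long brick path makes the status output wrap onto a second line, this field-based match would miss that brick.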

Some bricks just don’t come online. Sometimes it’s one brick, sometimes three, and it’s not always the same brick; the issue is random.

I checked the log on the affected servers and this is an example:

sudo tail /var/log/glusterfs/bricks/st-brick3-0.log 

[2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.2.44.23:24007 failed (No data available)
[2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data available)
[2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f3158032dc5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f31596cbdfb] ) 0-:received signum (15), shutting down
[2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs: connection attempt on 10.2.44.23:24007 failed, (Network is unreachable)
[2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_request] 0-glusterfs: not connected (priv->connected = 0)
[2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)

I think the important message is “Network is unreachable”.
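
To check whether other bricks logged the same error, the brick logs can be scanned for that string. This is a rough sketch, not a tested procedure: the default directory is inferred from the single log path shown above, and the function name unreachable_counts is hypothetical.

```shell
# For each brick log in a directory, print "<count> <file>" when the log
# contains at least one "Network is unreachable" line.
# The default directory is an assumption based on the path in this post.
unreachable_counts() {
    logdir="${1:-/var/log/glusterfs/bricks}"
    for f in "$logdir"/*.log; do
        [ -e "$f" ] || continue          # no matching files: skip the literal glob
        # grep -c prints 0 and exits non-zero when nothing matches
        n=$(grep -c 'Network is unreachable' "$f" || true)
        if [ "$n" -gt 0 ]; then
            printf '%s %s\n' "$n" "$f"
        fi
    done
}
```

Running `sudo sh -c '. ./script.sh; unreachable_counts'` (with script.sh being wherever the function is saved) would list only the logs that contain the error.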

Questions
1. Could you please tell me whether this is normal when you have many bricks? The network is definitely stable, other servers use it without problems, and all servers run on the same pair of switches. My assumption is that many bricks try to connect at the same time and that this doesn’t work.

2. Is there an option to configure a brick to enable some kind of auto-reconnect or to add a timeout? For example:
gluster volume set brick123 option456 abc ??

3. What is the recommended way to fix an offline brick on the affected server? I don’t want to use “gluster volume stop/start”, since the affected bricks are online on the other servers and there is no reason to take the volume down completely.

Thank you,
Jan