On Fri, Jun 30, 2017 at 1:31 AM, Jan <jan.h.zak@xxxxxxxxx> wrote:
Hi all,

Gluster and Ganesha are amazing. Thank you for this great work!

I'm struggling with one issue and I think that you might be able to help me. I spent some time playing with Gluster and Ganesha and after I gained some experience I decided that I should go into production, but I'm still struggling with one issue.

I have 3x node CentOS 7.3 with the most current Gluster and Ganesha from the centos-gluster310 repository (3.10.2-1.el7) with replicated bricks. Servers have a lot of resources and they run in a subnet on a stable network.

I didn't have any issues when I tested a single brick. But now I'd like to set up 17 replicated bricks and I realized that when I restart one of the nodes then the result looks like this:

sudo gluster volume status | grep ' N '
Brick glunode0:/st/brick3/dir    N/A    N/A    N    N/A
Brick glunode1:/st/brick2/dir    N/A    N/A    N    N/A

Some bricks just don't go online. Sometimes it's one brick, sometimes three, and it's not the same brick – it's a random issue.

I checked the log on the affected servers and this is an example:

sudo tail /var/log/glusterfs/bricks/st-brick3-0.log

[2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.2.44.23:24007 failed (No data available)
[2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data available)
[2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f3158032dc5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f31596cbdfb] ) 0-: received signum (15), shutting down
[2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs: connection attempt on 10.2.44.23:24007 failed, (Network is unreachable)
This happens when the connect() syscall fails with ENETUNREACH errno, as per the following code:
if (ign_enoent) {
        ret = connect_loop (priv->sock,
                            SA (&this->peerinfo.sockaddr),
                            this->peerinfo.sockaddr_len);
} else {
        ret = connect (priv->sock,
                       SA (&this->peerinfo.sockaddr),
                       this->peerinfo.sockaddr_len);
}

if (ret == -1 && errno == ENOENT && ign_enoent) {
        gf_log (this->name, GF_LOG_WARNING,
                "Ignore failed connection attempt on %s, (%s) ",
                this->peerinfo.identifier, strerror (errno));

        /* connect failed with some other error than EINPROGRESS
           so, getsockopt (... SO_ERROR ...), will not catch any
           errors and return them to us, we need to remember this
           state, and take actions in socket_event_handler
           appropriately */
        /* TBD: What about ENOENT, we will do getsockopt there
           as well, so how is that exempt from such a problem? */
        priv->connect_failed = 1;
        this->connect_failed = _gf_true;

        goto handler;
}

if (ret == -1 && ((errno != EINPROGRESS) && (errno != ENOENT))) {
        /* For unix path based sockets, the socket path is
         * cryptic (md5sum of path) and may not be useful for
         * the user in debugging so log it in DEBUG
         */
        gf_log (this->name, ((sa_family == AF_UNIX) ?   <===== this is the log which gets generated
                             GF_LOG_DEBUG : GF_LOG_ERROR),
                "connection attempt on %s failed, (%s)",
                this->peerinfo.identifier, strerror (errno));
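
For reference, here is a minimal standalone sketch (not GlusterFS code; the 10.2.44.23:24007 address and port are just taken from the log above) of how a non-blocking connect() behaves when the kernel has no route to the peer: instead of returning -1 with EINPROGRESS and completing the handshake asynchronously, it fails immediately with ENETUNREACH, which is exactly the branch that produces the "connection attempt on ... failed, (Network is unreachable)" message.

/* Minimal sketch, NOT GlusterFS code: shows that connect () on a
 * non-blocking TCP socket fails synchronously with ENETUNREACH when
 * there is no route to the destination, instead of the usual
 * EINPROGRESS. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int
main (void)
{
        int                 sock;
        struct sockaddr_in  peer;

        sock = socket (AF_INET, SOCK_STREAM, 0);
        fcntl (sock, F_SETFL, O_NONBLOCK);

        memset (&peer, 0, sizeof (peer));
        peer.sin_family = AF_INET;
        peer.sin_port   = htons (24007);        /* glusterd port, from the log */
        inet_pton (AF_INET, "10.2.44.23", &peer.sin_addr);

        if (connect (sock, (struct sockaddr *) &peer, sizeof (peer)) == -1) {
                if (errno == EINPROGRESS)
                        /* normal non-blocking case: handshake continues in the
                           background and errors are collected later via
                           getsockopt (SO_ERROR) in the event handler */
                        printf ("connect in progress\n");
                else
                        /* no route to the peer (e.g. NIC/route not yet up at
                           boot): ENETUNREACH -> "Network is unreachable" */
                        printf ("connect failed: %s\n", strerror (errno));
        }

        close (sock);
        return 0;
}

Under that assumption, the "Network is unreachable" message simply means there was no route to 10.2.44.23 at the moment this brick process tried to reach glusterd on port 24007.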
IMO, this can only happen if there is an intermittent n/w failure.
@Raghavendra G / Mohit - do you have any other opinion?
[2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_request] 0-glusterfs: not connected (priv->connected = 0)
[2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)

I think that the important message is "Network is unreachable".

Questions:

1. Could you please tell me, is that normal when you have many bricks? The network is definitely stable, other servers use it without problems, and all servers run on the same pair of switches. My assumption is that many bricks try to connect at the same time and that doesn't work.

2. Is there an option to configure a brick to enable some kind of auto-reconnect or add some timeout?
   gluster volume set brick123 option456 abc ??

3. What is the recommended way to fix an offline brick on the affected server? I don't want to use "gluster volume stop/start" since the affected bricks are online on the other servers and there is no reason to completely turn them off.

Thank you,
Jan
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users