Intermittent mount disconnect due to socket poller error

We've been on the Gluster 3.7 series for several years with things pretty stable. Since it has reached EOL, yesterday I upgraded to 3.13.2. I took every Gluster mount and server down for the upgrade, brought them back up, changed the op-version to 31302, and then tried it all out.

It went poorly. Every sizable read and write (hundreds of MB) led to 'Transport endpoint not connected' errors on the command line and immediate unavailability of the mount. After searching unsuccessfully for similar problems with solutions, I ended up downgrading to 3.12.6 and changing the op-version to 31202. That brought us back to usability, with enough of those operations succeeding to consider it online, but there are still occasional mount disconnects that we never saw with 3.7: about 6 in the past 18 hours.

A disconnected mount never seems to recover on its own; it has to be remounted manually, and then it reconnects immediately. Each disconnect affects only the one client, though some clients have disconnected simultaneously due to simultaneous activity.

The lower-level log info seems to indicate a socket problem, potentially broken on the client side based on timing (but the timing is narrow, and I won't claim the clocks are that well synchronized across all our servers). The client and one server claim a socket polling error with no data available, and the other server claims a writev error. This seems to lead the client to the 'all subvolumes are down' state, even though all other clients are still connected.

Has anybody run into this? Did I miss anything moving so many versions ahead?

I've included the output of volume info and some excerpts from the logs. We have two servers running glusterd and two replica volumes with a brick on each server. Both experience disconnects; there are about 10 clients for each, with one using both. We use SSL over internal IPv4. Names in all caps were replaced, as were IP addresses.

Let me know if there's anything else I can provide.

% gluster v info VOL
Volume Name: VOL
Type: Replicate
Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: SERVER1:/glusterfs/VOL-brick1/data
Brick2: SERVER2:/glusterfs/VOL-brick2/data
Options Reconfigured:
config.transport: tcp
features.selinux: off
transport.address-family: inet
nfs.disable: on
client.ssl: on
performance.readdir-ahead: on
auth.ssl-allow: [NAMES, including CLIENT]
server.ssl: on
ssl.certificate-depth: 3

Log excerpts (there was nothing related in glusterd.log):

CLIENT:/var/log/glusterfs/mnt-VOL.log
[2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)
[2018-02-28 19:35:58.477154] E [MSGID: 108006] [afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2018-02-28 19:35:58.486146] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67 times>
<lots of saved_frames_unwind messages>
[2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
[2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)

<manual umount / mount>


SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
[2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller] 0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.380530] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply] 0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa4e, Program: GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.uploads-server)
[2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a) [0x7f97bd37ee6a] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) [0x7f97bcf1f4c8] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5) [0x7f97bcf0abd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.381196] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update] 0-/glusterfs/uploads-brick2/data: allowed = "*", received addr = "CLIENT"
[2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login: connecting user name: CLIENT

SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
[2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv] 0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
[2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller] 0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.509957] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply] 0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x4b3f, Program: GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.VOL-server)
[2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a) [0x7f85bb7a8e6a] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) [0x7f85bb3494c8] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5) [0x7f85bb334bd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.510409] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update] 0-/glusterfs/uploads-brick1/data: allowed = "*", received addr = "CLIENT"
[2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login: connecting user name: CLIENT
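
Since the "client side first" reading rests entirely on timing, it may help to merge the three excerpts into one timeline. The following is only a rough sketch: the names parse_entries, merged_timeline and SAMPLE are made up for illustration, the sample holds just the first entry from each excerpt above, and the resulting order is only as trustworthy as the clock synchronization between the hosts.

```python
import re
from datetime import datetime

# Matches the leading "[timestamp] LEVEL " of a GlusterFS log entry.
ENTRY_START = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\] ([A-Z]) "
)


def parse_entries(host, text):
    """Split raw log text into (timestamp, host, level, message) tuples.

    Works even when several entries were pasted onto a single line,
    because entries are delimited by their timestamp prefix, not by newlines.
    """
    starts = list(ENTRY_START.finditer(text))
    entries = []
    for i, match in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        message = text[match.end():end].strip()
        entries.append((stamp, host, match.group(2), message))
    return entries


def merged_timeline(logs):
    """Merge entries from several hosts and sort them by timestamp."""
    events = []
    for host, text in logs.items():
        events.extend(parse_entries(host, text))
    return sorted(events, key=lambda event: event[0])


# First entry of each excerpt from this message (hypothetical container name).
SAMPLE = {
    "SERVER1": "[2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv] "
               "0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)",
    "SERVER2": "[2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller] "
               "0-tcp.VOL-server: poll error on socket",
    "CLIENT": "[2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller] "
              "0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)",
}

if __name__ == "__main__":
    timeline = merged_timeline(SAMPLE)
    first = timeline[0][0]
    for stamp, host, level, message in timeline:
        # Offset from the earliest event, in milliseconds.
        offset_ms = (stamp - first).total_seconds() * 1000
        print(f"+{offset_ms:9.3f} ms  {host:8} {level}  {message}")
```

With these three entries the client error sorts first, SERVER2 follows roughly 1.6 ms later, and SERVER1 about 131 ms after the client. A gap as small as the first one is well within ordinary clock skew, so it should not be taken as proof of which side failed first.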