Is it possible to attach logfiles of problematic client and bricks?
On Thu, Mar 1, 2018 at 3:00 AM, Ryan Lee <ryanlee@xxxxxxxxxxxx
<mailto:ryanlee@xxxxxxxxxxxx>> wrote:
We've been on the Gluster 3.7 series for several years with things
pretty stable. Given that it's reached EOL, yesterday I upgraded to
3.13.2. Every Gluster mount and server was disabled then brought
back up after the upgrade, changing the op-version to 31302 and then
trying it all out.
It went poorly. Every sizable read and write (100's MB) lead to
'Transport endpoint not connected' errors on the command line and
immediate unavailability of the mount. After unsuccessfully trying
to search for similar problems with solutions, I ended up
downgrading to 3.12.6 and changing the op-version to 31202. That
brought us back to usability with the majority of those operations
succeeding enough to consider it online, but there are still
occasional mount disconnects that we never saw with 3.7 - about 6 in
the past 18 hours. It seems these disconnects would never come
back, either, unless manually re-mounted. Manually remounting
reconnects immediately. They only disconnect the affected client,
though some simultaneous disconnects have occurred due to
simultaneous activity. The lower-level log info seems to indicate a
socket problem, potentially broken on the client side based on
timing (but the timing is narrow, and I won't claim the clocks are
that well synchronized across all our servers). The client and one
server claim a socket polling error with no data available, and the
other server claims a writev error. This seems to lead the client
to the 'all subvolumes are down' state, even though all other
clients are still connected. Has anybody run into this? Did I miss
anything moving so many versions ahead?
I've included the output of volume info and some excerpts from the
logs. We have two servers running glusterd and two replica volumes
with a brick on each server. Both experience disconnects; there are
about 10 clients for each, with one using both. We use SSL over
internal IPv4. Names in all caps were replaced, as were IP addresses.
Let me know if there's anything else I can provide.
% gluster v info VOL
Volume Name: VOL
Type: Replicate
Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: SERVER1:/glusterfs/VOL-brick1/data
Brick2: SERVER2:/glusterfs/VOL-brick2/data
Options Reconfigured:
config.transport: tcp
features.selinux: off
transport.address-family: inet
nfs.disable: on
client.ssl: on
performance.readdir-ahead: on
auth.ssl-allow: [NAMES, including CLIENT]
server.ssl: on
ssl.certificate-depth: 3
Log excerpts (there was nothing related in glusterd.log):
CLIENT:/var/log/glusterfs/mnt-VOL.log
[2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller]
0-VOL-client-1: socket_poller SERVER2:49153 failed (No data
available)
[2018-02-28 19:35:58.477154] E [MSGID: 108006]
[afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0:
All subvolumes are down. Going offline until atleast one of them
comes back up.
[2018-02-28 19:35:58.486146] E [MSGID: 101046]
[dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67
times>
<lots of saved_frames_unwind messages>
[2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller]
0-VOL-client-1: socket_poller SERVER2:24007 failed (No data
available)
[2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller]
0-VOL-client-1: socket_poller SERVER2:24007 failed (No data
available)
<manual umount / mount>
SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
[2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller]
0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.380530] I [MSGID: 115036]
[server.c:527:server_rpc_notify] 0-VOL-server: disconnecting
connection from
CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply]
0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0xa4e, Program:
GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport
(tcp.uploads-server)
[2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
[0x7f97bd37ee6a]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
[0x7f97bcf1f4c8]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
[0x7f97bcf0abd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.381196] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down
connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update]
0-/glusterfs/uploads-brick2/data: allowed = "*", received addr =
"CLIENT"
[2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login:
connecting user name: CLIENT
SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
[2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv]
0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
[2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller]
0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.509957] I [MSGID: 115036]
[server.c:527:server_rpc_notify] 0-VOL-server: disconnecting
connection from
CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply]
0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x4b3f, Program:
GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport
(tcp.VOL-server)
[2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
[0x7f85bb7a8e6a]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
[0x7f85bb3494c8]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
[0x7f85bb334bd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.510409] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down
connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update]
0-/glusterfs/uploads-brick1/data: allowed = "*", received addr =
"CLIENT"
[2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login:
connecting user name: CLIENT
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx <mailto:Gluster-users@xxxxxxxxxxx>
http://lists.gluster.org/mailman/listinfo/gluster-users
<http://lists.gluster.org/mailman/listinfo/gluster-users>