We've got an odd problem where clients are blocked from writing to Gluster volumes until the first node of the Gluster cluster is rebooted. I suspect I've either configured something incorrectly with the arbiter / replica configuration of the volumes, or there is some sort of bug in the gluster client-server connection that we're triggering. I was wondering if anyone has seen this or could point me in the right direction? Environment:
Problem:
What 'fixes' the issue:
Expected behaviour:
Client gluster errors:
On a gluster client (a kubernetes host using the kubernetes connector which uses the native fuse client) when its blocked from writing but the gluster appears healthy (other than the errors mentioned later): [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 15:03:22.417773. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-02 15:33:22.750989] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 15:33:22.765751. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-02 16:03:23.097988] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 16:03:23.098133. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-02 16:33:23.439282] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 16:33:23.455171. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-02 17:03:23.786971] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-02 17:33:24.160607] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1dc8 sent = 2018-09-02 17:03:23.787120. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-02 17:33:24.160720] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-02 18:03:24.505092] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2faf sent = 2018-09-02 17:33:24.173153. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-02 18:03:24.505185] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-02 18:33:24.841248] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1e45 sent = 2018-09-02 18:03:24.505328. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-02 18:33:24.841311] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-02 19:03:25.204711] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x3074 sent = 2018-09-02 18:33:24.855372. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-02 19:03:25.204784] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-02 19:33:25.533545] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1ec2 sent = 2018-09-02 19:03:25.204977. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-02 19:33:25.533611] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-02 20:03:25.877020] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x3138 sent = 2018-09-02 19:33:25.545921. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-02 20:03:25.877098] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-02 20:33:26.217858] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1f3e sent = 2018-09-02 20:03:25.877264. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-02 20:33:26.217973] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-02 21:03:26.588237] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x31ff sent = 2018-09-02 20:33:26.233010. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-02 21:03:26.588316] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-02 21:33:26.912334] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1fbb sent = 2018-09-02 21:03:26.588456. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-02 21:33:26.912449] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-02 22:03:37.258915] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x32c5 sent = 2018-09-02 21:33:32.091009. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-02 22:03:37.259000] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-02 22:33:37.615497] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2039 sent = 2018-09-02 22:03:37.259147. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-02 22:33:37.615574] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-02 23:03:37.940969] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x3386 sent = 2018-09-02 22:33:37.629655. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-02 23:03:37.941049] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-02 23:33:38.270998] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x20b5 sent = 2018-09-02 23:03:37.941199. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-02 23:33:38.271078] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-03 00:03:38.607186] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x3447 sent = 2018-09-02 23:33:38.285899. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-03 00:03:38.607263] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-03 00:33:38.934385] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = 2018-09-03 00:03:38.607410. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-03 00:33:38.934479] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] [2018-09-03 01:03:39.256842] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x350c sent = 2018-09-03 00:33:38.948570. timeout = 1800 for <ip of second gluster node>:49154 [2018-09-03 01:03:39.256972] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected] [2018-09-03 01:33:39.614402] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x21ae sent = 2018-09-03 01:03:39.258166. timeout = 1800 for <ip of third gluster node>:49154 [2018-09-03 01:33:39.614483] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected] On the second gluster server: We are seeing the following error in the glusterd.log file when the client is blocked from writing the volume, I think this is probably the most important information about the error and suggests a problem with the first node but doesn't explain the client behaviour: [2018-09-02 08:31:03.902272] E [MSGID: 106115] [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on <FQDN of the first gluster node>. Please check log file for details. [2018-09-02 08:31:03.902477] E [MSGID: 106151] [glusterd-syncop.c:1640:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s) Note in the above error: 1. I'm not sure which log to check (there doesn't seem to be a management brick / brick log)? 2. If there's a problem with the first node, why isn't it rejected from the gluster / taken offline / the health of the peers or volume list degraded? 3. Why does the client fail to write to the volume rather than (I'm assuming) trying the second (or third I guess) node to write to the volume? We are also seeing the following errors repeated a lot in the logs, both when the volumes are working and when there's an issue in the brick log (/var/log/glusterfs/bricks/mnt-gluster-storage-staging_static.log): [2018-09-03 01:58:35.128923] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed [2018-09-03 01:58:35.128957] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3d60, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 29) to rpc-transport (tcp.staging_static-server) [2018-09-03 01:58:35.128983] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed [2018-09-03 01:58:35.129016] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3e2a, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 29) to rpc-transport (tcp.staging_static-server) [2018-09-03 01:58:35.129042] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed [2018-09-03 01:58:35.129077] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3ef6, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 29) to rpc-transport (tcp.staging_static-server) [2018-09-03 01:58:35.129149] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed [2018-09-03 01:58:35.129191] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3fc6, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 29) to rpc-transport (tcp.staging_static-server) [2018-09-03 01:58:35.129219] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed Gluster volume information: # gluster volume info staging_static Volume Name: staging_static Type: Replicate Volume ID: 7f3b8e91-afea-4fc6-be83-3399a089b6f3 Status: Started Snapshot Count: 0 Number of Bricks: 1 x (2 + 1) = 3 Transport-type: tcp Bricks: Brick1: <first gluster node.fqdn>:/mnt/gluster-storage/staging_static Brick2: <second gluster node.fqdn>:/mnt/gluster-storage/staging_static Brick3: <third gluster node.fqdn>:/mnt/gluster-storage/staging_static (arbiter) Options Reconfigured: storage.fips-mode-rchecksum: true cluster.self-heal-window-size: 16 cluster.shd-wait-qlength: 4096 cluster.shd-max-threads: 8 performance.cache-min-file-size: 2KB performance.rda-cache-limit: 1GB network.inode-lru-limit: 50000 server.outstanding-rpc-limit: 256 transport.listen-backlog: 2048 performance.write-behind-window-size: 512MB performance.stat-prefetch: true performance.io-thread-count: 16 performance.client-io-threads: true performance.cache-size: 1GB performance.cache-refresh-timeout: 60 performance.cache-invalidation: true cluster.use-compound-fops: true cluster.readdir-optimize: true cluster.lookup-optimize: true cluster.favorite-child-policy: size cluster.eager-lock: true client.event-threads: 4 nfs.disable: on transport.address-family: inet diagnostics.brick-log-level: ERROR diagnostics.client-log-level: ERROR features.cache-invalidation-timeout: 300 features.cache-invalidation: true network.ping-timeout: 15 performance.cache-max-file-size: 3MB performance.md-cache-timeout: 300 server.event-threads: 4 Thanks in advance, -- Sam McLeod (protoporpoise on IRC) https://smcleod.net https://twitter.com/s_mcleod Words are my own opinions and do not necessarily represent those of my employer or partners. |
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx https://lists.gluster.org/mailman/listinfo/gluster-users