Thank you. I just upgraded my nodes to 5.0.1 and everything seems to be running smoothly. Plus, the cluster.data-self-heal=off reconfigured option went away during the update, so I guess I'm back to nominal.

    Hoggins!

On 24/10/2018 at 13:57, Ravishankar N wrote:
>
>
>
> On 10/24/2018 05:16 PM, Hoggins! wrote:
>> Thank you, it's working as expected.
>>
>> I guess it's only safe to put cluster.data-self-heal back on when I get an updated version of GlusterFS?
> Yes, correct. Also, you will still need to restart shd whenever you hit this issue until you upgrade.
> -Ravi
>> Hoggins!
>>
>> On 24/10/2018 at 11:53, Ravishankar N wrote:
>>> On 10/24/2018 02:38 PM, Hoggins! wrote:
>>>> Thanks, that's helping a lot, I will do that.
>>>>
>>>> One more question: should the glustershd restart be performed on the arbiter only, or on each node of the cluster?
>>> If you do a 'gluster volume start volname force' it will restart the shd on all nodes.
>>> -Ravi
>>>> Thanks!
>>>>
>>>> Hoggins!
>>>>
>>>> On 24/10/2018 at 02:55, Ravishankar N wrote:
>>>>> On 10/23/2018 10:01 PM, Hoggins! wrote:
>>>>>> Hello there,
>>>>>>
>>>>>> I'm stumbling on the *exact same issue*, and unfortunately setting server.tcp-user-timeout to 42 does not help.
>>>>>> Any other suggestions?
>>>>>>
>>>>>> I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a node) suggested by Sam works, but it often doesn't.
>>>>>>
>>>>>> You may ask how I got into this; it's simple: I needed to replace my brick 1 and brick 2 with two brand-new machines, so here's what I did:
>>>>>> - add bricks 3 and 4 to the cluster (gluster peer probe, gluster volume add-brick, etc., with the caveat that the arbiter node first has to be removed from the cluster before bricks 3 and 4 can be added)
>>>>>> - wait for all the files on my volumes to heal (it took a few days)
>>>>>> - remove bricks 1 and 2
>>>>>> - after having "reset" the arbiter, re-add the arbiter to the cluster
>>>>>>
>>>>>> And now it intermittently hangs when writing *to existing files*. There is *no problem writing new files* on the volumes.
>>>>> Hi,
>>>>>
>>>>> There was an arbiter volume hang issue that was fixed [1] recently. The fix has been back-ported to all release branches.
>>>>>
>>>>> One workaround to overcome the hangs is to (1) turn off 'cluster.data-self-heal' on the volume (e.g. testvol) and remount the clients, *and* (2) restart glustershd (via volume start force). The hang is caused by an unreleased lock from self-heal. There are other ways to release the stale lock, via the gluster clear-locks command or by tweaking features.locks-revocation-secs, but restarting shd whenever you see the issue is the easiest and safest way.
>>>>>
>>>>> -Ravi
>>>>>
>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802
>>>>>
>>>>>
>>>>>> I'm lost here, thanks for your input!
>>>>>>
>>>>>> Hoggins!
>>>>>>
>>>>>> On 14/09/2018 at 04:16, Amar Tumballi wrote:
>>>>>>> On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod <mailinglists@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>     I apologise for this being posted twice - I'm not sure if that was user error or a bug in the mailing list, but the list wasn't showing my post after quite some time, so I sent a second email which showed up near-immediately - that's mailing lists, I guess...
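A minimal command sketch of the workaround Ravi describes above, assuming a volume named staging_static (the volume discussed later in this thread) and a client mount point of /mnt/staging_static - both are placeholders, adjust to your setup:

    # 1. disable client-side data self-heal on the volume, then remount the clients
    gluster volume set staging_static cluster.data-self-heal off
    umount /mnt/staging_static
    mount -t glusterfs <server>:/staging_static /mnt/staging_static

    # 2. restart the self-heal daemon; per Ravi above, 'start ... force'
    #    restarts shd on all nodes without stopping the volume
    gluster volume start staging_static force

    # alternatives Ravi mentions for releasing the stale lock (use with care;
    # check 'gluster volume help' for the exact clear-locks syntax first):
    #   gluster volume clear-locks staging_static <path> kind granted inode
    #   gluster volume set staging_static features.locks-revocation-secs <seconds>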
>>>>>>>
>>>>>>>     Anyway, if anyone has any input, advice or abuse, I'd welcome it!
>>>>>>>
>>>>>>>
>>>>>>> We got back to this a little late. But after running tests internally, we found that a missing volume option is possibly the reason for this:
>>>>>>>
>>>>>>> Try
>>>>>>>
>>>>>>>     gluster volume set <volname> server.tcp-user-timeout 42
>>>>>>>
>>>>>>> on your volume. Let us know if this helps.
>>>>>>> (Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sam McLeod
>>>>>>> https://smcleod.net
>>>>>>> https://twitter.com/s_mcleod
>>>>>>>
>>>>>>>> On 3 Sep 2018, at 1:20 pm, Sam McLeod <mailinglists@xxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> We've got an odd problem where clients are blocked from writing to Gluster volumes until the first node of the Gluster cluster is rebooted.
>>>>>>>>
>>>>>>>> I suspect I've either configured something incorrectly with the arbiter / replica configuration of the volumes, or there is some sort of bug in the gluster client-server connection that we're triggering.
>>>>>>>>
>>>>>>>> I was wondering if anyone has seen this or could point me in the right direction?
>>>>>>>>
>>>>>>>>
>>>>>>>> *Environment:*
>>>>>>>>
>>>>>>>>   * Topology: 3 node cluster, replica 2, arbiter 1 (third node is metadata only).
>>>>>>>>   * Version: Client and Servers both running 4.1.3, both on CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast networked SSD storage backing them, XFS.
>>>>>>>>   * Client: Native Gluster FUSE client mounting via the kubernetes provider
>>>>>>>>
>>>>>>>>
>>>>>>>> *Problem:*
>>>>>>>>
>>>>>>>>   * Seemingly at random, some clients will be blocked / unable to write to what should be a highly available gluster volume.
>>>>>>>>   * The client gluster logs show it failing to do new file operations across various volumes and all three nodes of the gluster.
>>>>>>>>   * The server gluster (or OS) logs do not show any warnings or errors.
>>>>>>>>   * The client recovers and is able to write to volumes again after the first node of the gluster cluster is rebooted.
>>>>>>>>   * Until the first node of the gluster cluster is rebooted, the client fails to write to the volume that is (or should be) available on the second node (a replica) and the third node (an arbiter-only node).
>>>>>>>>
>>>>>>>>
>>>>>>>> *What 'fixes' the issue:*
>>>>>>>>
>>>>>>>>   * Although the clients (kubernetes hosts) connect to all 3 nodes of the Gluster cluster, restarting the first gluster node always unblocks the IO and allows the client to continue writing.
>>>>>>>>   * Stopping and starting the glusterd service on the gluster server is not enough to fix the issue, nor is restarting its networking.
>>>>>>>>   * This suggests to me that the volume is unavailable for writing for some reason, and restarting the first node in the cluster clears some sort of stale TCP session, either between client and server or in the server-to-server replication.
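A small sketch of read-only checks that can be run on the gluster servers while a client is blocked, to confirm whether the peers and bricks still look healthy before resorting to a reboot (staging_static is the volume named later in this mail; substitute your own):

    # peer membership and brick/daemon status as glusterd sees them
    gluster peer status
    gluster volume status staging_static

    # pending or stuck self-heals on the volume
    gluster volume heal staging_static info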
>>>>>>>>
>>>>>>>>
>>>>>>>> *Expected behaviour:*
>>>>>>>>
>>>>>>>>   * If the first gluster node / server had failed or was blocked from performing operations for some reason (which it doesn't seem it is), I'd expect the clients to access data from the second gluster node and write metadata to the third gluster node as well, since it's an arbiter / metadata-only node.
>>>>>>>>   * If for some reason a gluster node was not able to serve connections to clients, I'd expect to see errors in the volume, glusterd or brick log files (there are none on the first gluster node).
>>>>>>>>   * If the first gluster node was for some reason blocking IO on a volume, I'd expect that node to show as unhealthy or unavailable in gluster peer status or gluster volume status.
>>>>>>>>
>>>>>>>>
>>>>>>>> *Client gluster errors:*
>>>>>>>>
>>>>>>>>   * staging_static in this example is a volume name.
>>>>>>>>   * You can see the client trying to connect to the second and third nodes of the gluster cluster and failing (unsure as to why?).
>>>>>>>>   * The server-side logs on the first gluster node do not show any errors or problems, but the second / third nodes show errors in glusterd.log when trying to 'unlock' the 0-management volume on the first node.
>>>>>>>>
>>>>>>>>
>>>>>>>> *On a gluster client* (a kubernetes host using the kubernetes connector, which uses the native fuse client) when it's blocked from writing but the gluster appears healthy (other than the errors mentioned later):
>>>>>>>>
>>>>>>>> [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 15:03:22.417773. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>> [2018-09-02 15:33:22.750989] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
>>>>>>>> [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 15:33:22.765751. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>> [2018-09-02 16:03:23.097988] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
>>>>>>>> [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 16:03:23.098133. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>> [2018-09-02 16:33:23.439282] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
>>>>>>>> [2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 16:33:23.455171.
timeout = 1800 for <ip of second gluster node>:49154 >>>>>>>> [2018-09-02 17:03:23.786971] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 17:33:24.160607] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x1dc8 sent = 2018-09-02 >>>>>>>> 17:03:23.787120. timeout = 1800 for <ip of third gluster node>:49154 >>>>>>>> [2018-09-02 17:33:24.160720] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 18:03:24.505092] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x2faf sent = 2018-09-02 >>>>>>>> 17:33:24.173153. timeout = 1800 for <ip of second gluster node>:49154 >>>>>>>> [2018-09-02 18:03:24.505185] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 18:33:24.841248] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x1e45 sent = 2018-09-02 >>>>>>>> 18:03:24.505328. timeout = 1800 for <ip of third gluster node>:49154 >>>>>>>> [2018-09-02 18:33:24.841311] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 19:03:25.204711] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x3074 sent = 2018-09-02 >>>>>>>> 18:33:24.855372. timeout = 1800 for <ip of second gluster node>:49154 >>>>>>>> [2018-09-02 19:03:25.204784] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 19:33:25.533545] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x1ec2 sent = 2018-09-02 >>>>>>>> 19:03:25.204977. timeout = 1800 for <ip of third gluster node>:49154 >>>>>>>> [2018-09-02 19:33:25.533611] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 20:03:25.877020] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x3138 sent = 2018-09-02 >>>>>>>> 19:33:25.545921. timeout = 1800 for <ip of second gluster node>:49154 >>>>>>>> [2018-09-02 20:03:25.877098] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 20:33:26.217858] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x1f3e sent = 2018-09-02 >>>>>>>> 20:03:25.877264. 
timeout = 1800 for <ip of third gluster node>:49154 >>>>>>>> [2018-09-02 20:33:26.217973] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 21:03:26.588237] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x31ff sent = 2018-09-02 >>>>>>>> 20:33:26.233010. timeout = 1800 for <ip of second gluster node>:49154 >>>>>>>> [2018-09-02 21:03:26.588316] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 21:33:26.912334] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x1fbb sent = 2018-09-02 >>>>>>>> 21:03:26.588456. timeout = 1800 for <ip of third gluster node>:49154 >>>>>>>> [2018-09-02 21:33:26.912449] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 22:03:37.258915] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x32c5 sent = 2018-09-02 >>>>>>>> 21:33:32.091009. timeout = 1800 for <ip of second gluster node>:49154 >>>>>>>> [2018-09-02 22:03:37.259000] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 22:33:37.615497] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x2039 sent = 2018-09-02 >>>>>>>> 22:03:37.259147. timeout = 1800 for <ip of third gluster node>:49154 >>>>>>>> [2018-09-02 22:33:37.615574] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 23:03:37.940969] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x3386 sent = 2018-09-02 >>>>>>>> 22:33:37.629655. timeout = 1800 for <ip of second gluster node>:49154 >>>>>>>> [2018-09-02 23:03:37.941049] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-02 23:33:38.270998] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x20b5 sent = 2018-09-02 >>>>>>>> 23:03:37.941199. timeout = 1800 for <ip of third gluster node>:49154 >>>>>>>> [2018-09-02 23:33:38.271078] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-03 00:03:38.607186] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x3447 sent = 2018-09-02 >>>>>>>> 23:33:38.285899. 
timeout = 1800 for <ip of second gluster node>:49154 >>>>>>>> [2018-09-03 00:03:38.607263] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-03 00:33:38.934385] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x2131 sent = 2018-09-03 >>>>>>>> 00:03:38.607410. timeout = 1800 for <ip of third gluster node>:49154 >>>>>>>> [2018-09-03 00:33:38.934479] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-03 01:03:39.256842] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x350c sent = 2018-09-03 >>>>>>>> 00:33:38.948570. timeout = 1800 for <ip of second gluster node>:49154 >>>>>>>> [2018-09-03 01:03:39.256972] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> [2018-09-03 01:33:39.614402] E [rpc-clnt.c:184:call_bail] >>>>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>>>> v1) op(INODELK(29)) xid = 0x21ae sent = 2018-09-03 >>>>>>>> 01:03:39.258166. timeout = 1800 for <ip of third gluster node>:49154 >>>>>>>> [2018-09-03 01:33:39.614483] E [MSGID: 114031] >>>>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>>>> endpoint is not connected] >>>>>>>> >>>>>>>> >>>>>>>> *On the second gluster server:* >>>>>>>> >>>>>>>> >>>>>>>> We are seeing the following error in the glusterd.log file when >>>>>>>> the client is blocked from writing the volume, I think this is >>>>>>>> probably the most important information about the error and >>>>>>>> suggests a problem with the first node but doesn't explain the >>>>>>>> client behaviour: >>>>>>>> >>>>>>>> [2018-09-02 08:31:03.902272] E [MSGID: 106115] >>>>>>>> [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: >>>>>>>> Unlocking failed on <FQDN of the first gluster node>. Please >>>>>>>> check log file for details. >>>>>>>> [2018-09-02 08:31:03.902477] E [MSGID: 106151] >>>>>>>> [glusterd-syncop.c:1640:gd_unlock_op_phase] 0-management: Failed >>>>>>>> to unlock on some peer(s) >>>>>>>> >>>>>>>> Note in the above error: >>>>>>>> >>>>>>>> 1. I'm not sure which log to check (there doesn't seem to be a >>>>>>>> management brick / brick log)? >>>>>>>> 2. If there's a problem with the first node, why isn't it >>>>>>>> rejected from the gluster / taken offline / the health of the >>>>>>>> peers or volume list degraded? >>>>>>>> 3. Why does the client fail to write to the volume rather than >>>>>>>> (I'm assuming) trying the second (or third I guess) node to write >>>>>>>> to the volume? 
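The call_bail messages above are the client giving up on INODELK requests after the 1800-second frame timeout, i.e. it is waiting on an inode lock that is never granted - which matches the stale self-heal lock Ravi describes earlier in this thread. A hedged sketch of how such a lock can be inspected without rebooting, using a brick statedump (the default dump location and the grep pattern are assumptions; adjust to your installation):

    # ask the brick processes of the volume to dump their state (run on a server)
    gluster volume statedump staging_static

    # dumps are normally written under /var/run/gluster/ on each brick host;
    # an inodelk entry that stays granted/blocked across two dumps taken a few
    # minutes apart points at the stale lock
    grep -i -A 3 'inodelk' /var/run/gluster/*.dump.*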
>>>>>>>> >>>>>>>> >>>>>>>> We are also seeing the following errors repeated a lot in the >>>>>>>> logs, both when the volumes are working and when there's an issue >>>>>>>> in the brick log >>>>>>>> (/var/log/glusterfs/bricks/mnt-gluster-storage-staging_static.log): >>>>>>>> >>>>>>>> [2018-09-03 01:58:35.128923] E [server.c:137:server_submit_reply] >>>>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>>>> [0x7f8470319d14] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>>>> [0x7f846bdde24a] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>>>> [2018-09-03 01:58:35.128957] E >>>>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>>>> submit message (XID: 0x3d60, Program: GlusterFS 4.x v1, ProgVers: >>>>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>>>> [2018-09-03 01:58:35.128983] E [server.c:137:server_submit_reply] >>>>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>>>> [0x7f8470319d14] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>>>> [0x7f846bdde24a] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>>>> [2018-09-03 01:58:35.129016] E >>>>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>>>> submit message (XID: 0x3e2a, Program: GlusterFS 4.x v1, ProgVers: >>>>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>>>> [2018-09-03 01:58:35.129042] E [server.c:137:server_submit_reply] >>>>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>>>> [0x7f8470319d14] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>>>> [0x7f846bdde24a] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>>>> [2018-09-03 01:58:35.129077] E >>>>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>>>> submit message (XID: 0x3ef6, Program: GlusterFS 4.x v1, ProgVers: >>>>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>>>> [2018-09-03 01:58:35.129149] E [server.c:137:server_submit_reply] >>>>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>>>> [0x7f8470319d14] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>>>> [0x7f846bdde24a] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>>>> [2018-09-03 01:58:35.129191] E >>>>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>>>> submit message (XID: 0x3fc6, Program: GlusterFS 4.x v1, ProgVers: >>>>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>>>> [2018-09-03 01:58:35.129219] E [server.c:137:server_submit_reply] >>>>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>>>> [0x7f8470319d14] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>>>> [0x7f846bdde24a] >>>>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> *Gluster volume information:* >>>>>>>> >>>>>>>> >>>>>>>> # gluster volume info staging_static >>>>>>>> >>>>>>>> Volume Name: staging_static >>>>>>>> Type: Replicate >>>>>>>> Volume 
ID: 7f3b8e91-afea-4fc6-be83-3399a089b6f3
>>>>>>>> Status: Started
>>>>>>>> Snapshot Count: 0
>>>>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>>>>> Transport-type: tcp
>>>>>>>> Bricks:
>>>>>>>> Brick1: <first gluster node.fqdn>:/mnt/gluster-storage/staging_static
>>>>>>>> Brick2: <second gluster node.fqdn>:/mnt/gluster-storage/staging_static
>>>>>>>> Brick3: <third gluster node.fqdn>:/mnt/gluster-storage/staging_static (arbiter)
>>>>>>>> Options Reconfigured:
>>>>>>>> storage.fips-mode-rchecksum: true
>>>>>>>> cluster.self-heal-window-size: 16
>>>>>>>> cluster.shd-wait-qlength: 4096
>>>>>>>> cluster.shd-max-threads: 8
>>>>>>>> performance.cache-min-file-size: 2KB
>>>>>>>> performance.rda-cache-limit: 1GB
>>>>>>>> network.inode-lru-limit: 50000
>>>>>>>> server.outstanding-rpc-limit: 256
>>>>>>>> transport.listen-backlog: 2048
>>>>>>>> performance.write-behind-window-size: 512MB
>>>>>>>> performance.stat-prefetch: true
>>>>>>>> performance.io-thread-count: 16
>>>>>>>> performance.client-io-threads: true
>>>>>>>> performance.cache-size: 1GB
>>>>>>>> performance.cache-refresh-timeout: 60
>>>>>>>> performance.cache-invalidation: true
>>>>>>>> cluster.use-compound-fops: true
>>>>>>>> cluster.readdir-optimize: true
>>>>>>>> cluster.lookup-optimize: true
>>>>>>>> cluster.favorite-child-policy: size
>>>>>>>> cluster.eager-lock: true
>>>>>>>> client.event-threads: 4
>>>>>>>> nfs.disable: on
>>>>>>>> transport.address-family: inet
>>>>>>>> diagnostics.brick-log-level: ERROR
>>>>>>>> diagnostics.client-log-level: ERROR
>>>>>>>> features.cache-invalidation-timeout: 300
>>>>>>>> features.cache-invalidation: true
>>>>>>>> network.ping-timeout: 15
>>>>>>>> performance.cache-max-file-size: 3MB
>>>>>>>> performance.md-cache-timeout: 300
>>>>>>>> server.event-threads: 4
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sam McLeod (protoporpoise on IRC)
>>>>>>>> https://smcleod.net
>>>>>>>> https://twitter.com/s_mcleod
>>>>>>>>
>>>>>>>> Words are my own opinions and do not necessarily represent those of my employer or partners.
>>>>>>>
>>>>>>> --
>>>>>>> Amar Tumballi (amarts)
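For reference, a sketch of applying the option Amar suggested to the volume shown above and double-checking the values afterwards; the final reset is only for once a release carrying the fix from bug 1637802 is installed (staging_static is the example volume, commands are illustrative):

    # apply the suggested TCP user timeout
    gluster volume set staging_static server.tcp-user-timeout 42

    # confirm the effective values glusterd reports for the volume
    gluster volume get staging_static server.tcp-user-timeout
    gluster volume get staging_static cluster.data-self-heal

    # after upgrading to a fixed release, revert the workaround option
    gluster volume reset staging_static cluster.data-self-heal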
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users