Hi all,

We are having a sporadic issue with our Gluster mounts that is affecting several of our Kubernetes environments. We are having trouble understanding what is causing it, and we could use some guidance from the pros!

Scenario

We have an environment running single-node Kubernetes with Heketi and several pods using Gluster mounts. The environment runs fine, and the mounts appear healthy, for up to several days. Then, suddenly, one or more (sometimes all) of the Gluster mounts report a stale mount and the affected brick is shut down. The containers using those mounts enter a crash loop that continues indefinitely until someone intervenes. To work around the crash loop, a user needs to trigger the bricks to start again: either by starting them manually, restarting the Gluster pod, or restarting the entire node (see the sketch below).
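For reference, the manual recovery looks roughly like the following. This is just a sketch: the volume name is the one from our logs below, and the pod name and namespace are placeholders.

# Inside the Gluster container: check which bricks are offline, then
# force-start the volume ("start ... force" only starts bricks that are
# not already running, so healthy bricks are untouched).
gluster volume status vol_d0a0dcf9903e236f68a3933c3060ec5a
gluster volume start vol_d0a0dcf9903e236f68a3933c3060ec5a force

# Or, from the node: delete the Gluster pod and let Kubernetes recreate it.
kubectl delete pod glusterfs-xxxxx -n <gluster-namespace>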
Diagnostics

Looking at the glusterd.log file, the error message at the time the problem starts looks something like this:

got disconnect from stale rpc on /var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick

This message occurs once for each brick that stops responding, and the brick does not recover on its own (a quick way to count the affected bricks is sketched after the excerpt below). Here is that same message again, with surrounding context included:

[2019-05-07 11:53:38.663362] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x3a7a5) [0x7f795f0d77a5] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2765) [0x7f795f17f765] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f79643180f5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/stop/pre/S29CTDB-teardown.sh --volname=vol_d0a0dcf9903e236f68a3933c3060ec5a --last=no
[2019-05-07 11:53:38.905338] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x3a7a5) [0x7f795f0d77a5] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe26c3) [0x7f795f17f6c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f79643180f5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/stop/pre/S30samba-stop.sh --volname=vol_d0a0dcf9903e236f68a3933c3060ec5a --last=no
[2019-05-07 11:53:38.982785] I [MSGID: 106542] [glusterd-utils.c:8253:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 8951
[2019-05-07 11:53:39.983244] I [MSGID: 106143] [glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick /var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick on port 49169
[2019-05-07 11:53:39.984656] W [glusterd-handler.c:6124:__glusterd_brick_rpc_notify] 0-management: got disconnect from stale rpc on /var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick
[2019-05-07 11:53:40.316466] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2019-05-07 11:53:40.316601] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped
[2019-05-07 11:53:40.316644] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2019-05-07 11:53:40.319650] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2019-05-07 11:53:40.319708] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped
[2019-05-07 11:53:40.321091] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2019-05-07 11:53:40.321132] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped
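Incidentally, a rough one-liner for counting how many times each brick was hit (this assumes the default glusterd log location; adjust the path for your container):

# Print the brick path from each stale-rpc disconnect, then count per brick.
sed -n 's/.*got disconnect from stale rpc on //p' /var/log/glusterfs/glusterd.log | sort | uniq -c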
The version of gluster we are using (running in a container, using the gluster/gluster-centos image from Docker Hub):

# rpm -qa | grep gluster
glusterfs-rdma-4.1.7-1.el7.x86_64
gluster-block-0.3-2.el7.x86_64
python2-gluster-4.1.7-1.el7.x86_64
centos-release-gluster41-1.0-3.el7.centos.noarch
glusterfs-4.1.7-1.el7.x86_64
glusterfs-api-4.1.7-1.el7.x86_64
glusterfs-cli-4.1.7-1.el7.x86_64
glusterfs-geo-replication-4.1.7-1.el7.x86_64
glusterfs-libs-4.1.7-1.el7.x86_64
glusterfs-client-xlators-4.1.7-1.el7.x86_64
glusterfs-fuse-4.1.7-1.el7.x86_64
glusterfs-server-4.1.7-1.el7.x86_64

The version of gluster running on our Kubernetes node (a CentOS system):

$ rpm -qa | grep gluster
glusterfs-libs-3.12.2-18.el7.x86_64
glusterfs-3.12.2-18.el7.x86_64
glusterfs-fuse-3.12.2-18.el7.x86_64
glusterfs-client-xlators-3.12.2-18.el7.x86_64

The Kubernetes version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Full Gluster logs are available if needed; just let me know how to provide them (one way we could package them up is sketched below).
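For example, something along these lines should pull the whole log directory out of the Gluster pod (pod name and namespace are placeholders):

# Copy the glusterfs log directory out of the pod, then compress it.
kubectl cp <gluster-namespace>/glusterfs-xxxxx:/var/log/glusterfs ./glusterfs-logs
tar czf glusterfs-logs.tar.gz glusterfs-logs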
Thanks in advance for any help or suggestions on this!

Best,
Jeff Bischoff
Turbonomic