We're running a fairly large two-replica volume across two servers: approximately 20 TB of small files (1 KB-4 MB each). The volume is exported via NFS and mounted remotely by two clients.
For the past few weeks the Gluster brick processes have been randomly restarting. Luckily they've been doing so at off-peak times, so we didn't notice until our monitoring checks happened to pick up on a zombied 'glusterfs' process.
From the logs it looks like something is blocking communication to the brick processes, and Gluster restarts everything to compensate. So far I haven't been able to pin down the underlying cause.
I've included snippets from 'glustershd.log' and 'etc-glusterfs-glusterd.vol.log' below. If anyone can provide some insight into the issue it would be greatly appreciated, and I'm happy to provide further details as needed.
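To make the restart times easier to correlate with cron jobs, logrotate, and other scheduled activity, a quick grep over the shutdown events works well (log paths here are the stock locations on our machines; adjust as needed):

```shell
# List every "received signum" shutdown event with its timestamp,
# so restart times can be lined up against cron, logrotate, etc.
grep -h 'received signum' /var/log/glusterfs/*.log \
  | awk '{print $1, $2}' \
  | sort -u
```

So far the timestamps haven't lined up with anything obvious on our side.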
From etc-glusterfs-glusterd.vol.log:

[2016-01-29 05:03:47.039886] I [MSGID: 106144] [glusterd-pmap.c:274:pmap_registry_remove] 0-pmap: removing brick /export/data/brick02 on port 49155
[2016-01-29 05:03:47.075521] W [socket.c:588:__socket_rwv] 0-management: readv on /var/run/gluster/53a233b05f5d4be45dc94391bc3ebfe5.socket failed (No data available)
[2016-01-29 05:03:47.078282] I [MSGID: 106005] [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management: Brick gluster-stor02:/export/data/brick02 has disconnected from glusterd.
[2016-01-29 05:03:47.149161] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0() [0x3e47a079d1] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xcd) [0x405e6d] -->/usr/sbin/glusterd(cleanup_and_exit+0x65) [0x4059d5] ) 0-: received signum (15), shutting down
[2016-01-29 05:03:54.067012] I [MSGID: 100030] [glusterfsd.c:2318:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6 (args: /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2016-01-29 05:03:54.071901] I [MSGID: 106478] [glusterd.c:1350:init] 0-management: Maximum allowed open file descriptors set to 65536
[2016-01-29 05:03:54.071935] I [MSGID: 106479] [glusterd.c:1399:init] 0-management: Using /var/lib/glusterd as working directory
[2016-01-29 05:03:54.075655] E [rpc-transport.c:292:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.7.6/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2016-01-29 05:03:54.075672] W [rpc-transport.c:296:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2016-01-29 05:03:54.075680] W [rpcsvc.c:1597:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2016-01-29 05:03:54.075687] E [MSGID: 106243] [glusterd.c:1623:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2016-01-29 05:03:55.869717] I [MSGID: 106513] [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30702
[2016-01-29 05:03:55.995747] I [MSGID: 106498] [glusterd-handler.c:3579:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2016-01-29 05:03:55.995866] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2016-01-29 05:03:56.000937] I [MSGID: 106544] [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID: 9b103ea8-d248-44fc-8f80-3e87f7c4971c
Final graph:
+------------------------------------------------------------------------------+
1: volume management
2: type mgmt/glusterd
3: option rpc-auth.auth-glusterfs on
4: option rpc-auth.auth-unix on
5: option rpc-auth.auth-null on
6: option rpc-auth-allow-insecure on
7: option transport.socket.listen-backlog 128
8: option ping-timeout 30
9: option transport.socket.read-fail-log off
10: option transport.socket.keepalive-interval 2
11: option transport.socket.keepalive-time 10
12: option transport-type rdma
13: option working-directory /var/lib/glusterd
14: end-volume
15:
+------------------------------------------------------------------------------+
[2016-01-29 05:03:56.002570] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2016-01-29 05:03:56.003098] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-01-29 05:03:56.003158] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2016-01-29 05:03:56.855628] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 388a8bb4-c530-44ff-838b-8f7b9e4c95db, host: 10.1.1.10, port: 0
[2016-01-29 05:03:56.856787] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2016-01-29 05:03:57.859093] I [MSGID: 106540] [glusterd-utils.c:4191:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV3 successfully
[2016-01-29 05:03:57.860228] I [MSGID: 106540] [glusterd-utils.c:4200:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV1 successfully
[2016-01-29 05:03:57.861329] I [MSGID: 106540] [glusterd-utils.c:4209:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NFSV3 successfully
[2016-01-29 05:03:57.862421] I [MSGID: 106540] [glusterd-utils.c:4218:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v4 successfully
[2016-01-29 05:03:57.863510] I [MSGID: 106540] [glusterd-utils.c:4227:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v1 successfully
[2016-01-29 05:03:57.864600] I [MSGID: 106540] [glusterd-utils.c:4236:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered ACL v3 successfully
[2016-01-29 05:03:57.870948] W [socket.c:3009:socket_connect] 0-nfs: Ignore failed connection attempt on , (No such file or directory)
From glustershd.log:

[2016-01-29 05:03:47.075614] W [socket.c:588:__socket_rwv] 0-data02-client-1: readv on 10.1.1.10:49155 failed (No data available)
[2016-01-29 05:03:47.076871] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from data02-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2016-01-29 05:03:47.170284] W [socket.c:588:__socket_rwv] 0-glusterfs: readv on 127.0.0.1:24007 failed (No data available)
[2016-01-29 05:03:47.639163] W [socket.c:588:__socket_rwv] 0-data02-client-0: readv on 10.1.1.11:49153 failed (No data available)
[2016-01-29 05:03:47.639206] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from data02-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2016-01-29 05:03:47.640222] E [MSGID: 108006] [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2016-01-29 05:03:57.872983] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0() [0x3e47a079d1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e6d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059d5] ) 0-: received signum (15), shutting down
[2016-01-29 05:03:58.881541] I [MSGID: 100030] [glusterfsd.c:2318:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.6 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/8d72de580ccac07d2ecfc2491a9b1648.socket --xlator-option *replicate*.node-uuid=9b103ea8-d248-44fc-8f80-3e87f7c4971c)
[2016-01-29 05:03:58.890833] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-01-29 05:03:59.340030] I [graph.c:269:gf_add_cmdline_options] 0-data02-replicate-0: adding option 'node-uuid' for volume 'data02-replicate-0' with value '9b103ea8-d248-44fc-8f80-3e87f7c4971c'
[2016-01-29 05:03:59.342682] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2016-01-29 05:03:59.342742] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3
[2016-01-29 05:03:59.342827] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4
[2016-01-29 05:03:59.342892] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 5
[2016-01-29 05:03:59.342917] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 6
[2016-01-29 05:03:59.343563] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 8
[2016-01-29 05:03:59.343569] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 7
[2016-01-29 05:03:59.343657] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 9
[2016-01-29 05:03:59.343705] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 11
[2016-01-29 05:03:59.343710] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 10
[2016-01-29 05:03:59.344278] I [MSGID: 114020] [client.c:2118:notify] 0-data02-client-0: parent translators are ready, attempting connect on transport
[2016-01-29 05:03:59.346553] I [MSGID: 114020] [client.c:2118:notify] 0-data02-client-1: parent translators are ready, attempting connect on transport
Final graph:
+------------------------------------------------------------------------------+
1: volume data02-client-0
2: type protocol/client
3: option ping-timeout 42
4: option remote-host gluster-stor01
5: option remote-subvolume /export/data/brick02
6: option transport-type socket
7: option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452
8: option password 66b85782-5833-4f2d-ad0e-8de75247b094
9: option event-threads 11
10: end-volume
11:
12: volume data02-client-1
13: type protocol/client
14: option ping-timeout 42
15: option remote-host gluster-stor02
16: option remote-subvolume /export/data/brick02
17: option transport-type socket
18: option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452
19: option password 66b85782-5833-4f2d-ad0e-8de75247b094
20: option event-threads 11
21: end-volume
22:
23: volume data02-replicate-0
24: type cluster/replicate
25: option node-uuid 9b103ea8-d248-44fc-8f80-3e87f7c4971c
26: option background-self-heal-count 0
27: option metadata-self-heal on
28: option data-self-heal on
29: option entry-self-heal on
30: option self-heal-daemon enable
31: option iam-self-heal-daemon yes
32: subvolumes data02-client-0 data02-client-1
33: end-volume
34:
35: volume glustershd
36: type debug/io-stats
37: subvolumes data02-replicate-0
38: end-volume
39:
+------------------------------------------------------------------------------+
[2016-01-29 05:03:59.348913] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-01-29 05:03:59.348960] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from data02-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2016-01-29 05:03:59.436909] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-01-29 05:03:59.436974] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from data02-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2016-01-29 05:03:59.436991] E [MSGID: 108006] [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2016-01-29 05:04:02.886317] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-data02-client-0: changing port to 49153 (from 0)
[2016-01-29 05:04:02.888761] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-data02-client-1: changing port to 49155 (from 0)
[2016-01-29 05:04:02.891105] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-data02-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-01-29 05:04:02.891360] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-0: Connected to data02-client-0, attached to remote volume '/export/data/brick02'.
[2016-01-29 05:04:02.891373] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2016-01-29 05:04:02.891403] I [MSGID: 108005] [afr-common.c:3841:afr_notify] 0-data02-replicate-0: Subvolume 'data02-client-0' came back up; going online.
[2016-01-29 05:04:02.891518] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-0: Server lk version = 1
[2016-01-29 05:04:02.893074] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-data02-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-01-29 05:04:02.893251] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-1: Connected to data02-client-1, attached to remote volume '/export/data/brick02'.
[2016-01-29 05:04:02.893276] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2016-01-29 05:04:02.893401] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-1: Server lk version = 1
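One detail that stands out: both daemons exit on signum (15), i.e. SIGTERM, which suggests something is deliberately stopping them rather than the processes crashing. One thing I'm planning to try, as a sketch assuming auditd is available on these boxes, is an audit rule to record who sends the signal (the filename below is just my choice):

```
# /etc/audit/rules.d/gluster-sigterm.rules
# Record every kill(2) syscall delivering signal 15, keyed for searching with:
#   ausearch -k gluster_sigterm   (records include the sender's PID and comm)
-a always,exit -F arch=b64 -S kill -F a1=15 -k gluster_sigterm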
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel