Initially I suspected server-quorum to be the culprit, but that is not the case. By any chance, is your network flaky?

On 02/01/2016 10:33 PM, Logan Barfield wrote:
> Volume Name: data02
> Type: Replicate
> Volume ID: 1c8928b1-f49e-4950-be06-0f8ce5adf870
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: gluster-stor01:/export/data/brick02 <-- 10.1.1.10
> Brick2: gluster-stor02:/export/data/brick02 <-- 10.1.1.11
> Options Reconfigured:
> server.event-threads: 5
> client.event-threads: 11
> geo-replication.indexing: on
> geo-replication.ignore-pid-check: on
> changelog.changelog: on
> server.statedump-path: /tmp
> server.outstanding-rpc-limit: 128
> performance.io-thread-count: 64
> performance.nfs.read-ahead: on
> performance.nfs.io-cache: on
> performance.nfs.quick-read: on
> performance.cache-max-file-size: 1MB
> performance.client-io-threads: on
> cluster.lookup-optimize: on
> performance.cache-size: 1073741824
> performance.write-behind-window-size: 4MB
> performance.nfs.write-behind-window-size: 4MB
> performance.read-ahead: off
> performance.nfs.stat-prefetch: on
>
>
> Status of volume: data02
> Gluster process                              TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick gluster-stor01:/export/data/brick02    49153     0          Y       17411
> Brick gluster-stor02:/export/data/brick02    49155     0          Y       4717
> NFS Server on localhost                      2049      0          Y       17395
> Self-heal Daemon on localhost                N/A       N/A        Y       17405
> NFS Server on gluster-stor02                 2049      0          Y       4701
> Self-heal Daemon on gluster-stor02           N/A       N/A        Y       4712
>
> Task Status of Volume data02
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> Note that this problem was occurring with the same frequency before we
> added all of the volume options above. We were running defaults up until
> last week, and changing them had no impact on this particular problem.
>
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting
>
> On Fri, Jan 29, 2016 at 9:28 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>
> Could you paste the output of 'gluster volume info'?
>
> ~Atin
>
> On 01/29/2016 11:59 PM, Logan Barfield wrote:
> > We're running a fairly large 2-replica volume across two servers. The
> > volume is approximately 20TB of small 1K-4MB files. The volume is
> > exported via NFS and mounted remotely by two clients.
> >
> > For the past few weeks the Gluster brick processes have been randomly
> > restarting. Luckily they've been doing so at non-peak times, so we
> > didn't notice until our monitoring checks happened to pick up on
> > zombied 'glusterfs' processes.
> >
> > From the logs it looks like something is blocking communication to the
> > brick processes, and Gluster automatically restarts everything to
> > compensate. I've so far not been able to figure out the underlying cause.
> >
> > I've included log snippets from 'glustershd.log' and
> > 'etc-glusterfs-glusterd.vol.log' here. If anyone can provide some
> > insight into the issue it would be greatly appreciated. I'll also be
> > happy to provide any further details as needed.
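To follow up on the network question: it may be worth watching the link between the two storage nodes around the time of one of these restarts. The "readv ... failed (No data available)" messages in the logs below only tell you that the TCP connection to the peer went away; by themselves they do not say whether the peer process died or the network dropped the link. A rough sketch of checks to run on both gluster-stor01 and gluster-stor02 (eth0 here is only a placeholder for whichever interface carries the 10.1.1.x traffic):

    # NIC-level error and drop counters
    ethtool -S eth0 | grep -iE 'err|drop'

    # TCP retransmission counters; values climbing around a restart point at the network
    netstat -s | grep -i retrans

    # sustained latency/loss probe between the peers
    ping -i 0.2 -c 1000 10.1.1.11

If these stay clean while the bricks still drop, the cause is more likely local to the servers than on the wire.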
> >
> > [2016-01-29 05:03:47.039886] I [MSGID: 106144] [glusterd-pmap.c:274:pmap_registry_remove] 0-pmap: removing brick /export/data/brick02 on port 49155
> > [2016-01-29 05:03:47.075521] W [socket.c:588:__socket_rwv] 0-management: readv on /var/run/gluster/53a233b05f5d4be45dc94391bc3ebfe5.socket failed (No data available)
> > [2016-01-29 05:03:47.078282] I [MSGID: 106005] [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management: Brick gluster-stor02:/export/data/brick02 has disconnected from glusterd.
> > [2016-01-29 05:03:47.149161] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0() [0x3e47a079d1] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xcd) [0x405e6d] -->/usr/sbin/glusterd(cleanup_and_exit+0x65) [0x4059d5] ) 0-: received signum (15), shutting down
> > [2016-01-29 05:03:54.067012] I [MSGID: 100030] [glusterfsd.c:2318:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6 (args: /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
> > [2016-01-29 05:03:54.071901] I [MSGID: 106478] [glusterd.c:1350:init] 0-management: Maximum allowed open file descriptors set to 65536
> > [2016-01-29 05:03:54.071935] I [MSGID: 106479] [glusterd.c:1399:init] 0-management: Using /var/lib/glusterd as working directory
> > [2016-01-29 05:03:54.075655] E [rpc-transport.c:292:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.7.6/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
> > [2016-01-29 05:03:54.075672] W [rpc-transport.c:296:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
> > [2016-01-29 05:03:54.075680] W [rpcsvc.c:1597:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
> > [2016-01-29 05:03:54.075687] E [MSGID: 106243] [glusterd.c:1623:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
> > [2016-01-29 05:03:55.869717] I [MSGID: 106513] [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30702
> > [2016-01-29 05:03:55.995747] I [MSGID: 106498] [glusterd-handler.c:3579:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
> > [2016-01-29 05:03:55.995866] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
> > [2016-01-29 05:03:56.000937] I [MSGID: 106544] [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID: 9b103ea8-d248-44fc-8f80-3e87f7c4971c
> > Final graph:
> > +------------------------------------------------------------------------------+
> >   1: volume management
> >   2:     type mgmt/glusterd
> >   3:     option rpc-auth.auth-glusterfs on
> >   4:     option rpc-auth.auth-unix on
> >   5:     option rpc-auth.auth-null on
> >   6:     option rpc-auth-allow-insecure on
> >   7:     option transport.socket.listen-backlog 128
> >   8:     option ping-timeout 30
> >   9:     option transport.socket.read-fail-log off
> >  10:     option transport.socket.keepalive-interval 2
> >  11:     option transport.socket.keepalive-time 10
> >  12:     option transport-type rdma
> >  13:     option working-directory /var/lib/glusterd
> >  14: end-volume
> >  15:
> > +------------------------------------------------------------------------------+
> > [2016-01-29 05:03:56.002570] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
> > [2016-01-29 05:03:56.003098] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> > [2016-01-29 05:03:56.003158] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
> > [2016-01-29 05:03:56.855628] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 388a8bb4-c530-44ff-838b-8f7b9e4c95db, host: 10.1.1.10, port: 0
> > [2016-01-29 05:03:56.856787] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
> > [2016-01-29 05:03:57.859093] I [MSGID: 106540] [glusterd-utils.c:4191:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV3 successfully
> > [2016-01-29 05:03:57.860228] I [MSGID: 106540] [glusterd-utils.c:4200:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV1 successfully
> > [2016-01-29 05:03:57.861329] I [MSGID: 106540] [glusterd-utils.c:4209:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NFSV3 successfully
> > [2016-01-29 05:03:57.862421] I [MSGID: 106540] [glusterd-utils.c:4218:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v4 successfully
> > [2016-01-29 05:03:57.863510] I [MSGID: 106540] [glusterd-utils.c:4227:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v1 successfully
> > [2016-01-29 05:03:57.864600] I [MSGID: 106540] [glusterd-utils.c:4236:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered ACL v3 successfully
> > [2016-01-29 05:03:57.870948] W [socket.c:3009:socket_connect] 0-nfs: Ignore failed connection attempt on , (No such file or directory)
> >
> >
> >
> > [2016-01-29 05:03:47.075614] W [socket.c:588:__socket_rwv] 0-data02-client-1: readv on 10.1.1.10:49155 failed (No data available)
> > [2016-01-29 05:03:47.076871] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from data02-client-1. Client process will keep trying to connect to glusterd until brick's port is available
> > [2016-01-29 05:03:47.170284] W [socket.c:588:__socket_rwv] 0-glusterfs: readv on 127.0.0.1:24007 failed (No data available)
> > [2016-01-29 05:03:47.639163] W [socket.c:588:__socket_rwv] 0-data02-client-0: readv on 10.1.1.11:49153 failed (No data available)
> > [2016-01-29 05:03:47.639206] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from data02-client-0. Client process will keep trying to connect to glusterd until brick's port is available
> > [2016-01-29 05:03:47.640222] E [MSGID: 108006] [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
> > [2016-01-29 05:03:57.872983] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0() [0x3e47a079d1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e6d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059d5] ) 0-: received signum (15), shutting down
> > [2016-01-29 05:03:58.881541] I [MSGID: 100030] [glusterfsd.c:2318:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.6 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/8d72de580ccac07d2ecfc2491a9b1648.socket --xlator-option *replicate*.node-uuid=9b103ea8-d248-44fc-8f80-3e87f7c4971c)
> > [2016-01-29 05:03:58.890833] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> > [2016-01-29 05:03:59.340030] I [graph.c:269:gf_add_cmdline_options] 0-data02-replicate-0: adding option 'node-uuid' for volume 'data02-replicate-0' with value '9b103ea8-d248-44fc-8f80-3e87f7c4971c'
> > [2016-01-29 05:03:59.342682] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
> > [2016-01-29 05:03:59.342742] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3
> > [2016-01-29 05:03:59.342827] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4
> > [2016-01-29 05:03:59.342892] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 5
> > [2016-01-29 05:03:59.342917] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 6
> > [2016-01-29 05:03:59.343563] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 8
> > [2016-01-29 05:03:59.343569] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 7
> > [2016-01-29 05:03:59.343657] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 9
> > [2016-01-29 05:03:59.343705] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 11
> > [2016-01-29 05:03:59.343710] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 10
> > [2016-01-29 05:03:59.344278] I [MSGID: 114020] [client.c:2118:notify] 0-data02-client-0: parent translators are ready, attempting connect on transport
> > [2016-01-29 05:03:59.346553] I [MSGID: 114020] [client.c:2118:notify] 0-data02-client-1: parent translators are ready, attempting connect on transport
> > Final graph:
> > +------------------------------------------------------------------------------+
> >   1: volume data02-client-0
> >   2:     type protocol/client
> >   3:     option ping-timeout 42
> >   4:     option remote-host gluster-stor01
> >   5:     option remote-subvolume /export/data/brick02
> >   6:     option transport-type socket
> >   7:     option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452
> >   8:     option password 66b85782-5833-4f2d-ad0e-8de75247b094F
> >   9:     option event-threads 11
> >  10: end-volume
> >  11:
> >  12: volume data02-client-1
> >  13:     type protocol/client
> >  14:     option ping-timeout 42
> >  15:     option remote-host gluster-stor02
> >  16:     option remote-subvolume /export/data/brick02
> >  17:     option transport-type socket
> >  18:     option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452
> >  19:     option password 66b85782-5833-4f2d-ad0e-8de75247b094
> >  20:     option event-threads 11
> >  21: end-volume
> >  22:
> >  23: volume data02-replicate-0
> >  24:     type cluster/replicate
> >  25:     option node-uuid 9b103ea8-d248-44fc-8f80-3e87f7c4971c
> >  26:     option background-self-heal-count 0
> >  27:     option metadata-self-heal on
> >  28:     option data-self-heal on
> >  29:     option entry-self-heal on
> >  30:     option self-heal-daemon enable
> >  31:     option iam-self-heal-daemon yes
> >  32:     subvolumes data02-client-0 data02-client-1
> >  33: end-volume
> >  34:
> >  35: volume glustershd
> >  36:     type debug/io-stats
> >  37:     subvolumes data02-replicate-0
> >  38: end-volume
> >  39:
> > +------------------------------------------------------------------------------+
> > [2016-01-29 05:03:59.348913] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
> > [2016-01-29 05:03:59.348960] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from data02-client-1. Client process will keep trying to connect to glusterd until brick's port is available
> > [2016-01-29 05:03:59.436909] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
> > [2016-01-29 05:03:59.436974] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from data02-client-0. Client process will keep trying to connect to glusterd until brick's port is available
> > [2016-01-29 05:03:59.436991] E [MSGID: 108006] [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
> > [2016-01-29 05:04:02.886317] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-data02-client-0: changing port to 49153 (from 0)
> > [2016-01-29 05:04:02.888761] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-data02-client-1: changing port to 49155 (from 0)
> > [2016-01-29 05:04:02.891105] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-data02-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > [2016-01-29 05:04:02.891360] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-0: Connected to data02-client-0, attached to remote volume '/export/data/brick02'.
> > [2016-01-29 05:04:02.891373] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-0: Server and Client lk-version numbers are not same, reopening the fds
> > [2016-01-29 05:04:02.891403] I [MSGID: 108005] [afr-common.c:3841:afr_notify] 0-data02-replicate-0: Subvolume 'data02-client-0' came back up; going online.
> > [2016-01-29 05:04:02.891518] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-0: Server lk version = 1
> > [2016-01-29 05:04:02.893074] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-data02-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > [2016-01-29 05:04:02.893251] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-1: Connected to data02-client-1, attached to remote volume '/export/data/brick02'.
> > [2016-01-29 05:04:02.893276] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-1: Server and Client lk-version numbers are not same, reopening the fds
> > [2016-01-29 05:04:02.893401] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-1: Server lk version = 1
> >
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxxx
> > http://www.gluster.org/mailman/listinfo/gluster-devel
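One more data point worth collecting: both glusterd and the self-heal daemon in the logs above exit with "received signum (15)", i.e. they are handed a SIGTERM rather than crashing on their own. If it is not obvious what is sending it (an init script, a package update, a configuration-management or monitoring agent, etc.), an audit rule on the kill syscall can identify the sender. A minimal sketch, assuming auditd is installed and running; the key name gluster-sigterm is arbitrary:

    # record every kill() delivering signal 15 (a1 is the signal argument on x86_64)
    auditctl -a always,exit -F arch=b64 -S kill -F a1=15 -k gluster-sigterm

    # after the next brick restart, show which PID/command issued the signal
    ausearch -k gluster-sigterm -i

That should make it clear whether glusterd is tearing things down itself after losing the brick, or whether something outside Gluster is killing the processes first.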