Re: Losing connection to bricks, Gluster processes restarting

Logan Barfield <lbarfield@xxxxxxxxxxxxx> · Mon, 1 Feb 2016 12:03:46 -0500

Volume Name: data02
Type: Replicate
Volume ID: 1c8928b1-f49e-4950-be06-0f8ce5adf870
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster-stor01:/export/data/brick02   <-- 10.1.1.10
Brick2: gluster-stor02:/export/data/brick02   <-- 10.1.1.11
Options Reconfigured:
server.event-threads: 5
client.event-threads: 11
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
server.statedump-path: /tmp
server.outstanding-rpc-limit: 128
performance.io-thread-count: 64
performance.nfs.read-ahead: on
performance.nfs.io-cache: on
performance.nfs.quick-read: on
performance.cache-max-file-size: 1MB
performance.client-io-threads: on
cluster.lookup-optimize: on
performance.cache-size: 1073741824
performance.write-behind-window-size: 4MB
performance.nfs.write-behind-window-size: 4MB
performance.read-ahead: off
performance.nfs.stat-prefetch: on

Status of volume: data02
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster-stor01:/export/data/brick02      49153     0          Y       17411
Brick gluster-stor02:/export/data/brick02      49155     0          Y       4717 
NFS Server on localhost                     2049      0          Y       17395
Self-heal Daemon on localhost               N/A       N/A        Y       17405
NFS Server on gluster-stor02                   2049      0          Y       4701 
Self-heal Daemon on gluster-stor02             N/A       N/A        Y       4712 

Task Status of Volume data02
------------------------------------------------------------------------------
There are no active volume tasks

Note that this problem was occurring with the same frequency before we added all of the volume options above.  We were running defaults up until last week, and changing them had no impact on this particular problem.

Thank You,

Logan Barfield
Tranquil Hosting

On Fri, Jan 29, 2016 at 9:28 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
Could you paste output of gluster volume info?

~Atin

On 01/29/2016 11:59 PM, Logan Barfield wrote:

> We're running a fairly large 2-replica volume across two servers.  The

> volume is approximately 20TB of small 1K-4MB files.  The volume is

> exported via NFS, and mounted remotely by two clients.

>

> For the past few weeks the Gluster brick processes have been randomly

> restarting.  Luckily they've been doing so at non-peak times, so we

> didn't notice until our monitoring checks happened to pick up on zombied

> 'glusterfs' process.

>

> From the logs it looks like something is blocking communication to the

> brick processes, and Gluster automatically restarts everything to

> compensate.  I've so far not been able to figure out the underlying cause.

>

> I've included log snippets from 'glustershd.log' and

> 'etc-glusterfs-glusterd.vol.log' here.  If anyone can provide some

> insight into the issue it would be greatly appreciated.  I'll also be

> happy to provide any further details as needed.

>

>

> [2016-01-29 05:03:47.039886] I [MSGID: 106144]

> [glusterd-pmap.c:274:pmap_registry_remove] 0-pmap: removing brick

> /export/data/brick02 on port 49155

> [2016-01-29 05:03:47.075521] W [socket.c:588:__socket_rwv] 0-management:

> readv on /var/run/gluster/53a233b05f5d4be45dc94391bc3ebfe5.socket failed

> (No data available)

> [2016-01-29 05:03:47.078282] I [MSGID: 106005]

> [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:

> Brick gluster-stor02:/export/data/brick02 has disconnected from glusterd.

> [2016-01-29 05:03:47.149161] W [glusterfsd.c:1236:cleanup_and_exit]

> (-->/lib64/libpthread.so.0() [0x3e47a079d1]

> -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xcd) [0x405e6d]

> -->/usr/sbin/glusterd(cleanup_and_exit+0x65) [0x4059d5] ) 0-: recei

> ved signum (15), shutting down

> [2016-01-29 05:03:54.067012] I [MSGID: 100030] [glusterfsd.c:2318:main]

> 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6

> (args: /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)

> [2016-01-29 05:03:54.071901] I [MSGID: 106478] [glusterd.c:1350:init]

> 0-management: Maximum allowed open file descriptors set to 65536

> [2016-01-29 05:03:54.071935] I [MSGID: 106479] [glusterd.c:1399:init]

> 0-management: Using /var/lib/glusterd as working directory

> [2016-01-29 05:03:54.075655] E [rpc-transport.c:292:rpc_transport_load]

> 0-rpc-transport: /usr/lib64/glusterfs/3.7.6/rpc-transport/rdma.so:

> cannot open shared object file: No such file or directory

> [2016-01-29 05:03:54.075672] W [rpc-transport.c:296:rpc_transport_load]

> 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not

> valid or not found on this machine

> [2016-01-29 05:03:54.075680] W [rpcsvc.c:1597:rpcsvc_transport_create]

> 0-rpc-service: cannot create listener, initing the transport failed

> [2016-01-29 05:03:54.075687] E [MSGID: 106243] [glusterd.c:1623:init]

> 0-management: creation of 1 listeners failed, continuing with succeeded

> transport

> [2016-01-29 05:03:55.869717] I [MSGID: 106513]

> [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd:

> retrieved op-version: 30702

> [2016-01-29 05:03:55.995747] I [MSGID: 106498]

> [glusterd-handler.c:3579:glusterd_friend_add_from_peerinfo]

> 0-management: connect returned 0

> [2016-01-29 05:03:55.995866] I [rpc-clnt.c:984:rpc_clnt_connection_init]

> 0-management: setting frame-timeout to 600

> [2016-01-29 05:03:56.000937] I [MSGID: 106544]

> [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID:

> 9b103ea8-d248-44fc-8f80-3e87f7c4971c

> Final graph:

> +------------------------------------------------------------------------------+

>   1: volume management

>   2:     type mgmt/glusterd

>   3:     option rpc-auth.auth-glusterfs on

>   4:     option rpc-auth.auth-unix on

>   5:     option rpc-auth.auth-null on

>   6:     option rpc-auth-allow-insecure on

>   7:     option transport.socket.listen-backlog 128

>   8:     option ping-timeout 30

>   9:     option transport.socket.read-fail-log off

>  10:     option transport.socket.keepalive-interval 2

>  11:     option transport.socket.keepalive-time 10

>  12:     option transport-type rdma

>  13:     option working-directory /var/lib/glusterd

>  14: end-volume

>  15:

> +------------------------------------------------------------------------------+

> [2016-01-29 05:03:56.002570] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 2

> [2016-01-29 05:03:56.003098] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 1

> [2016-01-29 05:03:56.003158] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 2

> [2016-01-29 05:03:56.855628] I [MSGID: 106493]

> [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received

> ACC from uuid: 388a8bb4-c530-44ff-838b-8f7b9e4c95db, host: 10.1.1.10,

> port: 0

> [2016-01-29 05:03:56.856787] I [rpc-clnt.c:984:rpc_clnt_connection_init]

> 0-nfs: setting frame-timeout to 600

> [2016-01-29 05:03:57.859093] I [MSGID: 106540]

> [glusterd-utils.c:4191:glusterd_nfs_pmap_deregister] 0-glusterd:

> De-registered MOUNTV3 successfully

> [2016-01-29 05:03:57.860228] I [MSGID: 106540]

> [glusterd-utils.c:4200:glusterd_nfs_pmap_deregister] 0-glusterd:

> De-registered MOUNTV1 successfully

> [2016-01-29 05:03:57.861329] I [MSGID: 106540]

> [glusterd-utils.c:4209:glusterd_nfs_pmap_deregister] 0-glusterd:

> De-registered NFSV3 successfully

> [2016-01-29 05:03:57.862421] I [MSGID: 106540]

> [glusterd-utils.c:4218:glusterd_nfs_pmap_deregister] 0-glusterd:

> De-registered NLM v4 successfully

> [2016-01-29 05:03:57.863510] I [MSGID: 106540]

> [glusterd-utils.c:4227:glusterd_nfs_pmap_deregister] 0-glusterd:

> De-registered NLM v1 successfully

> [2016-01-29 05:03:57.864600] I [MSGID: 106540]

> [glusterd-utils.c:4236:glusterd_nfs_pmap_deregister] 0-glusterd:

> De-registered ACL v3 successfully

> [2016-01-29 05:03:57.870948] W [socket.c:3009:socket_connect] 0-nfs:

> Ignore failed connection attempt on , (No such file or directory)

>

>

>

>

>

>

>

> [2016-01-29 05:03:47.075614] W [socket.c:588:__socket_rwv]

> 0-data02-client-1: readv on 10.1.1.10:49155

> <http://10.1.1.10:49155/> failed (No data available)

> [2016-01-29 05:03:47.076871] I [MSGID: 114018]

> [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from

> data02-client-1. Client process will keep trying to connect to glusterd

> until brick's port is available

> [2016-01-29 05:03:47.170284] W [socket.c:588:__socket_rwv] 0-glusterfs:

> readv on 127.0.0.1:24007 <http://127.0.0.1:24007/> failed (No data

> available)

> [2016-01-29 05:03:47.639163] W [socket.c:588:__socket_rwv]

> 0-data02-client-0: readv on 10.1.1.11:49153

> <http://10.1.1.11:49153/> failed (No data available)

> [2016-01-29 05:03:47.639206] I [MSGID: 114018]

> [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from

> data02-client-0. Client process will keep trying to connect to glusterd

> until brick's port is available

> [2016-01-29 05:03:47.640222] E [MSGID: 108006]

> [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All subvolumes are

> down. Going offline until atleast one of them comes back up.

> [2016-01-29 05:03:57.872983] W [glusterfsd.c:1236:cleanup_and_exit]

> (-->/lib64/libpthread.so.0() [0x3e47a079d1]

> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e6d]

> -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059d5] ) 0-: rec

> eived signum (15), shutting down

> [2016-01-29 05:03:58.881541] I [MSGID: 100030] [glusterfsd.c:2318:main]

> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.6

> (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd

> -p /var/lib/glusterd/

> glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S

> /var/run/gluster/8d72de580ccac07d2ecfc2491a9b1648.socket --xlator-option

> *replicate*.node-uuid=9b103ea8-d248-44fc-8f80-3e87f7c4971c)

> [2016-01-29 05:03:58.890833] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 1

> [2016-01-29 05:03:59.340030] I [graph.c:269:gf_add_cmdline_options]

> 0-data02-replicate-0: adding option 'node-uuid' for volume

> 'data02-replicate-0' with value '9b103ea8-d248-44fc-8f80-3e87f7c4971c'

> [2016-01-29 05:03:59.342682] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 2

> [2016-01-29 05:03:59.342742] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 3

> [2016-01-29 05:03:59.342827] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 4

> [2016-01-29 05:03:59.342892] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 5

> [2016-01-29 05:03:59.342917] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 6

> [2016-01-29 05:03:59.343563] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 8

> [2016-01-29 05:03:59.343569] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 7

> [2016-01-29 05:03:59.343657] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 9

> [2016-01-29 05:03:59.343705] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 11

> [2016-01-29 05:03:59.343710] I [MSGID: 101190]

> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread

> with index 10

> [2016-01-29 05:03:59.344278] I [MSGID: 114020] [client.c:2118:notify]

> 0-data02-client-0: parent translators are ready, attempting connect on

> transport

> [2016-01-29 05:03:59.346553] I [MSGID: 114020] [client.c:2118:notify]

> 0-data02-client-1: parent translators are ready, attempting connect on

> transport

> Final graph:

> +------------------------------------------------------------------------------+

>   1: volume data02-client-0

>   2:     type protocol/client

>   3:     option ping-timeout 42

>   4:     option remote-host gluster-stor01

>   5:     option remote-subvolume /export/data/brick02

>   6:     option transport-type socket

>   7:     option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452

>   8:     option password 66b85782-5833-4f2d-ad0e-8de75247b094F

>   9:     option event-threads 11

>  10: end-volume

>  11:

>  12: volume data02-client-1

>  13:     type protocol/client

>  14:     option ping-timeout 42

>  15:     option remote-host gluster-stor02

>  16:     option remote-subvolume /export/data/brick02

>  17:     option transport-type socket

>  18:     option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452

>  19:     option password 66b85782-5833-4f2d-ad0e-8de75247b094

>  20:     option event-threads 11

>  21: end-volume

>  22:

>  23: volume data02-replicate-0

>  24:     type cluster/replicate

>  25:     option node-uuid 9b103ea8-d248-44fc-8f80-3e87f7c4971c

>  26:     option background-self-heal-count 0

>  27:     option metadata-self-heal on

>  28:     option data-self-heal on

>  29:     option entry-self-heal on

>  30:     option self-heal-daemon enable

>  31:     option iam-self-heal-daemon yes

>  32:     subvolumes data02-client-0 data02-client-1

>  33: end-volume

>  34:

>  35: volume glustershd

>  36:     type debug/io-stats

>  37:     subvolumes data02-replicate-0

>  38: end-volume

>  39:

> +------------------------------------------------------------------------------+

> [2016-01-29 05:03:59.348913] E [MSGID: 114058]

> [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-1:

> failed to get the port number for remote subvolume. Please run 'gluster

> volume status' on server to see if brick process

> is running.

> [2016-01-29 05:03:59.348960] I [MSGID: 114018]

> [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from

> data02-client-1. Client process will keep trying to connect to glusterd

> until brick's port is available

> [2016-01-29 05:03:59.436909] E [MSGID: 114058]

> [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-0:

> failed to get the port number for remote subvolume. Please run 'gluster

> volume status' on server to see if brick process

> is running.

> [2016-01-29 05:03:59.436974] I [MSGID: 114018]

> [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from

> data02-client-0. Client process will keep trying to connect to glusterd

> until brick's port is available

> [2016-01-29 05:03:59.436991] E [MSGID: 108006]

> [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All subvolumes are

> down. Going offline until atleast one of them comes back up.

> [2016-01-29 05:04:02.886317] I [rpc-clnt.c:1847:rpc_clnt_reconfig]

> 0-data02-client-0: changing port to 49153 (from 0)

> [2016-01-29 05:04:02.888761] I [rpc-clnt.c:1847:rpc_clnt_reconfig]

> 0-data02-client-1: changing port to 49155 (from 0)

> [2016-01-29 05:04:02.891105] I [MSGID: 114057]

> [client-handshake.c:1437:select_server_supported_programs]

> 0-data02-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)

> [2016-01-29 05:04:02.891360] I [MSGID: 114046]

> [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-0:

> Connected to data02-client-0, attached to remote volume

> '/export/data/brick02'.

> [2016-01-29 05:04:02.891373] I [MSGID: 114047]

> [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-0: Server

> and Client lk-version numbers are not same, reopening the fds

> [2016-01-29 05:04:02.891403] I [MSGID: 108005]

> [afr-common.c:3841:afr_notify] 0-data02-replicate-0: Subvolume

> 'data02-client-0' came back up; going online.

> [2016-01-29 05:04:02.891518] I [MSGID: 114035]

> [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-0:

> Server lk version = 1

> [2016-01-29 05:04:02.893074] I [MSGID: 114057]

> [client-handshake.c:1437:select_server_supported_programs]

> 0-data02-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)

> [2016-01-29 05:04:02.893251] I [MSGID: 114046]

> [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-1:

> Connected to data02-client-1, attached to remote volume

> '/export/data/brick02'.

> [2016-01-29 05:04:02.893276] I [MSGID: 114047]

> [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-1: Server

> and Client lk-version numbers are not same, reopening the fds

> [2016-01-29 05:04:02.893401] I [MSGID: 114035]

> [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-1:

> Server lk version = 1

>

>

> _______________________________________________

> Gluster-devel mailing list

> Gluster-devel@xxxxxxxxxxx

> http://www.gluster.org/mailman/listinfo/gluster-devel

>

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel