Not historically, but we are using bonding for the replication traffic between the servers. It's been stable for at least 6 months, but it's possible that one of the links in the bond has started failing.
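
For what it's worth, a quick sanity check on the bond (just a sketch, assuming the replication bond is bond0 with slaves eth0/eth1; adjust the interface names to match your setup):

# Per-slave link state and failure counters:
grep -E 'Slave Interface|MII Status|Link Failure Count' /proc/net/bonding/bond0
# NIC-level error and drop counters on each slave:
ethtool -S eth0 | grep -iE 'err|drop'
# Link up/down events in the kernel log:
dmesg | grep -iE 'bond0|eth[01].*link'
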
Would this type of restart be triggered by a loss of communication between bricks in a replica set? It seems like it would defeat one of the points of having a replicated volume if that were the case.
Thank You,
Logan Barfield
Tranquil Hosting
On Tue, Feb 2, 2016 at 12:02 AM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
Initially I suspected server-quorum to be the culprit, which is
not the case. By any chance, is your network flaky?
On 02/01/2016 10:33 PM, Logan Barfield wrote:
> Volume Name: data02
> Type: Replicate
> Volume ID: 1c8928b1-f49e-4950-be06-0f8ce5adf870
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: gluster-stor01:/export/data/brick02 <-- 10.1.1.10
> Brick2: gluster-stor02:/export/data/brick02 <-- 10.1.1.11
> Options Reconfigured:
> server.event-threads: 5
> client.event-threads: 11
> geo-replication.indexing: on
> geo-replication.ignore-pid-check: on
> changelog.changelog: on
> server.statedump-path: /tmp
> server.outstanding-rpc-limit: 128
> performance.io-thread-count: 64
> performance.nfs.read-ahead: on
> performance.nfs.io-cache: on
> performance.nfs.quick-read: on
> performance.cache-max-file-size: 1MB
> performance.client-io-threads: on
> cluster.lookup-optimize: on
> performance.cache-size: 1073741824
> performance.write-behind-window-size: 4MB
> performance.nfs.write-behind-window-size: 4MB
> performance.read-ahead: off
> performance.nfs.stat-prefetch: on
>
>
> Status of volume: data02
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick gluster-stor01:/export/data/brick02   49153     0          Y       17411
> Brick gluster-stor02:/export/data/brick02   49155     0          Y       4717
> NFS Server on localhost                     2049      0          Y       17395
> Self-heal Daemon on localhost               N/A       N/A        Y       17405
> NFS Server on gluster-stor02                2049      0          Y       4701
> Self-heal Daemon on gluster-stor02          N/A       N/A        Y       4712
>
> Task Status of Volume data02
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
>
> Note that this problem was occurring with the same frequency before we
> added all of the volume options above. We were running defaults up
> until last week, and changing them had no impact on this particular problem.
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting
>
> On Fri, Jan 29, 2016 at 9:28 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>
> Could you paste output of gluster volume info?
>
> ~Atin
>
> On 01/29/2016 11:59 PM, Logan Barfield wrote:
> > We're running a fairly large 2-replica volume across two servers. The
> > volume is approximately 20TB of small 1K-4MB files. The volume is
> > exported via NFS, and mounted remotely by two clients.
> >
> > For the past few weeks the Gluster brick processes have been randomly
> > restarting. Luckily they've been doing so at non-peak times, so we
> > didn't notice until our monitoring checks happened to pick up on a
> > zombied 'glusterfs' process.
> >
> > From the logs it looks like something is blocking communication to the
> > brick processes, and Gluster automatically restarts everything to
> > compensate. I've so far not been able to figure out the underlying
> > cause.
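> >
> > A quick way to quantify the disconnects from the logs (a sketch,
> > assuming the default log paths; the 49153/49155 brick ports are the
> > ones reported by 'gluster volume status'):
> >
> > # Count how often bricks have dropped off glusterd:
> > grep -c 'has disconnected from glusterd' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
> > # Watch for readv failures on the brick connections as they happen:
> > tail -F /var/log/glusterfs/glustershd.log | grep --line-buffered 'readv on'
> > # Confirm the brick processes are still listening on their ports:
> > ss -tlnp | grep -E ':(49153|49155)'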
> >
> > I've included log snippets from 'glustershd.log' and
> > 'etc-glusterfs-glusterd.vol.log' here. If anyone can provide some
> > insight into the issue it would be greatly appreciated. I'll also be
> > happy to provide any further details as needed.
> >
> >
> > [2016-01-29 05:03:47.039886] I [MSGID: 106144]
> > [glusterd-pmap.c:274:pmap_registry_remove] 0-pmap: removing brick
> > /export/data/brick02 on port 49155
> > [2016-01-29 05:03:47.075521] W [socket.c:588:__socket_rwv]
> 0-management:
> > readv on /var/run/gluster/53a233b05f5d4be45dc94391bc3ebfe5.socket
> failed
> > (No data available)
> > [2016-01-29 05:03:47.078282] I [MSGID: 106005]
> > [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:
> > Brick gluster-stor02:/export/data/brick02 has disconnected from
> glusterd.
> > [2016-01-29 05:03:47.149161] W [glusterfsd.c:1236:cleanup_and_exit]
> > (-->/lib64/libpthread.so.0() [0x3e47a079d1]
> > -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xcd) [0x405e6d]
> > -->/usr/sbin/glusterd(cleanup_and_exit+0x65) [0x4059d5] ) 0-: received
> > signum (15), shutting down
> > [2016-01-29 05:03:54.067012] I [MSGID: 100030]
> [glusterfsd.c:2318:main]
> > 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6
> > (args: /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
> > [2016-01-29 05:03:54.071901] I [MSGID: 106478] [glusterd.c:1350:init]
> > 0-management: Maximum allowed open file descriptors set to 65536
> > [2016-01-29 05:03:54.071935] I [MSGID: 106479] [glusterd.c:1399:init]
> > 0-management: Using /var/lib/glusterd as working directory
> > [2016-01-29 05:03:54.075655] E
> [rpc-transport.c:292:rpc_transport_load]
> > 0-rpc-transport: /usr/lib64/glusterfs/3.7.6/rpc-transport/rdma.so:
> > cannot open shared object file: No such file or directory
> > [2016-01-29 05:03:54.075672] W
> [rpc-transport.c:296:rpc_transport_load]
> > 0-rpc-transport: volume 'rdma.management': transport-type 'rdma'
> is not
> > valid or not found on this machine
> > [2016-01-29 05:03:54.075680] W [rpcsvc.c:1597:rpcsvc_transport_create]
> > 0-rpc-service: cannot create listener, initing the transport failed
> > [2016-01-29 05:03:54.075687] E [MSGID: 106243] [glusterd.c:1623:init]
> > 0-management: creation of 1 listeners failed, continuing with
> succeeded
> > transport
> > [2016-01-29 05:03:55.869717] I [MSGID: 106513]
> > [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd:
> > retrieved op-version: 30702
> > [2016-01-29 05:03:55.995747] I [MSGID: 106498]
> > [glusterd-handler.c:3579:glusterd_friend_add_from_peerinfo]
> > 0-management: connect returned 0
> > [2016-01-29 05:03:55.995866] I
> [rpc-clnt.c:984:rpc_clnt_connection_init]
> > 0-management: setting frame-timeout to 600
> > [2016-01-29 05:03:56.000937] I [MSGID: 106544]
> > [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID:
> > 9b103ea8-d248-44fc-8f80-3e87f7c4971c
> > Final graph:
> >
> +------------------------------------------------------------------------------+
> > 1: volume management
> > 2: type mgmt/glusterd
> > 3: option rpc-auth.auth-glusterfs on
> > 4: option rpc-auth.auth-unix on
> > 5: option rpc-auth.auth-null on
> > 6: option rpc-auth-allow-insecure on
> > 7: option transport.socket.listen-backlog 128
> > 8: option ping-timeout 30
> > 9: option transport.socket.read-fail-log off
> > 10: option transport.socket.keepalive-interval 2
> > 11: option transport.socket.keepalive-time 10
> > 12: option transport-type rdma
> > 13: option working-directory /var/lib/glusterd
> > 14: end-volume
> > 15:
> >
> +------------------------------------------------------------------------------+
> > [2016-01-29 05:03:56.002570] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 2
> > [2016-01-29 05:03:56.003098] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 1
> > [2016-01-29 05:03:56.003158] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 2
> > [2016-01-29 05:03:56.855628] I [MSGID: 106493]
> > [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd:
> Received
> > ACC from uuid: 388a8bb4-c530-44ff-838b-8f7b9e4c95db, host: 10.1.1.10,
> > port: 0
> > [2016-01-29 05:03:56.856787] I
> [rpc-clnt.c:984:rpc_clnt_connection_init]
> > 0-nfs: setting frame-timeout to 600
> > [2016-01-29 05:03:57.859093] I [MSGID: 106540]
> > [glusterd-utils.c:4191:glusterd_nfs_pmap_deregister] 0-glusterd:
> > De-registered MOUNTV3 successfully
> > [2016-01-29 05:03:57.860228] I [MSGID: 106540]
> > [glusterd-utils.c:4200:glusterd_nfs_pmap_deregister] 0-glusterd:
> > De-registered MOUNTV1 successfully
> > [2016-01-29 05:03:57.861329] I [MSGID: 106540]
> > [glusterd-utils.c:4209:glusterd_nfs_pmap_deregister] 0-glusterd:
> > De-registered NFSV3 successfully
> > [2016-01-29 05:03:57.862421] I [MSGID: 106540]
> > [glusterd-utils.c:4218:glusterd_nfs_pmap_deregister] 0-glusterd:
> > De-registered NLM v4 successfully
> > [2016-01-29 05:03:57.863510] I [MSGID: 106540]
> > [glusterd-utils.c:4227:glusterd_nfs_pmap_deregister] 0-glusterd:
> > De-registered NLM v1 successfully
> > [2016-01-29 05:03:57.864600] I [MSGID: 106540]
> > [glusterd-utils.c:4236:glusterd_nfs_pmap_deregister] 0-glusterd:
> > De-registered ACL v3 successfully
> > [2016-01-29 05:03:57.870948] W [socket.c:3009:socket_connect] 0-nfs:
> > Ignore failed connection attempt on , (No such file or directory)
> >
> >
> > [2016-01-29 05:03:47.075614] W [socket.c:588:__socket_rwv]
> > 0-data02-client-1: readv on 10.1.1.10:49155 failed (No data available)
> > [2016-01-29 05:03:47.076871] I [MSGID: 114018]
> > [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from
> > data02-client-1. Client process will keep trying to connect to glusterd
> > until brick's port is available
> > [2016-01-29 05:03:47.170284] W [socket.c:588:__socket_rwv] 0-glusterfs:
> > readv on 127.0.0.1:24007 failed (No data available)
> > [2016-01-29 05:03:47.639163] W [socket.c:588:__socket_rwv]
> > 0-data02-client-0: readv on 10.1.1.11:49153 failed (No data available)
> > [2016-01-29 05:03:47.639206] I [MSGID: 114018]
> > [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from
> > data02-client-0. Client process will keep trying to connect to
> glusterd
> > until brick's port is available
> > [2016-01-29 05:03:47.640222] E [MSGID: 108006]
> > [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All
> subvolumes are
> > down. Going offline until atleast one of them comes back up.
> > [2016-01-29 05:03:57.872983] W [glusterfsd.c:1236:cleanup_and_exit]
> > (-->/lib64/libpthread.so.0() [0x3e47a079d1]
> > -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e6d]
> > -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059d5] ) 0-: received
> > signum (15), shutting down
> > [2016-01-29 05:03:58.881541] I [MSGID: 100030]
> [glusterfsd.c:2318:main]
> > 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version
> 3.7.6
> > (args: /usr/sbin/glusterfs -s localhost --volfile-id
> gluster/glustershd
> > -p /var/lib/glusterd/
> > glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S
> > /var/run/gluster/8d72de580ccac07d2ecfc2491a9b1648.socket
> --xlator-option
> > *replicate*.node-uuid=9b103ea8-d248-44fc-8f80-3e87f7c4971c)
> > [2016-01-29 05:03:58.890833] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 1
> > [2016-01-29 05:03:59.340030] I [graph.c:269:gf_add_cmdline_options]
> > 0-data02-replicate-0: adding option 'node-uuid' for volume
> > 'data02-replicate-0' with value '9b103ea8-d248-44fc-8f80-3e87f7c4971c'
> > [2016-01-29 05:03:59.342682] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 2
> > [2016-01-29 05:03:59.342742] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 3
> > [2016-01-29 05:03:59.342827] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 4
> > [2016-01-29 05:03:59.342892] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 5
> > [2016-01-29 05:03:59.342917] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 6
> > [2016-01-29 05:03:59.343563] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 8
> > [2016-01-29 05:03:59.343569] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 7
> > [2016-01-29 05:03:59.343657] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 9
> > [2016-01-29 05:03:59.343705] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 11
> > [2016-01-29 05:03:59.343710] I [MSGID: 101190]
> > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
> thread
> > with index 10
> > [2016-01-29 05:03:59.344278] I [MSGID: 114020] [client.c:2118:notify]
> > 0-data02-client-0: parent translators are ready, attempting connect on
> > transport
> > [2016-01-29 05:03:59.346553] I [MSGID: 114020] [client.c:2118:notify]
> > 0-data02-client-1: parent translators are ready, attempting connect on
> > transport
> > Final graph:
> >
> +------------------------------------------------------------------------------+
> > 1: volume data02-client-0
> > 2: type protocol/client
> > 3: option ping-timeout 42
> > 4: option remote-host gluster-stor01
> > 5: option remote-subvolume /export/data/brick02
> > 6: option transport-type socket
> > 7: option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452
> > 8: option password 66b85782-5833-4f2d-ad0e-8de75247b094F
> > 9: option event-threads 11
> > 10: end-volume
> > 11:
> > 12: volume data02-client-1
> > 13: type protocol/client
> > 14: option ping-timeout 42
> > 15: option remote-host gluster-stor02
> > 16: option remote-subvolume /export/data/brick02
> > 17: option transport-type socket
> > 18: option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452
> > 19: option password 66b85782-5833-4f2d-ad0e-8de75247b094
> > 20: option event-threads 11
> > 21: end-volume
> > 22:
> > 23: volume data02-replicate-0
> > 24: type cluster/replicate
> > 25: option node-uuid 9b103ea8-d248-44fc-8f80-3e87f7c4971c
> > 26: option background-self-heal-count 0
> > 27: option metadata-self-heal on
> > 28: option data-self-heal on
> > 29: option entry-self-heal on
> > 30: option self-heal-daemon enable
> > 31: option iam-self-heal-daemon yes
> > 32: subvolumes data02-client-0 data02-client-1
> > 33: end-volume
> > 34:
> > 35: volume glustershd
> > 36: type debug/io-stats
> > 37: subvolumes data02-replicate-0
> > 38: end-volume
> > 39:
> >
> +------------------------------------------------------------------------------+
> > [2016-01-29 05:03:59.348913] E [MSGID: 114058]
> > [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-1:
> > failed to get the port number for remote subvolume. Please run
> 'gluster
> > volume status' on server to see if brick process
> > is running.
> > [2016-01-29 05:03:59.348960] I [MSGID: 114018]
> > [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from
> > data02-client-1. Client process will keep trying to connect to
> glusterd
> > until brick's port is available
> > [2016-01-29 05:03:59.436909] E [MSGID: 114058]
> > [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-0:
> > failed to get the port number for remote subvolume. Please run
> 'gluster
> > volume status' on server to see if brick process
> > is running.
> > [2016-01-29 05:03:59.436974] I [MSGID: 114018]
> > [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from
> > data02-client-0. Client process will keep trying to connect to
> glusterd
> > until brick's port is available
> > [2016-01-29 05:03:59.436991] E [MSGID: 108006]
> > [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All
> subvolumes are
> > down. Going offline until atleast one of them comes back up.
> > [2016-01-29 05:04:02.886317] I [rpc-clnt.c:1847:rpc_clnt_reconfig]
> > 0-data02-client-0: changing port to 49153 (from 0)
> > [2016-01-29 05:04:02.888761] I [rpc-clnt.c:1847:rpc_clnt_reconfig]
> > 0-data02-client-1: changing port to 49155 (from 0)
> > [2016-01-29 05:04:02.891105] I [MSGID: 114057]
> > [client-handshake.c:1437:select_server_supported_programs]
> > 0-data02-client-0: Using Program GlusterFS 3.3, Num (1298437),
> Version (330)
> > [2016-01-29 05:04:02.891360] I [MSGID: 114046]
> > [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-0:
> > Connected to data02-client-0, attached to remote volume
> > '/export/data/brick02'.
> > [2016-01-29 05:04:02.891373] I [MSGID: 114047]
> > [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-0:
> Server
> > and Client lk-version numbers are not same, reopening the fds
> > [2016-01-29 05:04:02.891403] I [MSGID: 108005]
> > [afr-common.c:3841:afr_notify] 0-data02-replicate-0: Subvolume
> > 'data02-client-0' came back up; going online.
> > [2016-01-29 05:04:02.891518] I [MSGID: 114035]
> > [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-0:
> > Server lk version = 1
> > [2016-01-29 05:04:02.893074] I [MSGID: 114057]
> > [client-handshake.c:1437:select_server_supported_programs]
> > 0-data02-client-1: Using Program GlusterFS 3.3, Num (1298437),
> Version (330)
> > [2016-01-29 05:04:02.893251] I [MSGID: 114046]
> > [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-1:
> > Connected to data02-client-1, attached to remote volume
> > '/export/data/brick02'.
> > [2016-01-29 05:04:02.893276] I [MSGID: 114047]
> > [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-1:
> Server
> > and Client lk-version numbers are not same, reopening the fds
> > [2016-01-29 05:04:02.893401] I [MSGID: 114035]
> > [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-1:
> > Server lk version = 1
> >
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxxx
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> >
>
>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel